ANALYSIS OF AND TECHNIQUES FOR PRIVACY PRESERVING DATA MINING

by

Songtao Guo

A dissertation submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Technology

Charlotte, 2007

Approved by: Dr. Yuliang Zheng, Dr. Xintao Wu, Dr. Zbigniew Ras, Dr. Zongwu Cai, Dr. Arun Ravindran

© 2007 Songtao Guo. ALL RIGHTS RESERVED.

ABSTRACT

Songtao Guo. Analysis of and techniques in privacy preserving data mining. (Under the direction of Dr. Yuliang Zheng and Dr. Xintao Wu)

Privacy is often considered a social, moral, or legal concept. As the Internet and e-commerce have prospered, privacy has become one of the most important issues in IT and has received increasing attention from enterprises, consumers, and legislators. Although various techniques, such as randomization-based methods, cryptography-based methods, and database inference control, have been developed, many key problems remain open in this area. In particular, new privacy and security issues have been identified, and the scope of privacy has been expanded. An essential problem in this context is the tradeoff between data utility and disclosure risk. Since previous research conducted only empirical evaluations or limited analysis of existing randomization techniques, a more solid theoretical analysis is needed.

This dissertation investigates different perturbation models in randomization-based privacy preserving data mining. Among them, the additive-noise-based model and the projection-based model are the primary tools. For the additive-noise-based perturbation, the explicit relation between noise and mining accuracy has not been carefully studied. We first propose an improved strategy to reconstruct the data based on the representative method. Then we develop explicit bounds on the reconstruction error.
Both the upper bound and the lower bound provide a guideline for balancing the privacy/accuracy tradeoff. We also discuss other potential threats to privacy based on our defined measure for quantifying privacy. For the projection-based perturbation, the properties of different models and the possible disclosures within those models are analyzed in detail. In particular, we propose an A-priori Knowledge-based ICA attack (AK-ICA) which is effective against all the existing projection models.

Due to the vulnerabilities in previous randomization models, a general-location-model-based approach is proposed. It first builds a statistical model to fit the real data with both categorical and numerical types of variables, then generates a synthetic data set for mining by tuning the parameters of the model instead of perturbing particular individual values. Since the search space of the model's parameters is much smaller than that of the data, and all information which attackers can derive is contained in those parameters, this approach is expected to be more effective and efficient. This dissertation investigates privacy issues of the numerical data in this model, wherein the disclosure is analyzed and controlled in different scenarios.

ACKNOWLEDGMENTS

Along the way to this Ph.D. degree over the past years, many people have contributed, in many different ways, making my success a part of their own. I am excited to have reached this moment, the moment when I can publicly express my thanks to all of them. First of all, I would like to thank my advisors, Dr. Yuliang Zheng and Dr. Xintao Wu. I am fortunate to have been taken on as their student. It was they who gave me great support, understanding, and encouragement during my study. They taught me to be proactive in thinking, in learning, and in living an integrated life. I wish to express my deepest gratitude to Dr. Wu for his continuous guidance and support in my research. Without his intellectual input, I would not have completed my doctoral research.
Thanks are also due to Dr. Zbigniew Ras, Dr. Zongwu Cai, and Dr. Arun Ravindran for serving as my committee members and giving me precious advice. They were very supportive during my qualifying exam and dissertation proposal. In addition to my committee, I am thankful to my co-author, Dr. Yingjiu Li from Singapore Management University, for the fruitful collaboration during my research. True gratitude also goes to all those at the KDD Laboratory and the Laboratory of Information and Infrastructure Security, past and present, for their friendship and camaraderie. In particular, I thank Jing Jing, Ling Guo, Hangzai Luo, Yuli Gao, Dichao Peng, Peng Tang, Yong Ye and Xiaowei Ying for the mind-sparking discussions and suggestions at different phases of this dissertation. I thank my dear friend Gao Zhang for introducing me to this school and sharing memorable experiences with me over the years. I have been blessed with some great friends who have always been there to share in difficult and joyous occasions: Xiaobin You, Peiqin Zhang, Shan Xie, Guodong Jiao, Wujian Xue, Zixian Wang, Xiaoran Wu, Qiang Shi, Yunfeng Sui, Alex Xiao, Su Dong, and Dingxiang Liu. I also extend my thanks to Jane and Wayne, who welcomed me into their home and treated me like their own child. I feel most indebted to my parents, who gave me the most important education in life. I thank my sister, Kelan, for her unwavering support and prayers. They always stood by me whenever I needed them most. Without their love and support from thousands of miles away, I would not have been able to come this far. I dedicate this dissertation to my family.

My research was supported by U.S. NSF Grant IIS-0546027 and NSF Grant CCR-0310974. I was also supported by the Department of Software and Information Systems.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
  1.1 Motivation
  1.2 Research Statement
  1.3 Dissertation Contributions
  1.4 Dissertation Organization
CHAPTER 2: BACKGROUND AND RESEARCH ISSUES
  2.1 Privacy Preserving Data Mining and its Applications
    2.1.1 Secure Multi-Party Computation
    2.1.2 Data Randomization
      Additive-Noise-Based Perturbation
      Projection-Based Perturbation
      Randomized Response
    2.1.3 Data Imputation and Synthesis
      Data Swapping
      Data Suppression
      Data Synthesis
  2.2 Research Issues in Preserving Privacy for Numerical Data
    2.2.1 Issues in Additive-Noise-Based Perturbation
    2.2.2 Issues in Projective-Transformation-Based Perturbation
    2.2.3 Issues in Model-Based Privacy Preserving Data Mining
  2.3 Summary
CHAPTER 3: DISCLOSURE ANALYSIS OF THE ADDITIVE-NOISE-BASED PERTURBATION
  3.1 Additive-Noise-Based Perturbation Model
  3.2 Data Reconstruction Attacks
    3.2.1 Spectral Filtering Method
    3.2.2 PCA-Based Reconstruction Method
    3.2.3 MLE-Based Reconstruction Method
    3.2.4 Privacy Issues
  3.3 An Improved Strategy for Noise Filtering
  3.4 Upper Bound Analysis
  3.5 Lower Bound Analysis
    3.5.1 SVD-Based Reconstruction Method
    3.5.2 Lower Bound
    3.5.3 Equivalence of Two Reconstruction Methods
  3.6 Potential Attack Based on Distribution
    3.6.1 Quantification of Privacy
    3.6.2 Extension to Multiple Confidential Attributes
  3.7 Evaluation
    3.7.1 Scenario of Adding Noise
    3.7.2 Effect of Varying the Number of Principal Components
    3.7.3 Effect of Varying Noise
    3.7.4 Effect of Covariance Matrix of the Noise
    3.7.5 Utility
    3.7.6 Lower Bound vs. Privacy Threshold
    3.7.7 Evaluation of IQR Attack
  3.8 Summary
CHAPTER 4: DISCLOSURE ANALYSIS OF THE PROJECTION-BASED PERTURBATION
  4.1 Projection-Based Perturbation Models
    4.1.1 Distance-Preserving-Based Projection
    4.1.2 Non-Distance-Preserving-Based Projection
    4.1.3 The General-Linear-Transformation-Based Perturbation
  4.2 Direct Attack
    4.2.1 ICA Revisited
    4.2.2 Drawbacks of Direct ICA
  4.3 Sample-Based Attack
    4.3.1 Attacks for Distance-Preserving-Based Projection
      Known-Sample-Based Regression Attack
      Known-Sample-Based PCA Attack
    4.3.2 Attacks for Non-Distance-Preserving-Based Projection
      AK-ICA Attack
      Existence of Transformation Matrix J
      Determining J
    4.3.3 Attacks for General Projection
  4.4 Evaluation
    4.4.1 Effect of Noise and the Transformation Matrix
    4.4.2 Effect of the Sample Size
    4.4.3 Comparing AK-ICA and Known-Sample-Based PCA Attack
    4.4.4 Comparing AK-ICA and Spectral-Filtering-Based Attack
  4.5 Summary
CHAPTER 5: DISCLOSURE ANALYSIS OF THE MODEL-BASED PRIVACY PRESERVING APPROACH
  5.1 The General Location Model Revisited
  5.2 Disclosure Controls for Numerical Data
    5.2.1 Basic Disclosure Scenario
    5.2.2 Conditional Scenario
    5.2.3 Combination Scenario
  5.3 Summary
CHAPTER 6: CONCLUSIONS AND FUTURE WORK
  6.1 Summary
  6.2 Contributions
  6.3 Future Research
REFERENCES

LIST OF TABLES

TABLE 1.1: Personal information of n customers
TABLE 3.1: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the PATTERNS data set. Values marked * follow Strategy 2, while values marked † follow Strategy 1. Bold values indicate the best estimations achieved by the Spectral Filtering technique.
TABLE 3.2: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the ADULT data set
TABLE 3.3: Utility of reconstructed Adult data with Type 1 noise
TABLE 3.4: Stock/bonds from Bank data set with Uniform noise [-125,125]; disclosure with 95% IQR; information loss for AS is 14.6%
TABLE 3.5: Sinusoidal with Gaussian noise (0,8) using AS and SF methods
TABLE 4.1: Reconstruction error vs. SNR for four cases when k = 1000
TABLE 4.2: Reconstruction error vs. sample size (k) when Y = RX
TABLE 4.3: Reconstruction error of AK-ICA vs. PCA attacks by varying R
TABLE 4.4: Reconstruction error vs. SNR for the spectral filtering method when Y = X + E

LIST OF FIGURES

FIGURE 3.1: Distribution Reconstruction Algorithm
FIGURE 3.2: Spectral Filtering Process
FIGURE 3.3: PCA-Based Reconstruction Method
FIGURE 3.4: MLE-Based Reconstruction Method
FIGURE 3.5: SVD-Based Reconstruction Algorithm
FIGURE 3.6: Reconstruction accuracy (data distribution for attribute 2) vs. varying k with σ² = 0.5
FIGURE 3.7: Reconstruction accuracy (point-wise data distribution for attribute 2) with best k vs. varying noise magnitude
FIGURE 3.8: Reconstruction accuracy (data distribution for attribute 2) with ‖E‖ = 323 under three cases
FIGURE 3.9: Utility vs. varying noise with Type 1
FIGURE 3.10: Utility vs. varying noises of three types
FIGURE 3.11: Achieved reconstruction accuracy vs. varying privacy threshold τ
FIGURE 3.12: Reconstructed stock/bonds from Bank data set using the AS algorithm; the noise is Uniform distribution [-125,125]
FIGURE 3.13: Disclosure of Bank distribution with Uniform noise (AS algorithm)
FIGURE 3.14: Disclosure analysis on Sinusoidal with Gaussian noise (0,8) using AS and SF methods
FIGURE 3.15: Stock/Bonds of Bank data set perturbed using Uniform distribution
FIGURE 4.1: Example of rotation-based perturbation
FIGURE 4.2: Known-Sample-Based PCA Attack
FIGURE 4.3: AK-ICA Attack
FIGURE 4.4: Distribution of component
FIGURE 4.5: The effect of noise E for RE
FIGURE 4.6: Reconstruction error vs. varying known sample size k under Y = RX
FIGURE 4.7: Reconstruction error vs. random samples with fixed size k = 50
FIGURE 4.8: Reconstruction error of AK-ICA vs. PCA attacks by varying R
FIGURE 4.9: Reconstruction error vs. SNR for SF and AK-ICA (with fixed size k = 1000) when Y = X + E
FIGURE 5.1: A constant density contour for a bi-variate normal distribution
FIGURE 5.2: Confidence Intervals
FIGURE 5.3: Density contour with varied covariance matrix

CHAPTER 1: INTRODUCTION

1.1 Motivation

With the advance of the information age, data collection and data analysis have exploded both in size and complexity. The attempt to extract important patterns and trends from vast data sets has led to a challenging field called data mining. When a complete data set is available, various statistical, machine learning, and data mining techniques can be applied to analyze the data. Sensitive data usually includes information regarding individuals' physical or mental health, financial privacy, etc.

In the third-party context, a single party (the data holder) holds a collection of original individual data with privacy concerns. The data holder can utilize or release data to a third party for analysis; however, it is required not to disclose any private information. For example, a company collects its employees' personal data (e.g., income, age, etc.) and needs to release this data set to a third party for analysis. Since each employee is concerned about the privacy of his or her personal data, the company should find ways to release the data while guaranteeing that no private individual information can be derived by attackers or snoopers.

Another context involves end data providers as clients and a data collector as the server. The end data providers would like to share their data for analysis; however, preserving their privacy is equally important. The server mainly aims to extract patterns from the data or from the distribution of the data. The aggregate information learned from the collection might be rich enough for its mining tasks. During the process of data collection, the individual data shall be randomized before it reaches the server.
By combining the parameters of the randomization with the perturbed data, aggregate statistical properties of the original data can be derived to contribute to mining tasks.

In the third context, data are distributed across different sites. Traditionally, data warehousing approaches can be used to mine distributed databases. This requires that data from all the participating sites be collected at a centralized warehouse. However, many data owners may be reluctant to share their data with others due to privacy and confidentiality concerns. This is a serious impediment to performing mutually beneficial data mining tasks.

Privacy-Preserving Data Mining (PPDM) has emerged to address the issues in the above contexts. Research on PPDM aims at bridging the gap between collaborative data mining and data confidentiality. It involves many areas, such as statistics, computer science, and the social sciences, and it is of fundamental importance to homeland security, modern science, and our society in general.

Table 1.1: Personal information of n customers

ID  SSN  Name  Zip    Race   ···  Age  Gender  Balance ($1,000)  Income ($1,000)  ···  Interest Paid ($1,000)
1   ***  ***   28223  Asian  ···  20   M       10                85               ···  2
2   ***  ***   28223  Asian  ···  30   F       15                70               ···  18
3   ***  ***   28262  Black  ···  20   M       50                120              ···  35
4   ***  ***   28261  White  ···  26   M       45                23               ···  134
··· ···  ···   ···    ···    ···  ···  ···     ···               ···              ···  ···
n   ***  ***   28223  Asian  ···  20   M       80                110              ···  15

Table 1.1 provides an example of n customers' original personal information, which includes various attributes. Disclosures that can occur as a result of inferences by snoopers fall into two classes: identity disclosure and value disclosure. Identity disclosure relates to the disclosure of the identities of individuals in the database, while value disclosure relates to the disclosure of the value of a certain confidential attribute of those individuals.
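As a toy illustration of preventing identity disclosure before release (a hedged sketch, not a method from this dissertation; the record fields are modeled on Table 1.1 and the masking rules are illustrative assumptions), direct identifiers can be suppressed outright while linkable attributes such as Zip are generalized:

```python
# Illustrative sketch: suppressing direct identifiers and generalizing a
# linkable attribute before data release. Field names follow Table 1.1;
# the zip-prefix rule is a hypothetical generalization policy.

def mask_record(rec, zip_digits=3):
    """Suppress direct identifiers and generalize the Zip attribute."""
    out = dict(rec)
    out["SSN"] = "***"    # direct identifier: fully suppressed
    out["Name"] = "***"   # direct identifier: fully suppressed
    # generalization: keep only a Zip prefix so the record is harder to
    # link against publicly available data sets
    out["Zip"] = rec["Zip"][:zip_digits] + "*" * (len(rec["Zip"]) - zip_digits)
    return out

record = {"SSN": "123-45-6789", "Name": "Alice", "Zip": "28223",
          "Race": "Asian", "Age": 20, "Gender": "M"}
print(mask_record(record))  # Zip becomes "282**"; SSN and Name are suppressed
```

Note that masking alone is not sufficient: as the next paragraph explains, the remaining quasi-identifiers may still allow linkage, which is what k-anonymity and related methods address.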
There is no doubt that identity attributes, such as SSN and Name, should be masked to protect privacy before the data is released. However, some categorical attributes, such as Zip, Race, and Gender, can also be used to identify individuals by linking them to some publicly available data set. These attributes are hence called quasi-identifiers [Samarati 2001]. There has been much research on how to prevent identity disclosure, such as the well-known statistical disclosure control (SDC) methods [Adam and Wortman 1989; Malvestuto, Moscarini and Rafanelli 1991; Domingo-Ferrer and Mateo-Sanz 2002; Domingo-Ferrer and Torra 2003], k-anonymity [Samarati and Sweeney 1998; Samarati 2001; Sweeney 2002; LeFevre, DeWitt and Ramakrishnan 2006], ℓ-diversity [Machanavajjhala et al. 2006], and t-closeness [Li, Li and Venkatasubramanian 2007]. To prevent value disclosures, various randomization-based approaches (e.g., [Agrawal and Srikant 2000; Palley and Simonoff 1987; Sarathy and Muralidhar 2002; Rizvi and Haritsa 2002; Du and Zhan 2003; Oliveira and Zaiane 2004; Chen and Liu 2005; Liu, Kargupta and Ryan 2006]) have been investigated.

1.2 Research Statement

After data miners collect large amounts of private data from data providers, the data might be perturbed in different ways in order to avoid privacy disclosure while keeping useful patterns for further data mining. The focus of this dissertation is to use formal methods to analyze various perturbation models and to explore the balance between data utility and disclosure risk in privacy preserving data mining. Specifically, my dissertation aims to

1. analyze the accuracy of estimations for various perturbation models;
2. explore potential attacks on existing models and evaluate their performance;
3. design models to control privacy and data utility for privacy preserving applications.
1.3 Dissertation Contributions

As parts of a novel framework for privacy preserving data mining, the main contributions of my research are summarized as follows:

1. Bound analysis of the accuracy of value reconstruction techniques. In particular, we first derive an upper bound for the Frobenius norm of the reconstruction error using matrix perturbation theory. This upper bound may be exploited by attackers using the spectral-filtering-based method to determine how close their estimates are to the original data, which poses a serious threat of privacy breaches. We then derive a lower bound for the reconstruction error, which can help data owners determine how much noise should be added to satisfy a given threshold of tolerated privacy breach. Besides, an improved data reconstruction strategy for noise filtering is also given. In the context of the additive-noise-based perturbation, we develop a new strategy that compares the benefit of including a component against the loss due to the additional projected noise. We show that this strategy is expected to give an approximately optimal reconstruction from the perturbed data.

2. An effective attacking method to break projection-based perturbation. By incorporating a known small subset of the original data, which is reasonable in practice, our algorithm, AK-ICA, can effectively estimate the whole original data set with high accuracy. Another nice property of this attacking method is its robustness to arbitrary projection-based perturbation: all the previous perturbation methods in this context are vulnerable to our attack. Therefore, current projection-based privacy preserving data mining techniques may need careful scrutiny in order to prevent privacy breaches when a subset of sample data is available.

3. A measure to quantify individual privacy disclosure.
We propose a way to measure how close the inter-quantile range obtained by attackers or snoopers is to an individual's privacy interval for a particular sensitive variable. We also extend this measure to the multivariate case based on the confidential region.

4. Disclosure control methods for various scenarios in model-based privacy preserving data mining. General databases typically contain numerous attributes with different privacy concerns. To satisfy different privacy requirements from data providers, we analyze potential privacy disclosures in several scenarios and find ways to adjust the parameters of the model learned from the underlying data.

1.4 Dissertation Organization

The rest of this dissertation is organized as follows. In Chapter 2, current research on privacy preserving data mining is briefly reviewed. Various models in randomization-based privacy preserving data mining, including additive-noise-based perturbation, projection-based perturbation, randomized response, and model-based perturbation, are introduced, and research issues within those models are outlined.

In Chapter 3, the additive-noise-based randomization techniques are analyzed. To preserve individual privacy, this model perturbs the data by introducing additive noise. The chapter first introduces how to learn distributions from the randomized data, followed by various data mining algorithms which are used to reconstruct the individual data and therefore act as potential threats to privacy. Then it answers three important questions by carefully analyzing a representative data reconstruction method, the spectral-filtering-based method: What is the best strategy to reconstruct the original data based on this method? What is the upper bound of the reconstruction error? What is the lower bound of the reconstruction error? As another potential threat to privacy in this model, a further attacking method, the IQR-based attack, is also proposed in this chapter.
In Chapter 4, the projection-based randomization technique is analyzed. Different from additive-noise-based randomization, the projection-based approach randomizes the original data through a linear transformation, which is, in form, a projection matrix applied to the original data matrix. Two typical types of projections and their properties are introduced at the beginning of the chapter: distance-preserving-based and non-distance-preserving-based projections. Vulnerabilities of the different projection models are then discussed and evaluated, with sections on the direct attack and sample-based attacks. In particular, this chapter offers an attacking method, called A-priori-Knowledge ICA (AK-ICA), which is effective against all the projection-based randomization models.

In Chapter 5, a General-Location-Model-based approach to privacy preserving data mining is proposed. The General Location Model acts as an efficient tool to model real-life databases for privacy preserving applications. For the numerical data in this model, how to analyze and control privacy is discussed in detail in three different scenarios.

Chapter 6 concludes this dissertation with a brief summary of the research presented and offers future directions.

CHAPTER 2: BACKGROUND AND RESEARCH ISSUES

Perfect privacy can be achieved without sharing any data, but it offers no utility; perfect utility can be provided by publishing the exact data collected from our lives, but it sacrifices privacy. The "inevitable conflict between the individual's right to privacy and the society's need to know and process information" has been addressed since the 1970s in the database and statistics communities [Chang and Moskowitz 2000; Duncan and Mukherjee 2000; Evans, Zayatz and Slanta 1998; Fienberg, Makov and Steele 1998; Gouweleeuw et al. 1998; Mukherjee and Duncan 1997].
In recent decades, significant efforts have been spent on regulation [Congress 1999; Commission 1998b; Congress 1996; Commission 1998a], privacy policy description [Karjoth and Schunter 2002; Fischer-Hübner 2001; Backes, Pfitzmann and Schunter 2003; W3C 2002], and implementation [Ashley et al. 2002; Karjoth, Schunter and Waidner 2002]. An increasing number of enterprises make privacy promises to meet customer demand or to implement privacy regulations. The privacy issues in data mining began to be investigated in the late 1990s. Over the past several years, a growing number of successful techniques have been proposed in the literature to obtain valid data mining results while preserving privacy at different levels. This chapter reviews the existing privacy preserving data mining techniques and outlines the important research issues which are addressed in this dissertation.

2.1 Privacy Preserving Data Mining and its Applications

We classify representative privacy preserving data mining techniques into several categories. Generally, there are three approaches: Secure Multi-Party Computation (SMC), Data Randomization, and Data Imputation and Synthesis.

2.1.1 Secure Multi-Party Computation

Secure Multi-Party Computation (SMC) is a technique addressing the problem of computing a joint function based on multiple inputs. Each party in a distributed environment holds one part of the private input. SMC ensures that no more information is disclosed to a party than the output of the joint function and its own share of the input. The problem of SMC was first formulated by Yao [Yao 1982] and extended by Goldreich et al. [Goldreich, Micali and Wigderson 1987], among many others. In the ideal model, all parties send their inputs to a trusted third party (TTP), who then performs the computations and delivers only the results to the other parties.
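In contrast to the TTP-based ideal model, SMC protocols let the parties compute the joint result themselves. A classic toy illustration of the idea, assuming honest-but-curious parties, is a secure sum via additive secret sharing (this specific protocol is not described in the text; it is only a sketch of the concept):

```python
import random

# Toy secure-sum protocol, a standard illustration of the SMC idea (an
# assumption for exposition, not a protocol defined in this dissertation).
# Each party splits its private input into random additive shares modulo M,
# so no single party learns another's value, yet the exact total is recovered.

M = 10**9  # all arithmetic is modulo M

def secure_sum(private_inputs):
    n = len(private_inputs)
    all_shares = []
    for x in private_inputs:
        # n - 1 uniformly random shares plus one correcting share
        shares = [random.randrange(M) for _ in range(n - 1)]
        shares.append((x - sum(shares)) % M)  # shares sum to x mod M
        all_shares.append(shares)
    # party j only ever sees the j-th share from each other party
    partials = [sum(all_shares[i][j] for i in range(n)) % M for j in range(n)]
    return sum(partials) % M  # combining partials reveals only the total

print(secure_sum([12, 30, 7]))  # → 49
```

Each individual share is uniformly random, which is why a single party's view leaks nothing beyond the final sum, matching the SMC-Privacy property described below.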
In the semi-honest model, the adversary correctly follows the protocol, except that it attempts to learn additional information by analyzing all the intermediate computations. In the malicious model, the adversary may arbitrarily deviate from the protocol specification (e.g., by aborting or suspending computation). Assume that F is a well-known deterministic function, and denote an F-result as a result computed by F, r = F(x1, x2, ..., xn). According to [Benenson, Freiling and Kesdogan 2005], a protocol solves secure multi-party computation (SMC) if it has the following properties:

1. (SMC-Validity) If a process receives an F-result, then F was computed with at least the inputs of all correct processes.
2. (SMC-Agreement) If some process pi receives F-result ri and some process pj receives F-result rj, then ri = rj.
3. (SMC-Termination) Every correct process eventually receives an F-result.
4. (SMC-Privacy) Faulty processes learn nothing about the input values of correct processes (apart from what is given away by the result r and the input values of all faulty processes).

Several SMC-based privacy-preserving data mining schemes have been proposed [Lindell and Pinkas 2002; Pinkas 2002; Vaidya and Clifton 2002; Vaidya and Clifton 2003; Clifton et al. 2003]. Lindell and Pinkas [Lindell and Pinkas 2002] introduced SMC for classification over horizontally partitioned data using the ID3 algorithm. Vaidya and Clifton proposed solutions to the clustering problem [Vaidya and Clifton 2003] and the association rule mining problem [Vaidya and Clifton 2002] for vertically partitioned data. Several SMC tools and fundamental techniques have also been proposed in the literature [Pinkas 2002; Clifton et al. 2003]. More schemes were presented in recent conferences, as follows. Wright et al. [Wright and Yang 2004] and Meng et al. [Meng, Sivakumar and Kargupta 2004] used SMC to solve privacy-preserving Bayesian network problems. Gilburd et al.
proposed a new privacy model, k-privacy, for real-world large-scale distributed systems [Gilburd, Schuster and Wolff 2004]. Sanil et al. described a privacy-preserving algorithm for computing regression coefficients [Sanil et al. 2004]. Du et al. developed building blocks to solve secure two-party multivariate linear regression and classification problems [Du, Han and Chen 2004]. Wang et al. used an iterative bottom-up generalization to generate data which remains useful for classification but makes it difficult to disclose private sources [Wang, Yu and Chakraborty 2004]. SMC provides a good research framework for conducting computations among multiple parties while maintaining the privacy of each party's input. However, all of the known methods for secure multi-party computation rely on the use of a circuit to simulate the particular function, which becomes the efficiency bottleneck. Even with some improvements [Gennaro, Rabin and Rabin 1998], the computational costs for problems of interest remain high, and the impact on real-world applications has been negligible.

2.1.2 Data Randomization

In the randomization approach, random noise is added or transformations are applied to the original data, and only the disguised data are shared [Agrawal and Srikant 2000; Agrawal and Agrawal 2001; Rizvi and Haritsa 2002; Evfimievski et al. 2002; Du and Zhan 2003]. Representative randomization methods include additive-noise-based perturbation, projection-based perturbation, and the Randomized Response scheme.

Additive-Noise-Based Perturbation

Agrawal and Srikant proposed a scheme for privacy-preserving data mining using random perturbation [Agrawal and Srikant 2000]. In their randomization scheme, a random number is added to the value of a sensitive attribute. For example, if xi is the value of a sensitive attribute, xi + ri, rather than xi, will appear in the database, where ri is a random value drawn from some distribution.
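A minimal sketch of this additive scheme, using NumPy and assuming Gaussian noise whose variance is known to the data miner (the specific distributions and numbers are illustrative, not taken from the cited papers):

```python
import numpy as np

# Minimal sketch of additive-noise perturbation (notation follows the text:
# the released value is x_i + r_i). With the noise distribution known,
# simple aggregates of the original data can be recovered from the
# perturbed data; the Gaussian choice and parameters are assumptions.

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=10_000)          # original sensitive values
sigma_noise = 20.0
r = rng.normal(0, sigma_noise, size=x.size)  # additive noise, known variance
y = x + r                                    # what is actually released

# aggregate reconstruction: E[y] = E[x] and Var(y) = Var(x) + sigma_noise^2
est_mean = y.mean()
est_var = y.var() - sigma_noise**2
print(round(est_mean, 1), round(est_var, 1))  # close to 50 and 100
```

Recovering the full original *distribution* (not just moments) from such data is what the Bayesian and EM reconstruction algorithms cited below address.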
It has been shown that, given the distribution of the random noise, recovering the distribution of the original data is possible. The authors of [Agrawal and Agrawal 2001] solved the above problem with an Expectation Maximization (EM) estimation algorithm [Dempster, Laird and Rubin 1977; G. J. McLachlan 1998], which has better convergence properties. They showed that their estimate converges to the maximum likelihood estimate (MLE). Wu further optimized the computation of the reconstruction algorithm using a signal processing approach [Wu 2003]. Tan et al. [Tan and Ng 2007] proposed a two-step, non-iterative distribution reconstruction algorithm based on Parzen-window reconstruction [Parzen 1962] and quadratic programming over a convex set [Kozlov, Tarasov and Khachian 1979]. The algorithm avoids the cost of the iterations in EM and is also proven to be generic for the many randomization models that satisfy a given form. Randomization techniques have been used for a variety of privacy preserving data mining tasks [Agrawal and Agrawal 2001; Rizvi and Haritsa 2002; Evfimievski et al. 2002; Du and Zhan 2003]. Under this scheme, Evfimievski et al. proposed an approach to conduct privacy-preserving association rule mining [Evfimievski et al. 2002]. Kargupta et al. challenged the randomization schemes and pointed out that randomization might not be secure [Kargupta et al. 2003]. They also proposed a random-matrix-based Spectral Filtering (SF) technique to recover the original data from the perturbed data. Huang et al. further proposed two other data reconstruction methods, PCA-DR and MLE-DR, in [Huang, Du and Chen 2005]. The former is based on Principal Component Analysis (PCA), while the latter uses Maximum Likelihood Estimation (MLE). Their results showed that the recovered data can be reasonably close to the original data.
However, the best strategy attackers might choose, and how close the recovered data might be to the original, are not addressed in their work. Motivated by the spectral-filtering-based method, Guo et al. [Guo, Wu and Li 2006b] improved the spectrum selection strategy to achieve optimal performance. They also theoretically analyzed the spectral-filtering-based method and bounded the reconstruction error, which is meaningful to both data miners and attackers [Guo and Wu 2006; Guo, Wu and Li 2006b]. Guo et al. further challenged the additive-noise-based model by proposing an IQR attack in [Guo, Wu and Li 2006a]. According to their privacy quantification, individual privacy may be threatened by the estimated distribution. All the above results indicate that, for certain types of data, additive-noise-based randomization might not preserve as much privacy as expected.

Projection-Based Perturbation

The projection-based perturbation model can be described by

Y = RX    (2.1)

where X ∈ R^{p×n} is the original data set consisting of n data records and p attributes, Y ∈ R^{q×n} is the transformed data set consisting of n data records and q attributes, and R is a q × p transformation matrix. In this study, we shall assume q = p = d for convenience. In [Chen and Liu 2005], the authors defined a rotation-based perturbation method, where the transformation matrix R is a d × d orthonormal matrix satisfying R^T R = RR^T = I. The key feature of a rotation transformation is that it preserves vector lengths, Euclidean distances and inner products between any pair of points. Intuitively, rotation preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space. It was proved in [Chen and Liu 2005] that three popular classifiers (kernel methods, SVM, and hyperplane-based classifiers) are invariant to the rotation-based perturbation.
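A minimal sketch of such a rotation perturbation follows. This is our own illustration, not code from [Chen and Liu 2005]: the dimensions are arbitrary, and the random orthonormal R is obtained by QR-factorizing a Gaussian matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200

# Original data: d attributes (rows) by n records (columns), matching Y = R X.
X = rng.normal(size=(d, n))

# A random d x d orthonormal matrix via QR factorization of a Gaussian matrix.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose(R.T @ R, np.eye(d))   # R^T R = R R^T = I

Y = R @ X

# Rotation preserves vector lengths, pairwise distances and inner
# products exactly, not just in expectation.
print(np.allclose(np.linalg.norm(X, axis=0), np.linalg.norm(Y, axis=0)))
print(np.allclose(X.T @ X, Y.T @ Y))     # all pairwise inner products preserved
```

The exact preservation of the Gram matrix X^T X is precisely why distance-based and inner-product-based classifiers are invariant under this perturbation, and also why, as discussed below, geometric matching attacks become possible.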
Previously, the authors of [Oliveira and Zaiane 2004] defined another rotation-based data perturbation function that distorts the attribute values of a given data matrix to preserve the privacy of individuals. The perturbation matrix R they used is an orthonormal matrix when there is an even number of attributes. If there is an odd number of attributes, according to their scheme, the remaining one is distorted along with any previously distorted attribute, as long as some condition is satisfied. Observing vulnerabilities of the above distance-preserving projection, Liu et al. [Liu, Giannella and Kargupta 2006] discussed possible attacks, including a known-input attack based on linear regression and a known-sample attack based on principal component analysis. Liu et al. [Liu, Kargupta and Ryan 2006] further proposed a random-projection-based multiplicative perturbation scheme and applied it to privacy preserving distributed data mining. The random matrix R ∈ R^{k×m} is generated such that each entry r_{i,j} of R is independently and identically chosen from some normal distribution with mean zero and variance σ_r². Thus, the following properties of the projection matrix are achieved:

E[R^T R] = kσ_r² I,    E[RR^T] = mσ_r² I

If two data sets X1 and X2 are perturbed as Y1 = (1/(√k σ_r)) RX1 and Y2 = (1/(√k σ_r)) RX2 respectively, then the inner product of the original data sets is preserved from the statistical point of view:

E[Y1^T Y2] = X1^T X2

Randomized Response

Randomized Response (RR) is a technique originally developed in the statistics community to collect sensitive information from individuals in such a way that survey interviewers and those who process the data do not know which of two alternative questions the respondent has answered. Instead of asking an interviewee whether he/she belongs to a sensitive category A, the interviewer asks each interviewee two mutually exclusive questions:

1. Do you belong to the category A?
2. Do you belong to the category Ā?
The question to be answered is determined by a randomizing device. The probability of choosing the first question is θ, and the probability of choosing the opposite one is 1 − θ. Without knowing which question was answered, the interviewer can draw no conclusion from a collected response such as "yes" or "no". Since no one but the respondent knows to which question the answer pertains, the technique provides response confidentiality and increases respondents' willingness to answer sensitive questions. Assuming all interviewees tell the truth, we have

P("yes") = P(A) · θ + P(Ā) · (1 − θ) = P(A) · θ + (1 − P(A)) · (1 − θ)    (2.2)

If θ ≠ 1/2, the proportion of interviewees who actually belong to the category A can be estimated without access to the exact private information:

P(A) = (θ − 1)/(2θ − 1) + P("yes")/(2θ − 1)

Randomized Response was first proposed by Warner in 1965 [Warner 1965]. It is mainly used to deal with categorical data, and can be extended to estimate the distribution of numerical data. Other models and corresponding discussions for categorical and numerical data can be found in [Chaudhuri and Mukerjee 1988; Poole 1974; Duffy and Waterton 1984; Poole and Clayton 1982]. In the data mining community, Rizvi and Haritsa presented a scheme called MASK to mine associations with secrecy constraints [Rizvi and Haritsa 2002]. Du and Zhan proposed an approach to conduct privacy-preserving decision tree building [Du and Zhan 2003]. Guo et al. addressed the issue of providing accuracy in terms of various reconstructed measures (e.g., support, confidence, correlation, lift, etc.) in privacy preserving market basket data analysis [Guo, Guo and Wu 2007]. More specifically, they presented a general method based on the Taylor series to approximate the mean and variance of variables estimated from the randomized data. They also showed that the derived confidence ranges and the monotonic property of some measures are critical for rule selection.
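The estimation step can be simulated directly. In the sketch below, the population size, θ = 0.7 and the true proportion P(A) = 0.3 are made-up values of ours; the inversion formula is the one derived from Equation (2.2).

```python
import numpy as np

rng = np.random.default_rng(3)
n, theta, p_true = 200_000, 0.7, 0.3   # p_true = P(A), hidden from the interviewer

in_A = rng.random(n) < p_true          # each respondent's private attribute
ask_q1 = rng.random(n) < theta         # the randomizing device picks a question
# "yes" means: (asked Q1 and in A) or (asked Q2 and not in A)
yes = np.where(ask_q1, in_A, ~in_A)

p_yes = yes.mean()
# Invert P("yes") = P(A)*theta + (1 - P(A))*(1 - theta), valid for theta != 1/2:
p_hat = (theta - 1) / (2 * theta - 1) + p_yes / (2 * theta - 1)

print(abs(p_hat - p_true) < 0.01)      # close to the true proportion P(A)
```

No individual response reveals category membership, yet the aggregate proportion P(A) is recovered accurately; as θ approaches 1/2 the denominator 2θ − 1 shrinks and the estimate's variance grows.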
2.1.3 Data Imputation and Synthesis

In general databases, particular individual values or even the whole data set may be sensitive and might lead to identification disclosure. Anonymization can be achieved by suppressing individual values, swapping sensitive values, replacing certain attribute values with a general one, or even replacing the whole data set with a synthetic data set.

Data Swapping

This technique transforms the database by switching a subset of attributes between selected pairs of records, so that statistical properties such as the marginal distributions of individual attributes are preserved while data confidentiality is achieved. The technique was first proposed by the authors of [Dalenius and Reiss 1982]. A variety of refinements and applications [Fienberg and McIntyre 2003] of data swapping have been addressed since its initial appearance.

Data Suppression

As applied in Statistical Databases (SDBs), this technique suppresses those cells in released tables that might directly or indirectly disclose confidential information. Its early application by census bureaus for data publishing was studied in [Cox 1980; Sande 1983]. Thorough studies of this technique can be found in [Denning, Schlörer and Wehrle 1982; Özsoyoglu and Chung 1986]. Due to the high information loss caused by this technique and the complexity of queries in practice, its application to real-world databases has inevitable limitations.

Data Synthesis

In [Rubin 1993], the author first suggested a research effort to develop a technique for publishing synthetic data without releasing any actual individual value. Based on the success of multiple imputation [Rubin 1987], the published synthetic data set could be derived from distributions learnt from the actual data. Many difficult and complex modeling issues have been addressed [Kam and Ullman 1977; Papageorgiou et al. 2001; Reiter 2002; Reiter 2003; Raghunathan, Reiter and Rubin 2003].
In [Ramesh, Maniatty and Zaki 2003], the authors proposed a method to generate a market basket data set for benchmarking when the length distributions of frequent and maximal frequent itemset collections are available. Wu et al. [Wu, Wang and Zheng 2003; Wang, Wu and Zheng 2004; Wu et al. 2005a; Wu, Wang and Zheng 2005; Wu et al. 2005b] proposed a general framework for privacy preserving database application testing by generating synthetic data sets based on some a priori knowledge about the production databases. General a priori knowledge such as statistics and rules can also be taken as constraints on the underlying data records. In [Aggarwal and Yu 2004], the authors proposed a condensation approach which aims at preserving the covariance matrix for multiple columns. Unlike the randomization approach, it perturbs multiple columns as a whole to generate the perturbed data set. The authors argued that, because the perturbed data set preserves the covariance matrix, most existing data mining algorithms can be applied to it directly without redeveloping new ones.

2.2 Research Issues in Preserving Privacy for Numerical Data

This section highlights the main research issues which will be addressed in the remainder of this dissertation. These issues arise from the different privacy preserving models introduced in the previous section, with a focus on numerical data.

2.2.1 Issues in Additive-Noise-Based Perturbation

Consider a data set X with m records of n attributes, and a noise data set E with the same dimensions as X. The random value perturbation techniques generate a perturbed data matrix Y = X + E. Let X̂ denote the estimate which the users (or attackers) can achieve.
To preserve utility, certain aggregate characteristics (i.e., mean and covariance matrices for numerical data, or marginal totals in contingency tables for categorical data) of X should remain essentially unchanged in the perturbed data Y, or should be recoverable from the estimated data X̂. In other words, the distributions of X can be approximately reconstructed from the perturbed data Y when some a priori knowledge (e.g., distribution, statistics, etc.) about the noise E is available, using distribution reconstruction approaches (e.g., [Agrawal and Agrawal 2001], [Agrawal and Srikant 2000]). To preserve privacy, not only the difference between Y and X but also that between X̂ and X should be greater than some tolerated threshold. Here we follow the tradition of using this difference as the measure to quantify how much privacy is preserved. A key element in preserving privacy and confidentiality of sensitive data is the ability to evaluate the extent of all potential disclosure for released data. In other words, we need to answer to what extent confidential information in the perturbed data can be compromised by attackers. Hence, we should consider not only the perturbed data Y, which is released directly, but also the estimated data X̂, which attackers may obtain by exploiting various reconstruction methods. The methods investigated in [Agrawal and Agrawal 2001; Agrawal and Srikant 2000] focused only on how to reconstruct the distribution of the original data from the perturbed data; they did not consider that attackers may reconstruct the individual values through various means. The previous work in [Huang, Du and Chen 2005; Kargupta et al. 2003] exploited spectral properties of the data and showed that the noise may be separated from the perturbed data under some conditions, and as a result privacy could be seriously compromised.
Although they empirically assessed the effects of perturbation on the accuracy of the estimated individual values, one major question is what explicit relation may exist between reconstruction accuracy and the noise added. In other words, what bounds on reconstruction accuracy can be achieved by this spectral filtering technique? Other research issues include, but are not restricted to, developing the best strategy to reconstruct the data, quantifying privacy disclosure, and exploring other possible threats to the data owner's privacy.

2.2.2 Issues in Projective-Transformation-Based Perturbation

Distance-preserving projection has gained much attention in privacy-preserving data mining in recent years, since it mitigates the privacy/accuracy tradeoff by achieving perfect data mining accuracy. Meanwhile, its vulnerabilities are still of great interest to data owners and attackers. A general projection-based perturbation can be expressed as Y = RX, where R is a transformation matrix and X, Y are the input and output respectively. For the case where R^T R = RR^T = I, it seems that privacy is well preserved after rotation; however, a small known sample may be exploited by attackers to breach privacy completely. We assume that a small data sample from the same population as X is available to attackers, denoted X̃. When X ∩ X̃ = X‡ ≠ ∅, since many geometric properties (e.g., vector length, distance and inner product) are preserved, attackers can easily locate X‡'s corresponding part, Y‡, in the perturbed data set by comparing those values. From Y = RX, we know the same linear transformation holds between X‡ and Y‡: Y‡ = RX‡. Once the size of X‡ is at least rank(X) + 1, the transformation matrix R can easily be derived through linear regression. For the case where X‡ = ∅ or is too small, the authors of [Liu, Giannella and Kargupta 2006] proposed a Principal Component Analysis (PCA) based attack. The idea is briefly given as follows.
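Before turning to the PCA-based idea, the known-input attack just described can be sketched as follows. This is an illustrative toy example of ours, not code from [Liu, Giannella and Kargupta 2006]: the dimensions are arbitrary, the matching of known records to their images is assumed already done (here by shared indices), and the regression is solved with a pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 4, 500

X = rng.normal(size=(d, n))
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))   # the secret orthonormal transform
Y = R_true @ X                                      # the released data

# The attacker knows a few original records and, because rotation preserves
# lengths/distances/inner products, can match them to their images in Y.
# With enough known records (rank(X) + 1 suffices per the text), R is
# recovered by linear regression: solve R X_known = Y_known.
idx = np.arange(d + 1)
X_known, Y_known = X[:, idx], Y[:, idx]
R_hat = Y_known @ np.linalg.pinv(X_known)

print(np.allclose(R_hat, R_true, atol=1e-8))        # R recovered exactly
print(np.allclose(R_hat @ X, Y))                    # hence the whole X is exposed
```

Once R_hat is known, the attacker inverts it to expose every record, which is the complete privacy breach described above.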
Since the known sample and the private data share the same distribution, the eigenspaces (eigenvalues) of their covariance matrices are expected to be close to each other. As noted above, the transformation here is a geometric rotation, which does not change the shape of the distribution (i.e., the eigenvalues derived from the sample data are close to those derived from the transformed data). Hence, the rotation angles between the eigenspaces derived from the known samples and those derived from the transformed data can be easily identified; in other words, the rotation matrix R is recovered. We notice that all the above attacks apply only to the case in which the transformation matrix is orthonormal. In our general setting, the transformation matrix R can be any matrix (e.g., a shrink, stretch, or dimension reduction) rather than a simple orthonormal rotation matrix. When we try to apply the PCA attack in the non-isometric projection scenario, the eigenvalues derived from the sample data are not the same as those derived from the transformed data. Hence, we cannot derive the transformation matrix R from spectral analysis, and the previous PCA-based attack no longer works. Is individual privacy well protected by such a transformation? Is this kind of perturbation vulnerable to any other attacks? A more thorough disclosure analysis of this general scenario will be given in this dissertation.

2.2.3 Issues in Model-Based Privacy Preserving Data Mining

The issue of confidentiality and privacy in general databases has become increasingly prominent in recent years. A key element in preserving privacy and confidentiality of sensitive data is the ability to evaluate the extent of all potential disclosure for such data. In other words, we need to be able to answer to what extent confidential information in a perturbed or transformed database can be compromised by attackers or snoopers. This is a major challenge for current randomization-based approaches.
To evaluate the privacy and confidentiality residing in general databases which contain both categorical and numerical attributes, the authors of [Wu, Wang and Zheng 2005] proposed a general framework for modeling general databases using the General Location model. One advantage of the general location model is that, since it integrates both categorical and numerical attributes in one model, it can be used to conduct both identity disclosure analysis and value disclosure analysis. Our research will focus on value disclosure for numerical attributes and give solutions to control the privacy by tuning the corresponding parameters of the model.

2.3 Summary

Due to the increasing ability to trace, collect and analyze large amounts of personal or sensitive data, privacy has become an important issue in various domains. In this chapter, we provided an overview of the existing PPDM techniques in the literature, which can be classified into Secure Multi-party Computation, Data Randomization, and Data Imputation and Synthesis. Research issues related to Data Randomization were outlined and split into three particular areas: data reconstruction in the additive-noise-based perturbation model, data reconstruction in the projection-based perturbation model, and disclosure control in a general-location-model-based privacy preserving application.

CHAPTER 3: DISCLOSURE ANALYSIS OF THE ADDITIVE-NOISE-BASED PERTURBATION

3.1 Additive-Noise-Based Perturbation Model

In [Agrawal and Srikant 2000], Agrawal and Srikant first proposed the additive-noise-based perturbation method for building decision-tree classifiers. To hide the original n values x1, · · · , xn, n independent random noises e1, · · · , en are added, and the perturbed data y1, · · · , yn are released for data mining, where yi = xi + ei. The process can be illustrated by the following example.

Example 1 In Table 1.1, the numerical information (i.e.,
Balance, Income, Interest Paid, etc.) can be expressed by a matrix X where each row represents the record of one customer. By adding a random noise matrix E with the same dimensions, we get the perturbed data Y as follows:

Y = X + E

    | 17.334   88.759  ...    2.099 |   | 10    85  ...    2 |   | 7.334  3.759  ...  0.099 |
    | 19.199   77.537  ...   25.939 |   | 15    70  ...   18 |   | 4.199  7.537  ...  7.939 |
    | 59.199  128.447  ...   38.678 | = | 50   120  ...   35 | + | 9.199  8.447  ...  3.678 |
    | 51.208   30.313  ...  135.939 |   | 45    23  ...  134 |   | 6.208  7.313  ...  1.939 |
    |   ...      ...   ...     ...  |   | ...   ...  ...  ...|   |  ...    ...   ...   ...  |
    | 89.048  115.692  ...   21.318 |   | 80   110  ...   15 |   | 9.048  5.692  ...  6.318 |

The perturbed data are quite different from the original ones, and the distributions of the data also change considerably. Therefore, the privacy of the data providers is supposed to be well protected when the perturbed data are released instead. To be consistent throughout this chapter, we use X to denote the original data, E to denote the additive noise, and Y to denote the perturbed data. Formally, we have:

Y = X + E    (3.1)

Agrawal and Srikant also showed that the original density distribution of X can be reconstructed effectively given the perturbed data and the noise's distribution fE. Based on the reconstructed distribution, decision-tree classifiers can be built with accuracy comparable to that of classifiers built with the original data. Their reconstruction algorithm is sketched in Figure 3.1.

input:  Y, a given perturbed data set
        fE, distribution function of the noise
output: f̂X, an estimate of the distribution of the original variable
BEGIN
1   Assume f_X^0 is a uniform distribution.
2   j = 0
    REPEAT
3       f_X^{j+1}(a) = (1/n) Σ_{i=1}^{n} [ fE(yi − a) f_X^j(a) / ∫_{−∞}^{∞} fE(yi − z) f_X^j(z) dz ]
4       j = j + 1
    UNTIL (stopping criterion met)
5   f̂X = f_X^{j+1}(a)
END

Figure 3.1: Distribution Reconstruction Algorithm

The posterior distribution function for X is estimated by the average of n posterior distribution functions FXi|Yi=yi for the i.i.d. variables Xi, where i = 1, · · · , n.
Each FXi|Yi=yi is estimated using Bayes' rule [Fisz 1963]:

F_{Xi|Yi=yi}(a) = ∫_{−∞}^{a} f_{Xi|Yi=yi}(x) dx
              = ∫_{−∞}^{a} [ f_{Xi}(x) f_{Yi|Xi=x}(yi) / ∫_{−∞}^{∞} f_{Xi}(x') f_{Yi|Xi=x'}(yi) dx' ] dx
              = ∫_{−∞}^{a} fE(yi − x) fX(x) dx / ∫_{−∞}^{∞} fE(yi − x) fX(x) dx

The above bootstrap process stops when the difference between two successive estimates becomes small, and the process can also be extended to multivariate data. It seems that individual privacy is well protected in this model. However, intensive studies of this model [Kargupta et al. 2003; Huang, Du and Chen 2005; Guo and Wu 2006; Guo, Wu and Li 2006b; Guo, Wu and Li 2006a] indicated that privacy can still be threatened. The following section introduces those potential attacks and representative reconstruction algorithms. As part of the contributions of this research, an in-depth discussion of those reconstruction algorithms, especially the estimation strategy and estimation accuracy, will be given in the remaining sections of this chapter.

3.2 Data Reconstruction Attacks

The security of the additive-noise-based approach was first questioned by Kargupta et al. in [Kargupta et al. 2003]. They showed that attackers may exploit a spectral-filtering-based attack to derive individual estimates of the original values from the perturbed data. Huang et al. further proposed two other reconstruction algorithms which are efficient when the noise is independent of the original data. Similar to the Spectral Filtering technique, one is based on Principal Component Analysis (PCA); the other uses Maximum Likelihood Estimation (MLE) as the estimator. This section offers an overview of these reconstruction methods and presents the privacy issues we address based on them.

3.2.1 Spectral Filtering Method

Consider a noise matrix E with the same dimensions as the original data X. The entries of the noise are i.i.d. random variables with zero mean and variance σ².
The random value perturbation technique generates a perturbed data matrix Y = X + E. The objective of the spectral-filtering-based approach is to derive the estimate X̂ of X from the perturbed data Y based on random matrix theory. The authors of [Kargupta et al. 2003] provided an explicit filtering procedure, shown in Figure 3.2. They focused on the scenario where only a small number of instances exists in the data set. In this case we have λ_min^E = σ²(1 − 1/√q)² and λ_max^E = σ²(1 + 1/√q)², where q is linear in the ratio between the number of records and the number of attributes.

input:  Y, m × n matrix, a given perturbed data set
        σ², variance of the i.i.d. noise with zero mean
output: X̂, an estimate of the original data set
BEGIN
1   Compute the covariance matrix ΣY by ΣY = Y^T Y.
2   Perform the eigenvalue decomposition of ΣY:
        ΣY = QY ΛY QY^T
    where ΛY = diag(λY1, λY2, · · · , λYm) is a diagonal matrix with the eigenvalues on its diagonal (λY1 ≥ λY2 ≥ · · · ≥ λYm), and QY = (eY1, eY2, · · · , eYm) is an orthogonal matrix whose column vectors are the corresponding eigenvectors of ΣY.
3   Since the noise matrix E is generated using an i.i.d. distribution with zero mean and known variance, the eigenvalues of its covariance matrix are bounded by λ_min^E and λ_max^E according to random matrix theory. This pair of bounds is calculated as:
        λ_min^E = σ²(1 − 1/√q)²
        λ_max^E = σ²(1 + 1/√q)²
    where q is linear in the ratio between the number of records and the number of attributes.
4   Extract the components of ΣY which are related to the original data. The noise-related eigenvalues are λ_max^E ≥ λ_{E_i} ≥ λ_{E_{i+1}} ≥ · · · ≥ λ_{E_j} ≥ λ_min^E. The remaining k eigenvalues are related to the original data. The corresponding eigenvectors, QYk, form an orthonormal basis of a subspace χ̃. The orthogonal projection onto χ̃ is calculated as:
        P_χ̃ = QYk QYk^T
5   Obtain the estimated data set using X̂ = Y P_χ̃.
END

Figure 3.2: Spectral Filtering Process
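The procedure of Figure 3.2 can be sketched in NumPy as follows. This is a simplified illustration under assumptions of ours: the covariance is scaled by the number of records so that the noise eigenvalues concentrate near σ², the rank-2 test data are synthetic, and the cut-off λYi ≥ 2σ² follows the improved selection strategy proposed later in this chapter rather than the original λ_max^E bound.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, sigma = 5000, 10, 1.0

# Correlated original data: a rank-2 signal spread over n attributes.
basis = rng.normal(size=(2, n))
X = rng.normal(size=(m, 2)) @ basis * 3.0
E = rng.normal(0.0, sigma, size=(m, n))      # i.i.d. additive noise
Y = X + E

def spectral_filter(Y, sigma2):
    """Estimate X from Y = X + E by projecting Y onto the eigenvectors of
    the (record-count-scaled) covariance whose eigenvalues exceed the
    noise level; here a component is kept when lambda >= 2 * sigma^2."""
    cov = (Y.T @ Y) / Y.shape[0]             # noise contributes ~sigma^2 per eigenvalue
    lam, Q = np.linalg.eigh(cov)
    keep = Q[:, lam >= 2 * sigma2]           # signal-related eigenvectors
    P = keep @ keep.T                        # orthogonal projector onto their span
    return Y @ P

X_hat = spectral_filter(Y, sigma**2)
err_before = np.linalg.norm(Y - X) / np.linalg.norm(X)
err_after = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
print(err_after < err_before)                # filtering recovers X more closely than Y
```

On this synthetic data the projection keeps only the two signal components, so most of the noise energy is discarded while the signal passes through almost intact; this is exactly the privacy threat the attack exploits.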
The authors developed a method for filtering out the k principal eigenvalues. Since in most data mining applications the number of records far exceeds the number of attributes (hence q is large), we can see that λ_min^E ≈ λ_max^E ≈ σ².

3.2.2 PCA-Based Reconstruction Method

Huang et al. in [Huang, Du and Chen 2005] argued that when the original data are highly correlated, the information loss can be more clearly quantified by their proposed PCA-based method.

input:  Y, m × n matrix, a given perturbed data set
        σ², variance of the i.i.d. noise with zero mean
output: X̂, an estimate of the original data set
BEGIN
1   Compute the covariance matrix ΣY.
2   Conduct PCA on ΣY to get its eigenvalues and eigenvectors:
        ΣY = QY ΛY QY^T
    where ΛY = diag(λY1, λY2, · · · , λYm) is a diagonal matrix with the eigenvalues on its diagonal (λY1 ≥ λY2 ≥ · · · ≥ λYm), and QY = (eY1, eY2, · · · , eYm) is an orthogonal matrix whose column vectors are the corresponding eigenvectors of ΣY.
3   Derive an approximated covariance matrix of the original data:
        Σ̂X = Q̂X Λ̂X Q̂X^T
    where Λ̂X = diag(λY1 − σ², λY2 − σ², · · · , λYm − σ²) and Q̂X = QY.
4   Extract the k principal components of Σ̂X. Let Q̂Xk contain the corresponding eigenvectors of the principal components.
5   Obtain the estimated data set as X̂ = Y Q̂Xk Q̂Xk^T.
END

Figure 3.3: PCA-Based Reconstruction Method

The authors of [Huang, Du and Chen 2005] pointed out that the correlation among the original data and the correlation between the original data and the noise are key factors affecting the privacy of a randomized data set. Their theoretical discussion and empirical evaluation indicate that adding correlated noise, with a correlation matrix close to that of the original data, might better preserve privacy.

3.2.3 MLE-Based Reconstruction Method

Huang et al. in [Huang, Du and Chen 2005] also proposed another reconstruction method based on Maximum Likelihood Estimation (MLE), which can be conducted following the procedure in Figure 3.4.
When the noise has a correlation structure similar to that of the original data, the authors of [Huang, Du and Chen 2005] also modified their estimate to

x̂i = (Σ̂X^{-1} + ΣE^{-1})^{-1} (Σ̂X^{-1} μ̂X − ΣE^{-1} μE + ΣE^{-1} yi)

input:  Y, m × n matrix, a given perturbed data set
        σ², variance of the i.i.d. noise with zero mean
output: X̂, an estimate of the original data set
BEGIN
1   Compute the covariance matrix ΣY.
2   Conduct PCA on ΣY to get its eigenvalues and eigenvectors:
        ΣY = QY ΛY QY^T
    where ΛY = diag(λY1, λY2, · · · , λYm) is a diagonal matrix with the eigenvalues on its diagonal (λY1 ≥ λY2 ≥ · · · ≥ λYm), and QY = (eY1, eY2, · · · , eYm) is an orthogonal matrix whose column vectors are the corresponding eigenvectors of ΣY.
3   Derive an approximated covariance matrix of the original data:
        Σ̂X = Q̂X Λ̂X Q̂X^T
    where Λ̂X = diag(λY1 − σ², λY2 − σ², · · · , λYm − σ²) and Q̂X = QY.
4   Estimate the mean vector of the original data from the perturbed data: μ̂X = μY.
5   Obtain the estimated data set X̂ with its row vectors given by:
        x̂i = (Σ̂X^{-1} + (1/σ²)·I)^{-1} (Σ̂X^{-1} μ̂X + yi/σ²)
END

Figure 3.4: MLE-Based Reconstruction Method

3.2.4 Privacy Issues

Since the original data might be highly correlated and the noise is usually assumed to be Gaussian with zero mean, the above techniques have been shown to be able to keep the principal components of the original data while filtering out the noise. One common idea behind the Spectral Filtering technique and the PCA-based reconstruction method is that both estimate the individual values by projecting the perturbed data onto a space spanned by some representative eigenvectors. The information in the original data is expected to be largely preserved by such a projection, while the noise is expected to be reduced to a great extent. When we consider the scenario with a large number of instances in the data set, λ_min^E ≈ λ_max^E ≈ σ². Therefore, these two methods are essentially the same.
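The record-wise estimate of Figure 3.4 can be sketched as follows. This is our own illustration on synthetic Gaussian data; the small positive floor on the eigenvalues of Σ̂X is a practical safeguard we added to keep the estimated covariance invertible, and is not part of the procedure as stated.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, sigma = 5000, 4, 1.0

# Correlated Gaussian original data plus i.i.d. Gaussian noise.
A = rng.normal(size=(n, n))
X = rng.normal(size=(m, n)) @ A
Y = X + rng.normal(0.0, sigma, size=(m, n))

# Steps 1-4: estimate the covariance of X by subtracting sigma^2 from each
# eigenvalue of the sample covariance of Y, and estimate mu_X by mu_Y.
mu_hat = Y.mean(axis=0)
lam, Q = np.linalg.eigh(np.cov(Y, rowvar=False))
Sigma_X_hat = Q @ np.diag(np.maximum(lam - sigma**2, 1e-6)) @ Q.T

# Step 5: x_hat_i = (Sigma_X^-1 + I/sigma^2)^-1 (Sigma_X^-1 mu_X + y_i/sigma^2)
S_inv = np.linalg.inv(Sigma_X_hat)
M = np.linalg.inv(S_inv + np.eye(n) / sigma**2)
X_hat = (M @ (S_inv @ mu_hat[:, None] + Y.T / sigma**2)).T

print(np.linalg.norm(X_hat - X) < np.linalg.norm(Y - X))   # closer to X than Y is
```

The estimate shrinks each perturbed record toward the estimated mean along the low-variance directions of Σ̂X, which is why it lands closer to the original data than the raw perturbed record.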
In this research, we focus on the representative Spectral Filtering technique under this scenario. By adding one more dimension to the projection space χ, more information is preserved but more noise is also introduced. So should we add one more dimension? Previous works provided different strategies to reconstruct the data. The spectral-filtering-based method keeps the principal components of the perturbed data with eigenvalues no less than the variance of the noise, σ². However, this strategy was not proven to be optimal. The PCA-based method requires k principal components of the estimated covariance matrix of the original data, but it does not give an explicit strategy to determine the number of principal components (i.e., k). In this research, we propose an optimized strategy following the essential idea of the Spectral Filtering technique. We also notice that previous works [Huang, Du and Chen 2005; Kargupta et al. 2003] only empirically assessed the effects of perturbation on the accuracy of the estimated individual values. In this research, we explore the explicit relation between the estimation error (X̂ − X) and the noise E, and give clear bounds on ||X̂ − X||_F. The upper and lower bounds on the estimation error are significant for both data miners and attackers. As introduced earlier, it is possible to reconstruct the distribution of the original data given the distribution function of the noise [Agrawal and Srikant 2000]. Another challenge is whether the reconstructed distribution can be exploited by attackers or snoopers to threaten sensitive individual privacy. We present one simple attack using the Inter-Quantile Range (IQR) of the reconstructed distribution and show the disclosure of individual privacy from such aggregate information.

Definition 3.1 Let A ∈ C^{m×n}.
The Frobenius norm of A is the number

||A||_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} a_{ij}² )^{1/2}

The 2-norm of A is

||A||_2 = max_{x≠0} ||Ax||_2 / ||x||_2

where ||x||_2 denotes the 2-norm (Euclidean norm) of a vector. Definition 3.1 gives the mathematical forms of the Frobenius norm and the 2-norm, which will be used in what follows. In this study, we cast much of our analysis in terms of absolute and relative errors in matrix norms, instead of component-wise bounds. Basically, the Frobenius norm is used to measure the magnitude of the data in total, while the 2-norm is used to denote the largest eigenvalue of a covariance matrix. We list below some properties of matrix norms which will be used in our proofs; refer to linear algebra books (e.g., [Stewart and Sun 1990]) for more details.

1. ||AB||_F ≤ ||A||_F ||B||_F and ||AB||_2 ≤ ||A||_2 ||B||_2, when B ∈ C^{n×q}.
2. ||A||_2 ≤ ||A||_F ≤ √n ||A||_2.
3. ||A||_2 = sqrt(λ_max(A^T A)), the square root of the largest eigenvalue of A^T A.
4. If A is symmetric, then ||A||_2 = λ_max(A), the largest eigenvalue of A.

Definition 3.2 Let X be a subspace of C^n and let the columns of Q_X form an orthonormal basis for X. The matrix P_X = Q_X Q_X^T is called the orthogonal projection onto X.

3.3 An Improved Strategy for Noise Filtering

The original Spectral Filtering algorithm applied the following strategy to determine the first k eigen components.

Strategy 1: k = max{i | λYi ≥ λ_max^E}.

When the data set is large, λ_max^E ≈ λ_min^E ≈ λ_E, and the strategy becomes k = max{i | λYi ≥ λ_E}. We point out that Strategy 1, applied in [Kargupta et al. 2003], in general will not give the optimal reconstruction. The reason is that it aims to include all significant eigen components (with λYi > 0) in the projection space for reconstruction. However, since the inclusion of an eigen component also brings some additional noise projected onto that eigenvector, the benefit of including an insignificant eigen component may be outweighed by the side effect of the additional noise projected onto it.
In this research, we propose a new strategy (Strategy 2) which compares the benefit of including a component against the loss due to the additional projected noise. We show that Strategy 2 is expected to give an approximately optimal reconstruction. This strategy is also used in our bound analysis.

Strategy 2: The estimated data X̂ = Y P_χ̃ = Y Q_{Y_k} Q_{Y_k}^T is approximately optimal when k = max{i | λ_{Y_i} ≥ 2λ_E}.

Proof. In the Spectral Filtering method, when we select the first k components, the error matrix can be expressed as

    f(k) = X̂ − X = (X + E) Q_{Y_k} Q_{Y_k}^T − X
         = (X + E) Q_Y [I_k 0; 0 0] Q_Y^T − X
         = E Q_Y [I_k 0; 0 0] Q_Y^T − X (Q_Y I Q_Y^T − Q_Y [I_k 0; 0 0] Q_Y^T)
         = E Q_Y [I_k 0; 0 0] Q_Y^T − X Q_Y [0 0; 0 I_{n−k}] Q_Y^T        (3.2)

Similarly, when we select the first k+1 components, the error matrix becomes

    f(k+1) = E Q_Y [I_{k+1} 0; 0 0] Q_Y^T − X Q_Y [0 0; 0 I_{n−k−1}] Q_Y^T
           = E (Q_Y [I_k 0; 0 0] Q_Y^T + e_{Y_{k+1}} e_{Y_{k+1}}^T)
             − X (Q_Y [0 0; 0 I_{n−k}] Q_Y^T − e_{Y_{k+1}} e_{Y_{k+1}}^T)
           = (E Q_Y [I_k 0; 0 0] Q_Y^T − X Q_Y [0 0; 0 I_{n−k}] Q_Y^T)
             + E e_{Y_{k+1}} e_{Y_{k+1}}^T + X e_{Y_{k+1}} e_{Y_{k+1}}^T
           = f(k) + E e_{Y_{k+1}} e_{Y_{k+1}}^T + X e_{Y_{k+1}} e_{Y_{k+1}}^T        (3.3)

The last two terms in Equation 3.3 are the projections of the noise and of the data on the (k+1)-th eigenvector. Assuming e_{Y_i} ≈ e_{X_i}, the strength of the data projection can be approximated as

    ||X e_{Y_{k+1}} e_{Y_{k+1}}^T||_F^2 ≈ ||X e_{X_{k+1}} e_{X_{k+1}}^T||_F^2
      = Tr[(X e_{X_{k+1}} e_{X_{k+1}}^T)^T (X e_{X_{k+1}} e_{X_{k+1}}^T)]
      = Tr(e_{X_{k+1}} e_{X_{k+1}}^T X^T X e_{X_{k+1}} e_{X_{k+1}}^T)
      = Tr[e_{X_{k+1}} e_{X_{k+1}}^T ( Σ_{i=1}^{n} λ_{X_i} e_{X_i} e_{X_i}^T ) e_{X_{k+1}} e_{X_{k+1}}^T]
      = Tr(λ_{X_{k+1}} e_{X_{k+1}} e_{X_{k+1}}^T)
      = λ_{X_{k+1}}

For i.i.d. noise, the effect of the projection on any vector is the same. Thus,

    ||E e_{Y_{k+1}} e_{Y_{k+1}}^T||_F^2 ≈ λ_E

Hence, we include the i-th component only when the following condition is satisfied:

    λ_{X_i} ≥ λ_E        (3.4)

that is, when the benefit of including the i-th eigen component is larger than the loss due to the noise projected along that eigenvector. Consider data variables x_i, x_j and zero-mean noise variables e_i, e_j, where the noise is independent of the data.
According to the definitions of covariance and variance, it is easy to derive

    Cov(x_i + e_i, x_j + e_j)
      = ⟨(x_i + e_i)(x_j + e_j)⟩ − ⟨x_i + e_i⟩⟨x_j + e_j⟩
      = ⟨x_i x_j⟩ + ⟨e_i x_j⟩ + ⟨x_i e_j⟩ + ⟨e_i e_j⟩ − ⟨x_i⟩⟨x_j⟩
      = ⟨x_i x_j⟩ + ⟨e_i e_j⟩ − ⟨x_i⟩⟨x_j⟩
      = Cov(x_i, x_j) + Cov(e_i, e_j)

Therefore,

    Var(x_i + e_i) = Var(x_i) + Var(e_i)        (3.5)

Considering condition 3.4, we choose the components with λ_{Y_i} = λ_{X_i} + λ_E ≥ 2λ_E. Hence

    k = max{i | λ_{Y_i} ≥ 2λ_E}

3.4 Upper Bound Analysis

Traditional matrix perturbation theory [Stewart and Sun 1990] focuses on how a perturbation B (in the book, the author uses E to denote this perturbation) affects the matrix A. Specifically, it provides precise upper bounds on the eigenvalues, the angles between eigenvectors, or the invariant subspaces of a matrix A and those of its perturbation Ã = A + B, in terms of the norms of the perturbation matrix B. In our scenario,

    A = X^T X
    Ã = Y^T Y = (X + E)^T (X + E) = X^T X + E^T X + X^T E + E^T E
    B = Ã − A = E^T X + X^T E + E^T E

B can be interpreted as the perturbation on the covariance matrix caused by the additive noise E. The primary perturbation Y, which is obtained from X by the addition of an explicit perturbation E, is more meaningful to users than the derived perturbation B. Hence, it is more significant to consider how the primary perturbation E affects the data matrix X than how the derived perturbation B affects the covariance matrix A. Note that we have E^T X = X^T E = 0 when the data and noise are uncorrelated; the above can then be simplified as Y^T Y = (X + E)^T (X + E) = X^T X + E^T E. Since

    X̂ − X ≈ Y P_χ̃ − X P_χ = Y P_χ̃ − (Y − E) P_χ = Y (P_χ̃ − P_χ) + E P_χ
we have

    ||X̂ − X||_F ≈ ||Y (P_χ̃ − P_χ) + E P_χ||_F
                ≤ ||Y (P_χ̃ − P_χ)||_F + ||E P_χ||_F
                ≤ ||Y||_F ||P_χ̃ − P_χ||_F + ||E P_χ||_F        (3.6)

From Equation 3.6 we can see that the difference between the estimated data set and the original one is determined by the invariant subspaces P_χ of A and P_χ̃ of Ã, so it is natural to assess the bias between these subspaces.

Proposition 1. Let A ∈ R^{n×n} be a symmetric positive definite matrix, let λ_1 ≥ λ_2 ≥ ··· ≥ λ_n be its eigenvalues and e_1, e_2, ···, e_n the corresponding eigenvectors. Let the matrices X ∈ R^{n×k} and Y ∈ R^{n×(n−k)} be defined as X = [e_1 e_2 ··· e_k] and Y = [e_{k+1} ··· e_n], so that the matrix [X Y] ∈ R^{n×n} is orthogonal. Given a perturbation B, let Ã = A + B and ε = ||B||_F. Let χ and χ̃ be the invariant subspaces of A and Ã respectively, where χ is spanned by X, and let P_χ and P_χ̃ be the corresponding orthogonal projections onto these invariant subspaces. Define the eigengap δ = λ_k − λ_{k+1}. Then there exists a matrix P satisfying

    ||P||_F ≤ √2 ε / (δ − √2 ε)

such that the columns of X̃ = (X + Y P) form an orthonormal basis for the subspace spanned by the first k eigenvectors of Ã, and

    ||P_χ̃ − P_χ||_F ≤ 2ε / (δ − √2 ε)    ♦

Before we prove this proposition, let us introduce a lemma.

Lemma 3.1. Let A ∈ R^{n×n} be a symmetric positive definite matrix, let λ_1 ≥ λ_2 ≥ ··· ≥ λ_n be its eigenvalues and e_1, e_2, ···, e_n the corresponding eigenvectors. Let X = [e_1 e_2 ··· e_k] and Y = [e_{k+1} ··· e_n], so that the matrix [X Y] ∈ R^{n×n} is orthogonal. Given a perturbation B, let Ã = A + B, let ε = ||B||_F, and define δ = λ_k − λ_{k+1}. If δ > 2√2 ε, then there is a matrix P satisfying ||P||_F ≤ √2 ε / (δ − √2 ε) such that the columns of X̃ = X + Y P form an orthonormal basis for the subspace spanned by the first k eigenvectors ẽ_1, ẽ_2, ···, ẽ_k of Ã.

Proof. Since A is a symmetric positive definite matrix, we can apply the spectral decomposition to A:

    [X Y]^T A [X Y] = [L_1 0; 0 L_2]
(3.7)

where L_1 = diag(λ_1, ···, λ_k) and L_2 = diag(λ_{k+1}, ···, λ_n). Also, let

    B̃ = [X Y]^T B [X Y] = [F_11 F_12; F_21 F_22]        (3.8)

From Theorem V.2.8 of [Stewart and Sun 1990], there exists a matrix P satisfying

    ||P|| ≤ 2 ||F_21|| / (δ − ||F_11|| − ||F_22||)        (3.9)

Since [X Y] is orthogonal and ε = ||B||_F, it holds that

    ||B̃||_F = ||[X Y]^T B [X Y]||_F = ||B||_F = ε

Moreover, since ||F_11||_F^2 + ||F_12||_F^2 + ||F_21||_F^2 + ||F_22||_F^2 = ||B̃||_F^2 and B̃ is symmetric, we have

    ||F_21||_F^2 = ||F_12||_F^2 ≤ (1/2) ||B̃||_F^2        (3.10)
    ||F_21||_F = ||F_12||_F ≤ (1/√2) ε                    (3.11)
    (||F_11||_F + ||F_22||_F)^2 ≤ 2(||F_11||_F^2 + ||F_22||_F^2) ≤ 2||B̃||_F^2 = 2||B||_F^2
    ||F_11||_F + ||F_22||_F ≤ √2 ||B||_F                  (3.12)
    δ − ||F_11||_F − ||F_22||_F ≥ δ − √2 ε                (3.13)

Hence,

    ||P||_F ≤ √2 ε / (δ − √2 ε)        (3.14)

and the columns of X̃ = (X + Y P) form an orthonormal basis for the subspace spanned by the first k eigenvectors of Ã. The representation of Ã with respect to X̃ is

    L̃_1 = L_1 + F_11 + F_12 P        (3.15)

The eigenvalues associated with these k eigenvectors are the eigenvalues of L̃_1, and the eigenvalues associated with the rest of Ã's eigenvectors are the eigenvalues of

    L̃_2 = L_2 + F_22 − P F_12        (3.16)

Thus, to complete the proof of the lemma, it suffices to verify that the eigenvalues of L̃_1 are all (strictly) larger than the eigenvalues of L̃_2. Since δ > 2√2 ε, we have

    ||P||_F ≤ √2 ε / (δ − √2 ε) < 1        (3.17)

Similarly, we can derive

    ||F_11||_F + ||F_12||_F ≤ √2 ||B||_F        (3.18)

Then we have

    ||F_11 + F_12 P||_F ≤ ||F_11||_F + ||F_12 P||_F
                        ≤ ||F_11||_F + ||F_12||_F ||P||_F
                        ≤ ||F_11||_F + ||F_12||_F
                        ≤ √2 ||B||_F = √2 ε        (3.19)

By the same argument, we also have

    ||F_22 − P F_12||_F ≤ √2 ε        (3.20)

Since the Frobenius norm upper-bounds the spectral norm, this also shows

    ||F_11 + F_12 P||_2 ≤ √2 ε        (3.21)
    ||F_22 − P F_12||_2 ≤ √2 ε        (3.22)

Let the eigenvalues of L̃_1 be λ̃_1, λ̃_2, ···, λ̃_k, and those of L̃_2 be λ̃_{k+1}, ···, λ̃_n.
The spectral variation of L̃_1 with respect to L_1 is

    sv_{L_1}(L̃_1) = max_{1≤i≤k} min_{1≤j≤k} |λ̃_i − λ_j|        (3.23)

The spectral variation of L̃_2 with respect to L_2 is

    sv_{L_2}(L̃_2) = max_{k+1≤i≤n} min_{k+1≤j≤n} |λ̃_i − λ_j|        (3.24)

From Corollary IV.3.4 of [Stewart and Sun 1990]:

    sv_{L_1}(L̃_1) ≤ ||F_11 + F_12 P||_2 ≤ √2 ε        (3.25)
    sv_{L_2}(L̃_2) ≤ ||F_22 − P F_12||_2 ≤ √2 ε        (3.26)

The above conditions ensure that the eigenvalues of L̃_1 lie in the interval [λ_k − √2 ε, λ_1 + √2 ε], and that those of L̃_2 lie in the interval [λ_n − √2 ε, λ_{k+1} + √2 ε]. As we know

    λ_k − λ_{k+1} = δ > 2√2 ε        (3.27)

we have

    λ_k − √2 ε > λ_{k+1} + √2 ε        (3.28)

which implies that all of L̃_1's eigenvalues are strictly larger than all of L̃_2's eigenvalues. ♦

Proof of Proposition 1. By Lemma 3.1, we can find an invariant subspace of Ã with orthonormal basis X̃ = (X + Y P)(I + P^T P)^{−1/2} and corresponding orthogonal projection P_χ̃ = X̃ X̃^T. Our aim is to bound ||X̃ − X||_F as well as ||P_χ − P_χ̃||_F. Let M = P^T P; then ||M||_F ≤ ||P||_F^2 ≤ 2ε^2/δ̃^2 < 1, where δ̃ = δ − √2 ε.

    ||X̃ − X||_F = ||(X + Y P)(I + P^T P)^{−1/2} − X||_F
                = ||(X + Y P) − (X + Y P)(I − (I + M)^{−1/2}) − X||_F
                ≤ ||Y P||_F + ||(X + Y P)(I − (I + M)^{−1/2})||_F
                ≤ ||P||_F + (||X||_F + ||Y P||_F) ||I − (I + M)^{−1/2}||_F
                ≤ ||P||_F + (||X||_F + ||P||_F) (2ε^2/δ̃^2)
                ≤ √2 ε/δ̃ + (√k + √2 ε/δ̃)(2ε^2/δ̃^2)
                < √2 ε/δ̃ + (√k + 1)(√2 ε/δ̃)
                = (√k + 2) √2 ε/δ̃

According to [Stewart and Sun 1990] (p. 232), we can derive:

    ||P_χ − P_χ̃||_F ≤ 2√2 ||F_12||_F / (δ − ||F_11||_F − ||F_22||_F)
                    ≤ 2√2 · (ε/√2) / (δ − √2 ε)
                    = 2ε / (δ − √2 ε)        (3.29)

Proposition 2. Given a symmetric matrix A ∈ R^{n×n} and a symmetric perturbation B, let Ã = A + B. Let the eigenvalues of B be ε_1 ≥ ε_2 ≥ ··· ≥ ε_n. Let λ_k and λ̃_k be the eigenvalues of A and Ã respectively (k = 1, ···, n), and let δ = λ_k − λ_{k+1}, δ̃ = λ̃_k − λ̃_{k+1}, and δ_E = ε_1 − ε_n. Then δ ∈ [δ̃ − δ_E, δ̃ + δ_E]. ♦

Proof.
From Corollary 4.9 in [Stewart and Sun 1990], we have:

    λ_k ∈ [λ̃_k − ε_1, λ̃_k − ε_n]
    λ_{k+1} ∈ [λ̃_{k+1} − ε_1, λ̃_{k+1} − ε_n]        (3.30)

Since δ = λ_k − λ_{k+1},

    δ ≥ (λ̃_k − ε_1) − (λ̃_{k+1} − ε_n) = (λ̃_k − λ̃_{k+1}) − (ε_1 − ε_n) = δ̃ − δ_E
    δ ≤ (λ̃_k − ε_n) − (λ̃_{k+1} − ε_1) = (λ̃_k − λ̃_{k+1}) + (ε_1 − ε_n) = δ̃ + δ_E

Theorem 3.1. Given a data set X ∈ R^{m×n} and a perturbation noise set E ∈ R^{m×n}, let Y = X + E and let X̂ denote the estimate obtained by the spectral-filtering-based method. We have

    ||X̂ − X||_F ≤ ||Y||_F · 2||B||_F / ((λ_{Y_k} − ||B||_2) − √2 ||B||_F) + ||E P_χ||_F        (3.31)

where B = E^T X + X^T E + E^T E is the derived perturbation on the covariance matrix A = X^T X. ♦

Proof. From Proposition 1, ||P_χ̃ − P_χ||_F ≤ 2ε/(δ − √2 ε), so we have

    ||X̂ − X||_F ≤ ||Y||_F ||P_χ̃ − P_χ||_F + ||E P_χ||_F
                ≤ ||Y||_F · 2ε / (δ − √2 ε) + ||E P_χ||_F

Since the original data are correlated, the remaining eigenvalues λ_{X_{k+1}}, ···, λ_{X_n} are close to 0, and therefore δ ≈ λ_{X_k}. From Proposition 2, we know λ_{X_k} ∈ [λ_{Y_k} − ε_1, λ_{Y_k} − ε_n]. As ||B||_2 = ε_1 (Property 4), we have

    λ_{X_k} ≥ λ_{Y_k} − ε_1 = λ_{Y_k} − ||B||_2

Hence,

    ||X̂ − X||_F ≤ ||Y||_F · 2||B||_F / (δ − √2 ||B||_F) + ||E P_χ||_F
                ≤ ||Y||_F · 2||B||_F / ((λ_{Y_k} − ||B||_2) − √2 ||B||_F) + ||E P_χ||_F

Corollary 1. If the noise is generated by an i.i.d. Gaussian distribution with zero mean and known variance σ^2, the upper bound of the reconstruction error can be expressed as

    ||X̂ − X||_F ≤ ||Y||_F · 2||E||_F^2 / ((λ_{Y_k} − ||B||_2) − √2 ||E||_F^2) + √(k/n) ||E||_F   (Strategy 1)   (3.32)
    ||X̂ − X||_F ≤ ||Y||_F · 2||E||_F^2 / (δ̃ − √2 ||E||_F^2) + √(k/n) ||E||_F                    (Strategy 2)   (3.33)

Proof. For Strategy 1, k always equals the number of principal components in the original data set. If the original data are highly correlated, the remaining eigenvalues λ_{X_{k+1}}, ···, λ_{X_n} are close to 0, and therefore δ ≈ λ_{X_k}. From Proposition 2, the eigengap for this strategy can be bounded as δ ≈ λ_{X_k} ≥ λ_{Y_k} − ε_1. Hence, as ||B||_2 = ε_1, Equation 3.31 becomes

    ||X̂ − X||_F ≤ ||Y||_F · 2||B||_F / ((λ_{Y_k} − ||B||_2) − √2 ||B||_F) + ||E P_χ||_F

In general, the perturbation on the covariance matrix can be expressed as B = E^T X + X^T E + E^T E.
When the noise and the signal are completely independent, the above simplifies to B = E^T E. In terms of the Frobenius norm, we have ||B||_F = ||E^T E||_F ≤ ||E||_F^2. When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and known variance σ^2, the squared error of E P_χ is ||E P_χ||_F^2 = σ^2 m k [Huang, Du and Chen 2005], and ||E||_F = √(σ^2 m n), so we have

    ||E P_χ||_F = √(σ^2 m k) = √(k/n) · √(σ^2 m n) = √(k/n) ||E||_F

Then Equation 3.31 becomes

    ||X̂ − X||_F ≤ ||Y||_F · 2||E||_F^2 / ((λ_{Y_k} − ||B||_2) − √2 ||E||_F^2) + √(k/n) ||E||_F

For Strategy 2, when the noise is i.i.d. Gaussian, ε_1 ≈ ε_n for a large population; in other words, δ_E in Proposition 2 is close to zero. Hence Equation 3.31 becomes

    ||X̂ − X||_F ≤ ||Y||_F · 2||E||_F^2 / (δ̃ − √2 ||E||_F^2) + √(k/n) ||E||_F

When the noise is completely correlated with the data, ||E P_χ||_F ≈ ||E||_F, as k represents the number of principal components and hence ||X P_χ||_F ≈ ||X||_F. Then Equation 3.31 becomes

    ||X̂ − X||_F ≤ ||Y||_F · 2||E||_F^2 / ((λ_{Y_k} − ||B||_2) − √2 ||E||_F^2) + ||E||_F        (3.34)

The upper bound given in Theorem 3.1 determines how close the estimate achieved by attackers is to the original data when the spectral-filtering-based method is exploited. This represents a serious threat of privacy breaches, as attackers know exactly how close their estimates are. Please note that ||E||_F and ||B||_2 are assumed to be available to attackers, as they can easily be computed from the published information about the noise distribution.

3.5 Lower Bound Analysis

Let Y = X + E be a perturbation of X, and let

    L_Y^T Y R_Y = [Σ_Y; 0]

be the Singular Value Decomposition (SVD) of Y. Weyl and Mirsky gave the basic perturbation bounds for the singular values of such a matrix.

Theorem 3.2 (Weyl) [Weyl 1911].

    |σ_{Y_i} − σ_{X_i}| ≤ ||E||_2,   i = 1, ···, n

Theorem 3.3 (Mirsky) [Mirsky 1960].

    √( Σ_i (σ_{Y_i} − σ_{X_i})^2 ) ≤ ||E||_F

Let B be any matrix of rank not greater than k, and denote its singular values by ψ_1 ≥ ··· ≥ ψ_n.
Based on Mirsky's theorem, the sum of squares of the n − k smallest singular values of X is not greater than ||B − X||_F^2. This conclusion can be expressed as:

    ||B − X||_F^2 ≥ Σ_{i=1}^{n} |ψ_i − σ_{X_i}|^2 ≥ σ_{X_{k+1}}^2 + ··· + σ_{X_n}^2 = ||X_k − X||_F^2

Motivated by these perturbation bounds for the singular values of a matrix, we propose an SVD-based reconstruction method for which the error can be lower bounded. In this research, we further analyze our SVD-based reconstruction method as well as the spectral filtering technique and prove their equivalence. The lower bound of the reconstruction error using the spectral filtering technique is then derived.

3.5.1 SVD-based Reconstruction Method

Singular Value Decomposition (SVD) decomposes a matrix X ∈ R^{m×n} (say m ≥ n) into the product of two unitary matrices L_X ∈ R^{m×m}, R_X ∈ R^{n×n} and a pseudo-diagonal matrix D_X = diag(σ_{X_1}, ···, σ_{X_ρ}) ∈ R^{m×n}, such that

    X = L_X D_X R_X^T   or   X = Σ_{i=1}^{n} σ_{X_i} l_{X_i} r_{X_i}^T

The diagonal elements σ_{X_i} of D_X are referred to as singular values, which are, by convention, sorted in descending order: σ_{X_1} ≥ σ_{X_2} ≥ ··· ≥ σ_{X_n} ≥ 0. The columns l_{X_i} and r_{X_i} of L_X and R_X are respectively called the left and right singular vectors of X. Similarly, let Y = X + E be a perturbation of X and let Y = L_Y D_Y R_Y^T be the SVD of Y. Figure 3.5 shows our SVD-based reconstruction method.

    Input:  Y, a given perturbed data set; E, a noise data set
    Output: X̂, a reconstructed data set
    BEGIN
    1. Apply SVD on Y to get Y = L_Y D_Y R_Y^T
    2. Apply SVD on E and assume σ_{E_max} ≈ σ_{E_min} ≈ σ_E
    3. Determine the first k components of Y by k = max{i | σ_{Y_i} ≥ √2 σ_E};
       assume σ_{Y_1} ≥ σ_{Y_2} ≥ ··· ≥ σ_{Y_k}, and let l_{Y_i}, r_{Y_i} be the
       corresponding left and right singular vectors
    4. Reconstruct X approximately as X̂ = Y_k = Σ_{i=1}^{k} σ_{Y_i} l_{Y_i} r_{Y_i}^T  (k ≤ ρ)
    END

    Figure 3.5: SVD-Based Reconstruction Algorithm
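As a concrete illustration, the algorithm of Figure 3.5 can be sketched in a few lines of numpy. This is our own sketch, not the dissertation's implementation: the helper name `svd_reconstruct` and the toy data are ours, and σ_E is estimated from ||E||_F/√n (the i.i.d. approximation discussed below) rather than by an explicit SVD of E.

```python
import numpy as np

def svd_reconstruct(Y, E_fro):
    """Sketch of the Figure 3.5 algorithm: keep the singular components of Y
    with sigma_Yi >= sqrt(2)*sigma_E, where sigma_E ~ ||E||_F / sqrt(n)."""
    m, n = Y.shape
    L, s, Rt = np.linalg.svd(Y, full_matrices=False)  # step 1: Y = L_Y D_Y R_Y^T
    sigma_E = E_fro / np.sqrt(n)                      # step 2 (i.i.d.-noise approximation)
    k = int(np.sum(s >= np.sqrt(2) * sigma_E))        # step 3
    X_hat = (L[:, :k] * s[:k]) @ Rt[:k, :]            # step 4: sum of top-k rank-1 terms
    return X_hat, k

# toy rank-2 data plus i.i.d. Gaussian noise (made-up example)
rng = np.random.default_rng(0)
W = np.array([[2.0, 0, 0, 0, 0, 0],
              [0, 3.0, 0, 0, 0, 0]])
X = rng.normal(size=(500, 2)) @ W
E = rng.normal(scale=0.3, size=X.shape)
X_hat, k = svd_reconstruct(X + E, np.linalg.norm(E, 'fro'))
```

On this toy data the threshold √2 σ_E separates the two signal components from the four noise-only components, and the reconstruction error is well below the raw noise magnitude.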
Please note that the strategy used for the SVD-based reconstruction is k = max{i | σ_{Y_i} ≥ √2 σ_E}, where the largest singular value of the added noise is calculated as σ_E ≈ ||E||_F / √n (step 2 in the algorithm). For i.i.d. noise, we have ||E||_F = √(m n σ^2), where the variance of the noise is σ^2 = λ(E^T E)/(m − 1) = ||E||_2^2/(m − 1). As we know, the largest singular value of E is ||E||_2. Hence, we have

    σ_E = ||E||_2 = ||E||_F √(m − 1) / √(m n) ≈ ||E||_F / √n

3.5.2 Lower Bound

Consider X̂ = Y_k = L_{Y_k} D_{Y_k} R_{Y_k}^T as the estimate of the original data set X. The estimation error between X̂ and X has the lower bound

    ||X̂ − X||_F ≥ ||X_k − X||_F

where k = max{i | σ_{Y_i} ≥ √2 σ_E}. The relationship between the reconstruction bias and the perturbation (especially the lower bound) will, in turn, guide us in adding noise to the original data set. The lower bound gives data owners a worst-case security assurance, since for any matrix B (with singular values ψ_i) of rank not greater than k derived by attackers, we have

    ||B − X||_F^2 ≥ Σ_{i=1}^{n} |ψ_i − σ_{X_i}|^2 ≥ σ_{X_{k+1}}^2 + ··· + σ_{X_n}^2 = ||X_k − X||_F^2

In order to preserve privacy, data owners need to make sure that ||X̂ − X||_F / ||X||_F is greater than the privacy threshold τ specified by users. In the following, we answer how to determine the magnitude of noise that satisfies a given privacy threshold. Based on the derived lower bound,

    τ ||X||_F ≤ ||X_k − X||_F = √(σ_{X_{k+1}}^2 + ··· + σ_{X_n}^2)

Hence the k which might be chosen by attackers can be determined by

    k = max{ i | τ ≤ √(σ_{X_{i+1}}^2 + ··· + σ_{X_n}^2) / ||X||_F }        (3.35)

Based on our approximately optimal strategy, λ_{X_i} ≥ λ_E, the data owner should add i.i.d. noise E whose eigenvalue of E^T E satisfies

    λ_{X_{k+1}} < λ_E ≤ λ_{X_k}        (3.36)

Since λ_E is the eigenvalue of E^T E, the variance of the noise can be derived as Var(E) = λ_E / (m − 1), where m is the number of rows in E.

3.5.3 Equivalence of the Two Reconstruction Methods

SVD explicitly constructs orthonormal bases for the nullspace and range of a matrix, X = L_X D_X R_X^T.
The non-zero singular values of X are precisely the square roots of the non-zero eigenvalues of the positive semi-definite matrix X X^T, which are also the square roots of the non-zero eigenvalues of X^T X. Furthermore, the columns of L_X are eigenvectors of X X^T and the columns of R_X are eigenvectors of X^T X.

Theorem 3.4. The reconstructed data from Spectral Filtering is X̂_SF = Y P_χ̃ = Y Q_{Y_k} Q_{Y_k}^T with k = max{i | λ_{Y_i} ≥ 2λ_E}, while the reconstructed data from SVD is X̂_SVD = L_{Y_k} D_{Y_k} R_{Y_k}^T with k = max{i | σ_{Y_i} ≥ √2 σ_E}. We have X̂_SF = X̂_SVD, and the k determined by k = max{i | λ_{Y_i} ≥ 2λ_E} and the k determined by k = max{i | σ_{Y_i} ≥ √2 σ_E} are exactly the same.

Proof. We first prove that the two methods are equivalent. Since R_{Y_k} = R_Y [I_k; 0],

    Y R_{Y_k} = Y R_Y [I_k; 0] = (L_Y D_Y R_Y^T) R_Y [I_k; 0] = L_Y D_Y [I_k; 0] = L_{Y_k} D_{Y_k}

Since the columns of the right singular vectors R_Y are the eigenvectors of Y^T Y, we have Q_Y = R_Y. Then

    X̂_SF = Y R_{Y_k} R_{Y_k}^T = L_{Y_k} D_{Y_k} R_{Y_k}^T = X̂_SVD

We then prove the equivalence of determining k. Based on the fact that the singular values of X are the square roots of the eigenvalues of X^T X (or X X^T), we have:

    σ_{Y_i} = √(λ_i(Y^T Y)) = √(λ_{Y_i})
    √2 σ_E = √(2 λ(E^T E)) = √(2 λ_E)

so

    σ_{Y_i} < √2 σ_E ⟺ λ_{Y_i} < 2 λ_E

Hence max{i | σ_{Y_i} ≥ √2 σ_E} = max{i | λ_{Y_i} ≥ 2 λ_E}.

3.6 Potential Attack Based on Distribution

In the previous sections, we discussed one class of approaches which generally attempt to hide sensitive data by randomly modifying the data values using additive noise, and which aim to reconstruct the original distribution closely at an aggregate level. The aggregate privacy preserved in this model was investigated by exploring the upper and lower bounds of the estimation bias. However, another challenge is whether the reconstructed distribution can be exploited by attackers or snoopers to derive sensitive individual data.
This section presents one simple attack using the Inter-Quantile Range (IQR) of the reconstructed distribution and shows the disclosure of individual privacy from such aggregate information.

Let us consider a scenario where we have a set of n original data values x_1, ···, x_n. Each x_i is associated with one privacy interval [w_i^l, w_i^u], where x_i ∈ [w_i^l, w_i^u]. The privacy interval [w_i^l, w_i^u] represents the privacy requirement on the sensitive data x_i pre-defined by its owner. Please note that this privacy interval is specified by the data owner, and the data holder is required to satisfy all individuals' privacy concerns, although different data owners may have different (possibly even unreasonably restrictive) privacy concerns for their individual data. In other words, the data owner disallows attackers or snoopers from deriving or estimating this sensitive value to within its privacy interval.

Most existing randomization-based approaches add a random number e_i, drawn from some known distribution, to x_i, the value of a sensitive attribute. The randomized value y_i = x_i + e_i is then released. To preserve privacy, the variance of the noise distribution is expected to be large enough that all y_i (or, in practice, a large fraction of the y_i) satisfy y_i ∉ [w_i^l, w_i^u]. However, the reconstructed distribution F̂_X also provides a certain level of knowledge which can be exploited by attackers or snoopers to estimate individual values with a higher level of accuracy. For example, from the reconstructed distribution, snoopers may learn a piece of aggregate information such as "95% of customers from the 28223 zip code and with Asian background have wages in [70k, 80k]"; they can then safely conclude that a customer's wage lies in [70k, 80k] with 95% confidence once that customer is determined to belong to this class. If [70k, 80k] happens to lie completely within this customer's privacy interval, we say disclosure happens.
3.6.1 Quantification of Privacy

The authors of [Agrawal and Agrawal 2001] proposed a metric to measure privacy based on Shannon's information theory [Shannon 1948; Shannon 1949]. As another approach to measuring privacy, the theory of coalitional games has been applied to determine the cost of each piece of information. In [Agrawal and Srikant 2000], privacy is measured in terms of confidence intervals. [Rizvi and Haritsa 2002] also suggested its own way of measuring privacy. As pointed out earlier, the reconstruction of the data distribution provides a certain level of knowledge which can be used by attackers or snoopers to estimate a data value with a higher level of accuracy. In this part, we propose a new measure to quantify privacy, which will also be used to control disclosure in Chapter 5.

Definition 3.3 (Quantile [Conover 1998]). A random variable X is defined by a distribution function F_X (or a probability density function f_X). The number x_p, for a given value of p between 0 and 1, is called the p-th quantile of the random variable X if P(X < x_p) ≤ p and P(X > x_p) ≤ 1 − p, where P(X < x) = F_X(x) = ∫_{−∞}^{x} f_X(t) dt.

Definition 3.4 (Inter-Quantile Range (IQR) [Conover 1998]). The Inter-Quantile Range [x_{α_1}, x_{α_2}] is defined by P(x_{α_1} ≤ x ≤ x_{α_2}) ≥ c%, where c = α_2 − α_1 denotes the confidence. (The Inter-Quantile Range is a generalization of the Inter-Quartile Range, which only considers the 1/4, 1/2, 3/4 and 1 quantile points.)

The IQR [x_{α_1}, x_{α_2}] measures the amount of spread and variability of the random variable. Hence, it can be used by attackers or snoopers to estimate the range of each individual data value x_i with confidence c = α_2 − α_1. For a given c (e.g., 95%), the range [α_1, α_2] is not unique; we use [(1 − c)/2, (1 + c)/2] in this study, so the corresponding IQR is [x_{(1−c)/2}, x_{(1+c)/2}].
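For instance, the [(1 − c)/2, (1 + c)/2] IQR of an empirical reconstructed distribution can be computed directly from its quantiles. The sketch below is our own illustration of Definitions 3.3 and 3.4 with made-up data (the helper name `iqr` is ours):

```python
import numpy as np

def iqr(samples, c=0.95):
    """Inter-Quantile Range [x_{(1-c)/2}, x_{(1+c)/2}] of an empirical sample."""
    return (float(np.quantile(samples, (1 - c) / 2)),
            float(np.quantile(samples, (1 + c) / 2)))

# made-up "reconstructed" wage distribution
rng = np.random.default_rng(7)
wages = rng.normal(75_000, 2_500, size=10_000)
lo, hi = iqr(wages, c=0.95)   # roughly 75000 +/- 1.96 * 2500
```

By construction, the interval [lo, hi] covers about c of the sampled values, which is exactly the coverage-range reading of confidence used below.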
The authors of [Agrawal and Srikant 2000] use a similar measure that defines privacy as follows: if the original value can be estimated with confidence c to lie in the interval [x_α, x_β], then the interval width x_β − x_α defines the amount of privacy at confidence level c. Please note that the confidence interval here differs from the classic one defined in statistics, where a confidence interval gives an estimated range of values likely to include an unknown population parameter (e.g., a mean value), the estimated range being calculated from a given set of sample data; if independent samples are taken repeatedly from the same population and a confidence interval is calculated for each sample, then a certain percentage of the intervals will include the unknown population parameter. In our scenario, the confidence interval means a coverage range. In other words, if we know that [x_{α_1}, x_{α_2}] covers c percent of the data, we can say [x_{α_1}, x_{α_2}] covers a given data value with confidence c.

If the estimated range [x_{(1−c)/2}, x_{(1+c)/2}] contains the individual's private data x_i and falls fully within the individual's privacy interval [w_i^l, w_i^u], we say that this individual's data is fully disclosed by the IQR attack. In general, we use the ratio of interval lengths

    d_i = |[w_i^l, w_i^u] ∩ [x_{(1−c)/2}, x_{(1+c)/2}]| / |[w_i^l, w_i^u] ∪ [x_{(1−c)/2}, x_{(1+c)/2}]|        (3.37)

to measure how close the IQR obtained by attackers or snoopers is to the individual's privacy interval. In our experiments, we compute both the number of individuals with full disclosure and the average disclosure

    D = (Σ_{i=1}^{n} d_i) / n        (3.38)

3.6.2 Extension to Multiple Confidential Attributes

When multiple confidential attributes exist, we can extend the IQR for a single numerical attribute to a confidence region for all attributes together. In practice, the joint distribution of multiple numerical attributes is often modeled by one multivariate normal distribution N(µ, Σ), where µ denotes a vector of means and Σ denotes a covariance matrix.
In the p-dimensional space, the confidence region is an ellipsoidal region given by a probability density contour. In this part we present some known results from statistics about the density contours of the multivariate normal distribution.

Result 1 (Constant probability density contour) ([Johnson and Wichern 1998], p. 134). Let Z be distributed as N_p(µ, Σ) with |Σ| > 0. Then the N_p(µ, Σ) distribution assigns probability 1 − α to the solid ellipsoid {z : (z − µ)^T Σ^{−1} (z − µ) ≤ χ_p^2(α)}, where χ_p^2(α) denotes the upper (100α)-th percentile of the χ_p^2 distribution with p degrees of freedom. The ellipsoid is centered at µ and has axes ±c √(λ_i) e_i, where c^2 = χ_p^2(α) and Σ e_i = λ_i e_i, i = 1, ···, p.

The multivariate normal density is constant on surfaces where the squared distance (z − µ)^T Σ^{−1} (z − µ) equals a constant c^2. The chi-square distribution determines the variability of the sample variance. Probabilities are represented by volumes under the surface over regions defined by intervals of the z_i values. The axes of each ellipsoid of constant density are in the directions of the eigenvectors of Σ^{−1}, and their lengths are proportional to the square roots of the eigenvalues λ_i of Σ.

Result 2 (Volume of an ellipsoid) ([Grotschel, Lovasz and Schrijver 1988]). The volume of an ellipsoid {z : (z − µ)^T A^{−1} (z − µ) ≤ 1} determined by a positive definite p × p matrix A is given by vol(E) = η |A^{1/2}|, where η is the volume of the unit ball in R^p.

Result 3 gives the general result concerning the projection of an ellipsoid onto a line in p-dimensional space.
When ` is a unit vector, the ` `T ` √ √ shadow extends c `T A` units, so | zT ` |≤ c `T A`. 3.7 Evaluation In our experiment, we use ae(X, X̂) = kX̂ − XkF , the absolute error, and re(X, X̂) = kX̂ − XkF /kXkF , the relative error in X̂ regarded as an approximation to X. 3.7.1 Scenario of Adding Noise In [Kargupta et al. 2003], the noise is assumed as following i.i.d. Gaussian distribution with mean zero and known variance (hence the noise is completely uncorrelated with data). In this section, we consider different scenarios of noise addition. • Case 1. E is an additive noise following i.i.d. Gaussian distribution N (0, ΣE ), where covariance matrix ΣE = diag(σ 2 , · · · , σ 2 ) (The same as in [Kargupta et al. 2003]). • Case 2. E is an additive noise following Gaussian distribution N (0, ΣE ), where covariance matrix ΣE = c × diag(σ12 , σ22 · · · , σn2 ). Here each feature is applied with a separate Gaussian distribution with its variance linear with the variance of original data. • Case 3. E is an additive noise following Gaussian distribution N (0, ΣE ), where co-variance matrix ΣE = c × ΣX . ΣX is the covariance matrix of the original data set. Here the covariance matrix of noise is linear with that of original data. In other words, the noise is completely correlated with data. Case 1 represents the scenario where the noise is completely independent with original data. One example of this scenario is the online collection of customer’s individual data (as the other customer’s data is unknown during data collection). Case 2 represents the variance of the original data is a-priori known while case 3 represents the whole covariance matrix of the original data is used for noise generation. 
    noise                E1      E2      E3      E4      E5      E6      E7      E8      E9
    ||E||_F/||X||_F      0.628   0.786   0.954   1.178   1.366   1.677   1.944   2.121   2.985
    variance             0.213   0.333   0.491   0.750   1.007   1.524   2.040   2.430   4.814

    Type 1, re(X, X̂):
    k=1                  0.821   0.825   0.830   0.839   0.847   0.863   0.877   0.890   0.960
    k=2                  0.649   0.659   0.671   0.692   0.711   0.750   0.783   0.810  *0.956
    k=3                  0.440   0.461   0.488   0.529   0.565   0.636   0.694  *0.739   0.964
    k=4                  0.297   0.337  *0.383  *0.450  *0.506  *0.607  *0.687   0.748   1.032
    k=5                  0.271  *0.324   0.383   0.465   0.532   0.651   0.745   0.816   1.141
    k=6                *†0.260  †0.325  †0.395  †0.489  †0.567  †0.699  †0.805  †0.883  †1.245
    k=7                  0.282   0.353   0.428   0.530   0.614   0.757   0.873   0.956   1.348

    Type 2, re(X, X̂):
    c                    0.402   0.630   0.927   1.415   1.903   2.864   3.850   4.583   9.080
    k=1                  0.826   0.832   0.841   0.854   0.868   0.897   0.926   0.945  †1.072
    k=2                  0.654   0.667   0.684   0.709   0.748   0.819   0.876   0.911   1.125
    k=3                  0.452   0.479   0.513   0.564   0.613   0.697  †0.778  †0.830   1.120
    k=4                  0.309   0.353  †0.405  †0.479  †0.544  †0.652   0.900   0.967   1.317
    k=5                  0.279  †0.345   0.462   0.552   0.631   0.761   1.008   1.085   1.487
    k=6                 †0.255   0.391   0.512   0.616   0.706   0.856   1.103   1.190   1.634
    k=7                  0.294   0.431   0.558   0.673   0.774   0.939   1.190   1.286   1.777

    Type 3, re(X, X̂):
    c                    0.402   0.630   0.927   1.415   1.903   2.864   3.850   4.583   9.080
    k=1                  0.893   0.935   0.989   1.067   1.140   1.276  †1.398  †1.485  †1.926
    k=2                  0.800   0.879   0.977   1.117  †1.240  †1.455   1.644   1.769   2.420
    k=3                  0.702  †0.824  †0.964  †1.156   1.318   1.593   1.830   1.981   2.779
    k=4                 †0.650   0.797   0.961   1.177   1.358   1.659   1.918   2.083   2.943
    k=5                  0.636   0.788   0.956   1.177   1.361   1.667   1.930   2.097   2.968
    k=6                  0.627   0.783   0.955   1.179   1.366   1.675   1.940   2.109   2.987
    k=7                  0.627   0.783   0.955   1.179   1.366   1.675   1.940   2.109   2.987

Table 3.1: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the PATTERNS data set. The values with * denote the results following Strategy 2, while the values with † denote the results following Strategy 1. The bold values indicate the best estimations achieved by the Spectral Filtering technique.

    noise                E1      E2      E3      E4      E5      E6      E7      E8      E9
    ||E||_F/||X||_F      0.172   0.176   0.188   0.218   0.231   0.243   0.266   0.297   0.326
    variance             0.005   0.0052  0.006   0.008   0.009   0.01    0.012   0.015   0.018

    Type 1, re(X, X̂):
    k=1                  0.267   0.268   0.269   0.273   0.275   0.276   0.280   0.285   0.290
    k=2                  0.212   0.213   0.217   0.226   0.230   0.234   0.243   0.254  *0.266
    k=3                  0.185   0.186   0.192   0.207  *0.214  *0.221  *0.234  *0.254   0.269
    k=4                  0.176   0.178  *0.186  *0.206   0.216   0.225   0.241   0.265   0.286
    k=5                  0.173  *0.176   0.186   0.211   0.223   0.233   0.253   0.281   0.306
    k=6                *†0.172  †0.176  †0.188  †0.218  †0.231  †0.243  †0.266  †0.297  †0.326

    Type 2, re(X, X̂):
    c                    0.302   0.312   0.362   0.604   1.200   3.028   6.037   12.15   30.26
    k=1                  0.274   0.275   0.276   0.283   0.286   0.289   0.294   0.303   0.312
    k=2                  0.231   0.233   0.238   0.254   0.260   0.267   0.281   0.299  †0.317
    k=3                  0.207   0.210  †0.218  †0.240  †0.249  †0.258  †0.276  †0.300   0.324
    k=4                 †0.193  †0.196   0.205   0.231   0.241   0.252   0.272   0.299   0.325
    k=5                  0.182   0.186   0.196   0.224   0.235   0.246   0.269   0.297   0.325
    k=6                  0.172   0.176   0.187   0.218   0.231   0.242   0.266   0.297   0.326

    Type 3, re(X, X̂):
    c                    0.302   0.312   0.362   0.604   1.200   3.028   6.037   12.15   30.26
    k=1                  0.276   0.276   0.279   0.286   0.289   0.292   0.298   0.308   0.317
    k=2                  0.233   0.235   0.240   0.256   0.263   0.271   0.284   0.302  †0.321
    k=3                  0.208   0.210  †0.218  †0.239  †0.249  †0.259  †0.276  †0.300   0.323
    k=4                 †0.193  †0.196   0.206   0.230   0.242   0.253   0.272   0.298   0.325
    k=5                  0.183   0.186   0.196   0.223   0.236   0.248   0.269   0.297   0.325
    k=6                  0.172   0.176   0.188   0.217   0.231   0.243   0.266   0.296   0.326

Table 3.2: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the ADULT data set.

Note that in all three cases above, we assume the noise is generated from a Gaussian distribution with a zero mean vector. This assumption is generally true in privacy preserving data mining applications, as a change of mean would significantly affect the accuracy of the data mining results.
From the discussion in Section 3.4, for case 1 we have

||E||_F ≈ √(σ² m n),

while for both case 2 and case 3,

||E||_F ≈ √( c (σ₁² + σ₂² + · · · + σ_n²) m ).

Hence, we can derive

c ≈ ||E||_F² / ( (σ₁² + σ₂² + · · · + σ_n²) m ).        (3.39)

In the following experiments, we perturb the original data with different levels of noise, generated by varying the covariance matrix Σ_E, for all three cases (for cases 2 and 3, we derive the corresponding c from the given ||E||_F using Equation 3.39). For each perturbed data set, we use our spectral filtering technique to reconstruct the point-wise data. We also show how the reconstruction accuracy is affected by varying k. Table 3.1 shows all experimental results.

3.7.2 Effect of Varying the Number of Principal Components

In Section 3.3, we presented a heuristic for determining k by examining the eigenvalues of the covariance matrix of the perturbed data and the eigenvalues of the covariance matrix of the noise. It is easy to see that different values of k lead to different reconstruction errors (measured by ||X − X̂||_F / ||X||_F). From Table 3.1, we can see that our spectral filtering method achieves optimal results for relatively small perturbations under both Case 1 and Case 2 (we explain why Case 3 is different in Section 3.7.4). Note that the values in bold font highlight the results achieved by our algorithm, while the values with * denote the optimal results. When we examine the original data, there exist 4 principal components, as the data is highly correlated among the 35 features. Hence, for relatively small perturbations, the effects of noise on the remaining 31 components are safely filtered out in both Case 1 and Case 2. However, when we increase the noise level (i.e., ||E||_F increases), the noise tends to affect the determination of k. This is because the gain from correctly including a less significant principal component is outweighed by the additional noise admitted along with that component.
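The approximations above are easy to check numerically. The following sketch (pure Python, with hypothetical dimensions and per-attribute variances rather than the PATTERNS data) generates Gaussian noise and recovers the scale factor c via Equation 3.39:

```python
import math
import random

def frobenius(M):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(v * v for row in M for v in row))

random.seed(0)
m, n, sigma = 400, 30, 0.5

# Case 1: i.i.d. noise of variance sigma^2, so ||E||_F ~ sqrt(sigma^2 * m * n)
E = [[random.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]
predicted = math.sqrt(sigma * sigma * m * n)
assert abs(frobenius(E) - predicted) / predicted < 0.05

# Cases 2/3: column j of the noise has variance c * sigma_j^2;
# Equation (3.39) recovers c from ||E||_F and the attribute variances.
sigmas_sq = [random.uniform(0.5, 2.0) for _ in range(n)]  # hypothetical variances
c_true = 0.25
E2 = [[random.gauss(0.0, math.sqrt(c_true * sigmas_sq[j])) for j in range(n)]
      for _ in range(m)]
c_hat = frobenius(E2) ** 2 / (sum(sigmas_sq) * m)
assert abs(c_hat - c_true) / c_true < 0.1
```

The recovered c_hat concentrates around the true scale factor as mn grows, which is why Equation 3.39 can be used to calibrate the noise level in the experiments.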
Figure 3.6 shows the reconstruction of the original sinusoidal data (attribute 2) with varying k when σ² = 0.5 under Case 1. When we choose k = 4, the filtering provides an accurate estimate of the individual data, while the reconstruction accuracy is poor when we choose k = 1.

[Figure 3.6: Reconstruction accuracy (data distribution for attribute 2) vs. varying k with σ² = 0.5. Panels (a) k = 1, (b) k = 4, and (c) k = 5 each plot the original, perturbed, and estimated values of attribute V2 over 300 instances.]

Table 3.2 shows our experimental results on the relative error re(X, X̂) for the three scenarios (Type 1, 2, and 3 noise) on the Adult data set. We have observations similar to those on the Patterns data set. For example, Strategy 2 always achieves optimal estimates for the i.i.d. noise (Type 1), while Strategy 1 usually incurs more inaccuracy, since it tends to include all major components (6 in this data set) without considering the side effect incurred by the inclusion of noise. We can also observe that, in general, the more noise we add, the greater the reconstruction error. This observation holds across all three types of noise.

3.7.3 Effect of Varying Noise

In the next experiments, we vary the variance of the added noise from 0.213 (E1) to 4.814 (E9), as shown in Table 3.1. We denote the values with * as the results following Strategy 2 and the values with † as the results following Strategy 1. For each noisy data set, we also show all the relative reconstruction errors obtained by varying k. The values in bold font highlight the best results achieved by varying k. From Table 3.1, we can see that our Strategy 2 achieves optimal results for all perturbations from E1 to E9.
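The reconstruction step itself is straightforward once k is fixed: center the perturbed records, project them onto the top-k eigenvectors of their sample covariance matrix, and map back. The sketch below is a toy two-attribute, rank-one instance (so the eigenproblem has a closed form), not the actual multi-attribute experiments; it illustrates why keeping only the dominant component filters most of the i.i.d. noise:

```python
import math
import random

random.seed(1)
n = 2000
# two highly correlated attributes (an essentially rank-one signal)
t = [random.uniform(-1.0, 1.0) for _ in range(n)]
X = [[ti, 2.0 * ti] for ti in t]                    # records as rows
sigma = 0.3
Y = [[x + random.gauss(0.0, sigma) for x in row] for row in X]

# sample covariance of the perturbed data (2x2)
mean = [sum(r[j] for r in Y) / n for j in (0, 1)]
C = [[sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in Y) / (n - 1)
      for b in (0, 1)] for a in (0, 1)]

# largest eigenvalue/eigenvector of a symmetric 2x2 matrix, in closed form
tr, det = C[0][0] + C[1][1], C[0][0] * C[1][1] - C[0][1] * C[1][0]
lam = tr / 2 + math.sqrt(tr * tr / 4 - det)
v = (C[0][1], lam - C[0][0])
nv = math.hypot(v[0], v[1])
v = (v[0] / nv, v[1] / nv)

# keep k = 1 principal component: project each perturbed record onto v
Xhat = []
for r in Y:
    s = (r[0] - mean[0]) * v[0] + (r[1] - mean[1]) * v[1]
    Xhat.append([mean[0] + s * v[0], mean[1] + s * v[1]])

def rel_err(A, B):
    num = math.sqrt(sum((a - b) ** 2 for ra, rb in zip(A, B)
                        for a, b in zip(ra, rb)))
    den = math.sqrt(sum(a * a for ra in A for a in ra))
    return num / den

# filtering brings the estimate closer to X than the raw perturbed data
assert rel_err(X, Xhat) < rel_err(X, Y)
```

Only the noise component orthogonal to the retained eigenvector is removed, which is exactly why the choice of k (Strategy 1 vs. Strategy 2) matters in the tables above.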
Strategy 2 matches the best results, while Strategy 1 suffers when relatively large perturbations are introduced. Spectral Filtering with Strategy 1 always includes all 6 principal components in the projection space across all 9 noisy data sets. In contrast, Strategy 2 compares the magnitude of the principal components with the magnitude of the added noise to determine k. For example, the best k value for noise E4 is 4, as shown in Table 3.1. The reason is that the magnitude of the last two principal components is not as significant as that of the noise projected along the corresponding components. Hence, the gain from including the last two (not very significant) principal components is outweighed by the loss due to the noise projected on those components.

The quality of the data reconstruction depends upon the relative noise contained in the perturbed data. As the noise added to the actual values increases, the reconstruction accuracy decreases. Figure 3.7 shows point-wise data distributions of the reconstruction for feature 2 (we take a sample of 300 data records) when we vary the noise level. We can see that when the noise-to-signal ratio ||E||_F/||X||_F is 0.628 (the corresponding variance is σ² = 0.213), the Spectral Filtering technique achieves a relatively accurate estimation, because the effects of the noise projection on the remaining 29 components are safely filtered out. When we increase the noise-to-signal ratio to 1.366 (the corresponding noise variance is σ² = 1.007), the reconstruction accuracy decreases, as shown in Figure 3.7(b).

[Figure 3.7: Reconstruction accuracy (point-wise data distribution for attribute 2) with the best k vs. varying noise magnitude. Panels (a) ||E||_F/||X||_F = 0.628 and (b) ||E||_F/||X||_F = 1.366 each plot the original, perturbed, and estimated values over 300 instances.]

The reasons are two-fold. First, much larger noise exists in the projection space. Second, information contained in those principal components excluded from the projection space is lost, since large noise tends to affect the determination of k.

3.7.4 Effect of the Covariance Matrix of the Noise

From Table 3.1, we can see that the spectral filtering method generally cannot achieve good results for Case 3, where the noise covariance matrix is linear in the data covariance matrix. As the noise is not randomly generated, the spectral filtering technique, which is based on random matrix theory, cannot satisfactorily separate the noise from the data, since they share the same distribution pattern. Figure 3.8 compares the reconstruction accuracy for attribute 2 with the same ||E||_F = 323 under the three cases. We can see that spectral filtering performs best for completely random perturbation (Case 1) and worst for completely correlated perturbation (Case 3). We would point out that noise which cannot be separated from the original data may also significantly affect the accuracy of data mining.

[Figure 3.8: Reconstruction accuracy (data distribution for attribute 2) with ||E|| = 323 under the three cases. Panels (a) Case 1, (b) Case 2, and (c) Case 3 each plot the original, perturbed, and estimated values over 300 instances.]

3.7.5 Utility

To measure the utility, we apply the universal information loss I(f_X, f̂_X). To derive the density distribution f_X(x) of the original data and f̂_X(x) of the corresponding reconstructed data, we equally divide each dimension into 5 bins and compare the multidimensional histograms based on the frequency information contained in the resulting 5⁶ six-dimensional bins.
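The histogram comparison can be sketched as follows. The exact definition of I(f_X, f̂_X) is given earlier in the dissertation and is not reproduced here; for illustration the sketch uses half the L1 distance between the two normalized histograms (a common information-loss surrogate), with 3 dimensions and 5 bins per axis instead of the 6 dimensions used in the experiments:

```python
import random
from collections import Counter

def hist(data, bins=5, lo=-1.0, hi=1.0):
    """Multidimensional histogram: each axis equally divided into `bins` bins."""
    h = Counter()
    w = (hi - lo) / bins
    for rec in data:
        key = tuple(min(bins - 1, max(0, int((x - lo) / w))) for x in rec)
        h[key] += 1
    n = len(data)
    return {k: c / n for k, c in h.items()}

def info_loss(fx, fx_hat):
    # illustrative surrogate: half the L1 distance between the
    # (normalized) multidimensional histograms, in [0, 1]
    keys = set(fx) | set(fx_hat)
    return 0.5 * sum(abs(fx.get(k, 0.0) - fx_hat.get(k, 0.0)) for k in keys)

random.seed(2)
orig = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5000)]
recon = [[min(1.0, max(-1.0, x + random.gauss(0.0, 0.1))) for x in rec]
         for rec in orig]
loss = info_loss(hist(orig), hist(recon))
assert 0.0 <= loss <= 1.0
```

A loss of 0 means the two histograms coincide bin by bin; a loss of 1 means they place their mass in disjoint bins.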
Table 3.3 shows our results on the utility loss of the reconstructed Adult data with different levels of Type 1 noise (E1 to E9). From Table 3.3 and Figure 3.9 (where we increase the magnitude of the noise), we can observe that Strategy 2 always outperforms Strategy 1 in terms of preserving utility. Another observation is that, in general, the greater the magnitude of the noise, the less utility we can preserve.

Table 3.3: Utility of the reconstructed Adult data with Type 1 noise.

noise             E1      E2      E3      E4      E5      E6      E7      E8      E9
||E||F/||X||F     0.172   0.176   0.188   0.218   0.231   0.243   0.266   0.297   0.326

Utility loss I(f_X, f̂_X):
k=1               0.137   0.137   0.152   0.309   0.089   0.137   0.306   0.152   0.088
k=2               0.292   0.232   0.233   0.078   0.011   0.292   0.152   0.297  *0.156
k=3               0.261   0.270   0.084   0.051  *0.221  *0.196  *0.143  *0.289   0.226
k=4               0.259   0.213  *0.082  *0.050   0.125   0.170   0.141   0.125   0.213
k=5               0.254  *0.208   0.077   0.045   0.116   0.164   0.137   0.119   0.086
k=6             *†0.515  †0.501  †0.521  †0.512  †0.499  †0.466  †0.466  †0.462  †0.479

[Figure 3.9: Information loss vs. the strength of the noise (||E||F/||X||F) with Type 1 noise, for Strategy 1 and Strategy 2.]

To evaluate how the different types of noise (Type 1, 2, and 3) affect the utility of the reconstruction, we show one result on the relationship between the utility and the three types of noise. We can observe that the spectral-based reconstruction method best preserves the utility with Type 1 noise (i.i.d.), while it incurs the largest utility loss with Type 3 noise (completely correlated). This is because the completely correlated noise cannot be well filtered out by the spectral-based reconstruction method, although some statistical properties (e.g., the covariance matrix) can be fully preserved.

3.7.6 Lower Bound vs. Privacy Threshold

This experiment illustrates how to add noise that satisfies a user's privacy threshold, using the lower bound derived in Section 3.5.
In order to preserve privacy, data owners need to ensure that ||X̂ − X||_F/||X||_F is greater than the privacy threshold τ specified by users. Figure 3.11(a) shows the relationship between the expected privacy threshold τ and the magnitude of noise that needs to be added. In general, the magnitude of the noise must be increased gradually as the privacy threshold τ increases. However, the relationship is not linear. For example, the relative noise strength ||E||_F/||X||_F remains unchanged when the expected privacy level is in the range (0.2, 0.4).

[Figure 3.10: Information loss vs. the strength of the noise (||E||F/||X||F) for the three types of noise (Type 1, 2, and 3).]

As we recall from Section 3.3, the variance of the added noise is Var(E) = λ_E/(m − 1), where the eigenvalue λ_E should satisfy λ_{X,k+1} < λ_E ≤ λ_{X,k} and k is determined by

k = max{ i | τ ≤ (σ²_{X,i+1} + · · · + σ²_{X,n}) / ||X||_F }.

In other words, the expected privacy level relies on the determination of k, which is influenced by the magnitude of the added noise. Although the relationship between the reconstruction bias and the perturbation can guide us in adding noise to the original data set, the lower bound gives data owners a worst-case security assurance, since it bounds from below the error of any matrix B of rank no greater than k that attackers may derive. Figure 3.11(b) shows the relationship between the privacy threshold specified by data owners and the real relative errors achieved by attackers using both Strategy 1 and Strategy 2. For example, when the privacy threshold is specified as 0.4, the real relative errors achieved by attackers using Strategy 1 and Strategy 2 are around 70% and 82%, respectively.

3.7.7 Evaluation of the IQR Attack

In this section, we explain the breach of individual privacy by the information mined from the perturbed data.
To depict such a breach, we use the measures defined in Section 3.6.1, which are based on the estimated distribution of the original data.

[Figure 3.11: Achieved reconstruction accuracy vs. varying privacy threshold τ. Panel (a) shows the expected relative error vs. the privacy threshold; panel (b) shows the relative error achieved by attackers using Strategy 1 and Strategy 2.]

Table 3.4: Stock/bonds from the Bank data set with Uniform noise [−125, 125], disclosure with 95% IQR; the information loss for AS is 14.6%.

Interval   no. of disclosed points (100%)           D
P %        direct   IQR ideal   IQR with AS    ideal    AS
35         13.9     21.2        3.5            0.605    0.663
40         16.0     32.5        15.1           0.66     0.698
45         17.9     43.0        29.6           0.712    0.746
50         19.8     52.9        41.8           0.763    0.796
55         22.0     62.9        53.2           0.814    0.844
60         23.9     72.9        63.4           0.864    0.889
65         26.0     83.3        73.5           0.916    0.932
70         28.0     94.3        83.7           0.972    0.977
75         29.9     99.9        94.5           0.999    0.999
80         32.0     100         100            1        1

In our experiments, the perturbed data is generated using both Uniform and Gaussian distributions. For the Uniform distribution, the random variable is generated from the range [−α, α] with mean 0. For the Gaussian distribution, the random variable is generated with zero mean and varying standard deviations. Please note that the spectral-filtering-based method (SF) only works with the Gaussian distribution, while Agrawal and Srikant's method (AS) generally works with any distribution. In Figure 3.12, we show a case in which we reconstructed the stock/bonds distribution from the Bank data set with the help of the AS algorithm. The perturbing distribution is the Uniform distribution [−125, 125].

[Figure 3.12: Reconstructed stock/bonds from the Bank data set using the AS algorithm; the noise follows the Uniform distribution [−125, 125]. Panels: (a) original density, (b) estimated density, (c) original, perturbed, and reconstructed densities using AS.]

Figures 3.12(a) and 3.12(b) show the density distributions of the original data and of the reconstructed data using the AS method, respectively.
Figure 3.13 shows how the number of fully disclosed points varies with the individuals' privacy intervals. Since we do not have a real privacy interval for each individual datum in the original data set, we generate the privacy interval [x_i(1 − P), x_i(1 + P)] for each value x_i of the stock/bonds column in the Bank data set. We experimented with P values ranging from 35% to 80%. The confidence threshold for the IQR is taken to be 95% in all our experiments. Recall that an individual datum is said to be fully disclosed if it can be estimated with 95% confidence that the IQR interval [x1, x2] covers the original value x_i and falls within its privacy interval [w_i^l, w_i^u]. In this figure, the curve labeled Direct shows the percentage of points whose perturbed value completely lies in the individual's privacy interval for each P. We can see that this percentage is less than 0.3 for all P values. In other words, the perturbation seems to successfully preserve the other 70% of individuals' privacy. However, the number of fully disclosed points is actually much larger when we apply IQR inference.

[Figure 3.13: Disclosure of the Bank distribution with Uniform noise (AS algorithm): number of completely disclosed points vs. interval size P (%), for Direct, IQR ideal, and IQR with AS.]

Table 3.5: Sinusoidal data with Gaussian noise (0, 8) using the AS and SF methods.

Interval   no. of disclosed points (100%)                       avg. D
P %        direct   IQR ideal   IQR with AS   IQR with SF   ideal    AS      SF
60         46.6     69.3        62.2          6             0.838    0.845   0.699
65         50.1     72.2        65.0          22.8          0.855    0.856   0.735
70         53.3     72.4        67.7          61.5          0.871    0.866   0.875
75         56.4     75.3        70.5          64.1          0.888    0.877   0.882
80         59.2     78.4        73.2          66.7          0.906    0.889   0.889
85         62.1     81.6        76.0          69.2          0.926    0.900   0.897
90         64.6     90.2        78.9          71.8          0.950    0.913   0.905
95         66.9     100         82.1          74.4          1        0.926   0.913
100        69.2     100         85.6          77.0          1        0.94    0.922
For example, when P = 0.6, 63.4% of the individual data is fully disclosed by using the IQR. In comparison, only 23.9% of the individuals' perturbed data lies in the original privacy interval. In other words, 39.5% more of the individuals' data was fully disclosed using IQR inference. In this figure, we also show the number of fully disclosed points when we apply IQR inference on the original distribution. This represents the ideal case, in which the original distribution can be reconstructed with 100% accuracy. We can see that the more accurate the reconstructed distribution, the more individuals are fully disclosed.

[Figure 3.14: Disclosure analysis on the sinusoidal data with Gaussian noise (0, 8) using the AS and SF methods: number of completely disclosed points vs. interval size P (%), for Direct, IQR ideal, IQR with AS, and IQR with SF.]

Table 3.4 shows details on the number of disclosed points (direct, IQR ideal, IQR using AS) with varying privacy interval sizes (determined by P). We can see that the number of fully disclosed points in all three cases increases as P increases. Table 3.4 also shows how the average disclosure D varies with P. Recall that D in general measures how close the IQR obtained by attackers or snoopers is to the individual's privacy interval. We can see that D increases as P increases, which indicates that more of the individuals' private information is disclosed. The information loss I in terms of distribution incurred by the AS algorithm is 14.6%, which suggests the AS algorithm can closely reconstruct the original data.

Since the spectral-filtering-based method in principle only works with the Gaussian distribution, we use the second data set and introduce a relatively large Gaussian noise in this experiment. Figure 3.14 and Table 3.5 show our results on the sinusoidal signal. We perturbed the data using Gaussian noise with mean 0 and standard deviation 8. The IQRs obtained from AS, SF, and the original distribution are [1.6, 4.2], [1.2, 4.5], and [2.05, 3.9], respectively.
We can see that IQR inference on the reconstructed distributions (AS and SF) can disclose more individuals' private information than is (falsely) assumed by the perturbation methods. We also show the percentage of fully disclosed points and D under the ideal case (where the exact distribution is assumed to be reconstructed), the AS case, and the SF case. This experiment shows that the SF method generally does not do as well as AS when the noise level is large. However, even the IQR based on distributions reconstructed by SF can disclose more individual information than is falsely assumed by data owners. The information loss in terms of distribution incurred by AS and SF is 32.9% and 47.0%, respectively. Since the SF method is claimed to be able to reconstruct individual data, the corresponding levels of information loss in terms of individual data are ae = ||X − X̃||_F = 200.2 and re = ||X − X̃||_F / ||X||_F = 0.375, respectively.

[Figure 3.15: Stock/bonds of the Bank data set perturbed using the Uniform distribution. Panel (a): number of fully disclosed points (Direct and IQR with AS) vs. noise level; panel (b): average disclosure D vs. noise level.]

Figure 3.15 shows how the disclosure varies with increasing noise for the stock/bonds variable in the Bank data set. We perturb this data set using various Uniform distributions with ranges from 150 to 400. In Figure 3.15(a), we plot the number of fully disclosed points (using Direct and IQR with AS) against the increasing deviation of the perturbing distribution, while in Figure 3.15(b) we plot the average disclosure D against the increasing deviation of the perturbing distribution. We can see that the greater the perturbation, the greater the loss of information. The number of fully disclosed points using IQR decreases as the information loss increases.
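The disclosure counting used throughout these experiments can be sketched with synthetic data (not the Bank data). The sketch below uses one global 95% interval taken from a noisily reconstructed sample, whereas the attack in this section derives the interval from the reconstructed distribution; it follows the reading that a point is fully disclosed when the attacker's interval covers the original value and falls inside the individual's privacy interval:

```python
import random

random.seed(3)
xs = [random.uniform(50.0, 150.0) for _ in range(2000)]   # synthetic column
P = 0.6                                                   # privacy interval size

# attacker's interval: central 95% range of a (noisy) reconstructed sample
recon = sorted(x + random.gauss(0.0, 2.0) for x in xs)
x1 = recon[int(0.025 * len(recon))]
x2 = recon[int(0.975 * len(recon))]

disclosed = 0
for x in xs:
    wl, wu = x * (1 - P), x * (1 + P)       # privacy interval [w_l, w_u]
    if wl <= x1 <= x <= x2 <= wu:           # interval covers x, inside [wl, wu]
        disclosed += 1
frac = disclosed / len(xs)
assert 0.0 <= frac <= 1.0
```

Widening the privacy intervals (larger P) makes it easier for the attacker's interval to fall inside them, which matches the increasing disclosure counts in Tables 3.4 and 3.5.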
3.8 Summary

Additive randomization has been a primary tool for hiding sensitive private information in privacy preserving data mining. Previous work based on the Spectral Filtering technique empirically showed that individual data can be separated from the perturbed data and, as a result, privacy can be seriously compromised. In this chapter, we conducted a theoretical study of how the estimation error varies with the additive noise. In particular, before conducting our bound analysis, we proposed a new strategy to determine the principal components in the spectral-filtering-based method. Our strategy is shown to be more effective than the previous one in terms of reconstruction error. To bound the reconstruction error, we first derived an upper bound for the Frobenius norm of the reconstruction error using matrix perturbation theory. This upper bound may be exploited by attackers to determine how close their spectral-filtering-based estimates are to the original data, which poses a serious threat of privacy breach. We then proposed a Singular Value Decomposition (SVD) based reconstruction method and derived a lower bound for its reconstruction error. We then proved the equivalence between the Spectral Filtering based approach and the proposed SVD approach; as a result, the achieved lower bound can also be considered a lower bound for the Spectral Filtering based approach. This lower bound can help data owners determine how much noise should be added to satisfy a given threshold of tolerated privacy breach.

In this chapter, we also discussed how a possible IQR-based attack can threaten data providers' privacy. We considered a scenario in which each data provider has specified his or her own privacy interval, and we defined corresponding privacy-quantification methods. Our experimental results showed the impact of different reconstruction methods on individual privacy.
CHAPTER 4: DISCLOSURE ANALYSIS OF THE PROJECTION-BASED PERTURBATION

In this chapter, our focus is on the projection-based approaches, which can be further classified as distance-preserving-based and non-distance-preserving-based. Distance-preserving-projection-based perturbation can mitigate the privacy/accuracy tradeoff by achieving perfect data mining accuracy. Since the transformation matrix R is required to be orthonormal (i.e., RRᵀ = RᵀR = I), geometric properties (vector lengths, inner products, and the distances between pairs of vectors) are strictly preserved. Hence, data mining on the rotated data can achieve perfect accuracy. A known-sample-based PCA attack was recently investigated to show the vulnerability of this distance-preserving-based projection approach when a sample data set is available to attackers [Liu, Giannella and Kargupta 2006]. As a result, it was suggested that non-distance-preserving-based projection be applied instead, since it is resilient to the known-sample-based PCA attack, at the cost of some data mining accuracy. However, one important issue is whether this approach is also subject to other specific attacks. Intuitively, one might think that Independent Component Analysis (ICA) could be applied to breach the privacy. It was argued in [Chen and Liu 2005; Liu, Kargupta and Ryan 2006] that ICA is in general not effective in practice, due to two basic difficulties in applying ICA directly to the projection-based perturbation. First, there are usually significant correlations among the attributes of X. Second, more than one attribute may have a Gaussian distribution. To explore the vulnerability of this approach, we proposed an A-priori-Knowledge ICA (AK-ICA) reconstruction method [Guo and Wu 2007], which may be exploited by attackers when a small subset of sample data is available to them.
The theoretical analysis and empirical evaluation show that AK-ICA can effectively recover the original data with high precision when a part of the sample data is known a priori by attackers. Since the proposed technique is quite robust to additive Gaussian noise and to the choice of transformation matrix, even with a small subset of sample data, it poses a serious concern for all previous randomization-based privacy preserving data mining methods. It suggests that all previous projection-based approaches may no longer be secure for preserving privacy when a part of the sample data is known a priori by attackers.

The rest of this chapter is organized as follows. In Section 4.1 we introduce various projection-based perturbation models, which can be classified as distance-preserving-based projection and non-distance-preserving-based projection. To integrate all the existing projection models, a general-linear-transformation-based perturbation model is also introduced in this section. Potential attacks on projection-based perturbations are discussed in Sections 4.2 and 4.3. In Section 4.2, we discuss the direct ICA attack and its drawbacks. A series of known-sample-based attacks is introduced in Section 4.3; our proposed attack, AK-ICA, as one of the effective attacks, is emphasized in this part. We also provide experimental results in Section 4.4 to show the performance of AK-ICA and compare it with other attacking methods. We offer our concluding remarks in Section 4.5.

4.1 Projection-Based Perturbation Models

This section offers an overview of projection-based perturbation. We divide the various forms of projection into two categories: distance-preserving-based projection and non-distance-preserving-based projection. We fit all existing projection-based perturbations into these categories. Furthermore, a general-linear-transformation-based perturbation model is proposed to incorporate all of the above models.
4.1.1 Distance-Preserving-Based Projection

In [Chen and Liu 2005], the authors defined a Rotation-Based Perturbation Model, i.e., Y = RX, where R is a d × d orthonormal matrix satisfying RᵀR = RRᵀ = I.

Example 2. Following the previous example, we still use the matrix X to express the original data. The only difference here is that each column of X represents one customer's record. The transformation matrix R in this example is a 3 × 3 random orthonormal matrix. We then obtain the perturbed data Y, whose individual record in each column is quite different from the corresponding one in the original data, so privacy is expected to be well preserved.

Y = RX
  = (  0.333   0.667   0.667 ) ( 10   15   50   45  ...  80 )
    ( -0.667   0.667  -0.333 ) ( 85   70  120   23  ... 110 )
    ( -0.667  -0.333   0.667 ) (  2   18   35  134  ...  15 )
  = (  61.33   63.67  110.00  119.67  ...  63.33 )
    (  49.33   30.67   55.00  -59.33  ... -31.67 )
    ( -33.67  -21.33  -30.00   51.67  ... -51.67 )

The key features of the rotation transformation are that it preserves vector lengths, Euclidean distances, and inner products between any pair of points, as shown below:

|Rx| = |x|
|R(x_i − x_j)| = |x_i − x_j|
⟨Rx_i, Rx_j⟩ = ⟨x_i, x_j⟩        (4.1)

where |x| = √(xᵀx) denotes the length of a vector x, and ⟨x_i, x_j⟩ = x_iᵀ x_j denotes the inner product of two vectors x_i and x_j. Intuitively, geometric patterns such as cluster shapes, hyperplanes, and hyper-curved surfaces in the multidimensional space will therefore be preserved, as the following example shows.

Example 3. Figure 4.1 demonstrates a perturbation of a data set with two dimensions. From the original data set, two clusters can be clearly identified. After the projection with the orthonormal matrix

R = (  0.866   0.5   )
    ( -0.5     0.866 )

those two clusters can still be easily identified, due to the preserved properties.
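These invariants are easy to verify numerically. The sketch below builds a random orthonormal matrix by Gram-Schmidt (a generic construction, not the specific R of Example 2) and checks that vector lengths and inner products survive the rotation:

```python
import math
import random

random.seed(4)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def random_orthonormal(d):
    """Random d x d orthonormal matrix via Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < d:
        v = [random.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:                           # remove components along basis
            c = dot(v, b)
            v = [x - c * y for x, y in zip(v, b)]
        n = math.sqrt(dot(v, v))
        if n > 1e-8:
            basis.append([x / n for x in v])
    return basis

R = random_orthonormal(3)
x1, x2 = [10.0, 85.0, 2.0], [15.0, 70.0, 18.0]    # two records from Example 2
y1 = [dot(row, x1) for row in R]
y2 = [dot(row, x2) for row in R]

# |Rx| = |x| and <Rx1, Rx2> = <x1, x2>, up to floating-point error
assert abs(math.sqrt(dot(y1, y1)) - math.sqrt(dot(x1, x1))) < 1e-6
assert abs(dot(y1, y2) - dot(x1, x2)) < 1e-6
```

Because distances are a function of lengths and inner products, distance-based classifiers and clustering algorithms behave identically on X and Y = RX, which is exactly the "perfect accuracy" property cited above.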
[Figure 4.1: Example of rotation-based perturbation: (a) before the rotation (axes X1, X2), (b) after the rotation (axes Y1, Y2).]

It was proved in [Chen and Liu 2005] that three popular classifiers (the kernel method, SVM, and hyperplane-based classifiers) are invariant to the rotation-based perturbation, due to the preserved geometric properties.

The authors in [Oliveira and Zaiane 2004] defined a Rotation-Based Data Perturbation Function that distorts the attribute values of a given data matrix to preserve the privacy of individuals. In their approach, the attributes are processed pair by pair. For each pair of selected attributes, the transformation matrix R_p is a 2 × 2 orthonormal matrix of the form

R_p = (  cos θ   sin θ )
      ( -sin θ   cos θ )

Their perturbation scheme can be expressed as Y = RX, where R is a d × d matrix in which each row or column has only two nonzero elements, namely the elements of the corresponding R_p, e.g.,

R = (  cos θ1     0      sin θ1     0     ... )
    (    0      cos θ2     0      sin θ2  ... )
    ( -sin θ1     0      cos θ1     0     ... )
    (    0     -sin θ2     0      cos θ2  ... )
    (   ...      ...      ...      ...    ... )

It is easy to see that the perturbation matrix R here is an orthonormal matrix when there is an even number of attributes. When the number of attributes is odd, according to their scheme, the remaining attribute is distorted along with any previously distorted attribute, as long as a certain condition is satisfied.

4.1.2 Non-Distance-Preserving-Based Projection

In the distance-preserving-based projection model, the transformation matrix R has to be orthonormal in order to preserve distances and other geometric properties. In non-distance-preserving-based projection, however, no such constraint is imposed on the transformation matrix. Without loss of generality, we consider R to be an arbitrary random matrix.

Example 4. The original data again come from Table 1.1. Unlike in Example 2, each entry of the transformation matrix is randomly generated from a uniform distribution.
Y = RX
  = ( 4.751  2.429  2.282 ) ( 10   15   50   45  ...  80 )
    ( 1.156  4.457  0.093 ) ( 85   70  120   23  ... 110 )
    ( 3.034  3.811  4.107 ) (  2   18   35  134  ...  15 )
  = (  43.63   63.25  167.65  204.83  ... 220.68 )
    ( 158.16  178.32  421.95  375.55  ... 526.70 )
    ( 322.43  307.81  570.05  414.17  ... 536.23 )

Since the transformation matrix does not have to be orthonormal, distances as well as other geometric properties might no longer be preserved. For the first customer, the vector length of his/her original data is 85.61; however, the vector length of the corresponding perturbed data is changed to 361.77.

The authors in [Liu, Kargupta and Ryan 2006] proposed a Random Projection-Based Perturbation Model and applied it to privacy preserving distributed data mining. The random matrix R_{d×d} is generated such that each entry r_{i,j} of R is independently and identically chosen from a normal distribution with mean zero and variance σ_r². Thus, the following property of the projection matrix is achieved:

E[RᵀR] = d σ_r² I

If two data sets X1 and X2 are perturbed as Y1 = (1/(√d σ_r)) R X1 and Y2 = (1/(√d σ_r)) R X2, respectively, then the inner product of the original data sets is preserved from the statistical point of view:

E[Y1ᵀ Y2] = X1ᵀ X2

Such a model can also be extended to the case where R is a k × m transformation matrix.

To avoid the weakness of the rotation-based projection, the authors in [Chen and Liu 2007] gave an enhanced geometric perturbation model:

Y = RX + Ψ + ∆

In this model, two additional components are added: a random translation matrix Ψ and a noise matrix ∆. Ψ is defined as Ψ = t1ᵀ, where t = [t1, t2, · · ·, td]ᵀ, 0 ≤ ti < 1, and 1 = [1, 1, · · ·, 1]ᵀ. ∆ is defined as ∆ = [δ1, δ2, · · ·, δN], where each vector δi is a d-dimensional i.i.d. Gaussian variable. Because of these two additional components, the rotation center is shifted, and geometric properties such as pairwise distances, vector lengths, and inner products might not be preserved.
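The statistical preservation of inner products can be checked with a small simulation. In the sketch below (hypothetical 3-dimensional records and a k × d projection variant of the model), the scaled projection y = (1/(√k σ_r)) R x preserves x1ᵀx2 on average over random draws of R:

```python
import math
import random

random.seed(5)
x1 = [1.0, 2.0, -1.5]
x2 = [0.5, -1.0, 2.0]
d, k, sigma_r, trials = 3, 40, 1.0, 50

def project(R, x, k, sigma_r):
    # y = (1 / (sqrt(k) * sigma_r)) * R x
    s = 1.0 / (math.sqrt(k) * sigma_r)
    return [s * sum(row[j] * x[j] for j in range(len(x))) for row in R]

acc = 0.0
for _ in range(trials):
    R = [[random.gauss(0.0, sigma_r) for _ in range(d)] for _ in range(k)]
    y1 = project(R, x1, k, sigma_r)
    y2 = project(R, x2, k, sigma_r)
    acc += sum(a * b for a, b in zip(y1, y2))
acc /= trials

true_ip = sum(a * b for a, b in zip(x1, x2))     # the original inner product
assert abs(acc - true_ip) < 1.0                  # preserved in expectation
```

Unlike the rotation model, any single draw of R only preserves inner products approximately; the guarantee E[Y1ᵀY2] = X1ᵀX2 holds in expectation, with the spread shrinking as k grows.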
The authors in [Chen and Liu 2007] also investigated possible attacks on this model, and analyzed a linear-regression-based attack in which attackers know some points in the original data, as well as the right mapping in the perturbed one.

4.1.3 The General-Linear-Transformation-Based Perturbation

The general-linear-transformation-based perturbation model can be described by

Y = RX + E        (4.2)

where X ∈ R^{p×n} is the original data set consisting of n data records and p attributes, Y ∈ R^{q×n} is the transformed data set consisting of n data records and q attributes, R is a q × p transformation matrix, and E ∈ R^{q×n} is a q × n noise matrix. In this chapter, we shall assume for convenience that R is a square matrix of dimension d (q = p = d). We also assume the additive noise E is independent of the data X. This assumption holds for almost all existing additive-noise-based perturbation methods. The only exception is that Huang et al. [Huang, Du and Chen 2005] proposed a modified random perturbation in which the random noise is correlated with the original data, in order to defeat PCA-based reconstruction methods. However, we argue that adding correlated noise to the original data may significantly affect the accuracy of the data mining results.

Example 5. The transformation matrix R is chosen randomly as in Example 4. In addition, we introduce some random noise E after the projection to perturb the data further.

Y = RX + E
  = ( 4.751  2.429  2.282 ) ( 10   15   50   45  ...  80 )   ( 7.334  4.199  9.199  6.208  ...  9.048 )
    ( 1.156  4.457  0.093 ) ( 85   70  120   23  ... 110 ) + ( 3.759  7.537  8.447  7.313  ... 15.692 )
    ( 3.034  3.811  4.107 ) (  2   18   35  134  ...  15 )   ( 0.099  7.939  3.678  1.939  ...  6.318 )
  = ( 265.87  286.57  618.10  581.66  ... 690.55 )
    ( 394.35  338.54  604.34  174.31  ... 599.84 )
    ( 362.59  394.15  756.44  776.46  ... 729.85 )

The goal of this general-linear-transformation-based perturbation is to release Y for data mining while preventing attackers from deriving X.
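As a concreteness check on Example 5, the first perturbed record can be reproduced directly from the displayed R, the first original record x1 = (10, 85, 2)ᵀ, and the first noise column:

```python
# first column of Y in Example 5: y1 = R x1 + e1
R = [[4.751, 2.429, 2.282],
     [1.156, 4.457, 0.093],
     [3.034, 3.811, 4.107]]
x1 = [10.0, 85.0, 2.0]
e1 = [7.334, 3.759, 0.099]

y1 = [round(sum(R[i][j] * x1[j] for j in range(3)) + e1[i], 2) for i in range(3)]
assert y1 == [265.87, 394.35, 362.59]
```

The same computation applied column by column reproduces the rest of the displayed Y.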
It combines the projection-based and additive-noise-based approaches, and all of the previous perturbation methods are special cases of this general linear transformation model.

4.2 Direct Attack

Intuitively, one might think that Independent Component Analysis (ICA) could be applied to breach the privacy. When the original data is a collection of independent signals, all of which, with the exception of at most one, are non-Gaussian, ICA can be used directly to break the privacy.

4.2.1 ICA Revisited

ICA is a statistical technique which aims to represent a set of random variables as linear combinations of statistically independent component variables.

Definition 4.5 (ICA model) [Hyvarinen, Karhunen and Oja 2001] ICA of a random vector x = (x_1, ..., x_m)^T consists of estimating the following generative model for the data:

x = As    or    X = AS

where the latent variables (components) s_i in the vector s = (s_1, ..., s_n)^T are assumed independent. The matrix A is a constant m × n mixing matrix.

The basic problem of ICA is to estimate both the mixing matrix A and the realizations of the independent components s_i using only observations of the mixtures x_j. The following three restrictions guarantee identifiability in the ICA model:

1. All the independent components s_i, with the possible exception of one component, must be non-Gaussian.
2. The number of observed linear mixtures m must be at least as large as the number of independent components n.
3. The matrix A must be of full column rank.

The second restriction, m ≥ n, is not completely necessary. Even in the case where m < n, the mixing matrix A is identifiable, whereas the realizations of the independent components are not, because of the noninvertibility of A. In this chapter, we make the conventional assumption that the dimension of the observed data equals the number of independent components, i.e., n = m = d.
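The estimation itself is typically carried out with fixed-point algorithms. The following self-contained sketch (a minimal symmetric FastICA with a tanh nonlinearity, shown only to make the model concrete; it is not the JADE algorithm used later in this chapter) recovers two non-Gaussian sources from their linear mixtures:

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    """Minimal symmetric FastICA for the model X = A S (X is m x N).

    Returns (A_hat, S_hat) with X ~= A_hat @ S_hat, up to the usual
    permutation/sign/scale indeterminacies of ICA.
    """
    m, N = X.shape
    # Center and whiten: Z = W_white (X - mean), with cov(Z) = I
    Xc = X - X.mean(axis=1, keepdims=True)
    eigval, E = np.linalg.eigh(Xc @ Xc.T / N)
    W_white = E @ np.diag(eigval ** -0.5) @ E.T
    Z = W_white @ Xc

    rng = np.random.default_rng(seed)
    W = rng.normal(size=(m, m))
    for _ in range(n_iter):
        # Fixed-point update: w <- E[z g(w'z)] - E[g'(w'z)] w, g = tanh
        G = np.tanh(W @ Z)
        W = (G @ Z.T) / N - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
        U, _, Vt = np.linalg.svd(W)       # symmetric decorrelation:
        W = U @ Vt                        # W <- (W W^T)^{-1/2} W
    S_hat = W @ Z
    A_hat = np.linalg.pinv(W @ W_white)
    return A_hat, S_hat

# Demo: two independent non-Gaussian sources, mixed by a full-rank A
rng = np.random.default_rng(1)
N = 5000
S = np.vstack([rng.uniform(-1, 1, N), rng.laplace(0, 1, N)])
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = A @ S
A_hat, S_hat = fastica(X)
# Each recovered component correlates strongly with one true source
corr = np.corrcoef(np.vstack([S, S_hat]))[:2, 2:]
print(np.round(np.abs(corr), 2))
```

Note how the demo respects the identifiability restrictions above: both sources (uniform, Laplace) are non-Gaussian and the 2 × 2 mixing matrix has full column rank.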
Please note that if m > n, the dimension of the observed vector can always be reduced so that m = n by existing methods such as PCA. The pair (A, S) is called a representation of X. Since X = AS = (AΛP)(P^{-1}Λ^{-1}S) for any diagonal matrix Λ (with nonzero diagonal entries) and permutation matrix P, X can never have a completely unique representation. The reason is that, both S and A being unknown, any scalar multiplier in one of the sources s_i could always be canceled by dividing the corresponding column a_i of A by the same scalar. As a consequence, one usually fixes the magnitudes of the independent components by assuming each s_i has unit variance; the matrix A is then adapted in the ICA solution methods to take this restriction into account. However, this still leaves the ambiguity of the sign: we could multiply an independent component by -1 without affecting the model. This ambiguity is insignificant in most applications.

4.2.2 Drawbacks of Direct ICA

It was argued in [Chen and Liu 2005; Liu, Kargupta and Ryan 2006] that ICA is in general not effective in breaking the rotation-based perturbation in practice, due to two basic difficulties in applying the ICA attack directly to the rotation-based perturbation. First, there are usually significant correlations among the attributes of X. Second, more than one attribute may have a Gaussian distribution. We would emphasize that these two difficulties generally hold in practice.

Example 6 We show the correlation matrix of a bank data set with five attributes below:

          | 1.000  0.631  0.756  0.501  0.218 |
          | 0.631  1.000  0.723  0.357  0.137 |
Corr(X) = | 0.756  0.723  1.000  0.237  0.112 |
          | 0.501  0.357  0.237  1.000  0.440 |
          | 0.218  0.137  0.112  0.440  1.000 |

From Example 6, we can observe that significant correlations exist among the attributes. To measure non-Gaussianity, we apply the classic kurtosis measure

Kurt(z) = [ Σ_{i=1}^{n} (z_i - z̄)^4 ] / [ (n - 1) σ^4 ] - 3

where z̄ is the mean, σ is the standard deviation, and n is the number of data points.
The kurtosis is based on fourth-order statistics, and the kurtosis of a standard normal distribution is zero. The kurtosis values of the 5 attributes in the bank data set are shown below:

Kurt(X) = ( -0.025   6.141   2.867   -0.025   170.593 )

It is easy to see that attributes 1 and 4 tend to be Gaussian distributed. One explanation why many attributes in practice have Gaussian distributions is that attributes in databases are usually a combination of some other hidden attributes. According to the Central Limit Theorem, the distribution of a sum of independent random variables tends towards a Gaussian distribution.

4.3 Sample-Based Attack

Different customers may have different requirements and concerns regarding their individual data, and some customers may have very few privacy concerns. In practice, this gives attackers a chance to collect some individual data and launch attacks based on the knowledge of those data samples. In this section, we introduce several potential attacks under the scenario in which a small data sample drawn from the same population as X, denoted X̃, is available to attackers.

4.3.1 Attacks for Distance-Preserving-Based Projection

In the distance-preserving projection model, the transformation matrix is orthonormal: R^T R = R R^T = I. It seems that privacy is well preserved after rotation; however, a small known sample may be exploited by attackers to breach privacy completely.

Known-Sample-Based Regression Attack

Let us consider a case where X ∩ X̃ = X‡ ≠ ∅. This indicates that a subset of the original data is already known by attackers. Since many geometric properties (e.g., vector length, distance and inner product) are preserved, attackers can easily locate X‡'s corresponding part, Y‡, in the perturbed data set by comparing those values. From Y = RX, we know the same linear transformation holds between X‡ and Y‡: Y‡ = R X‡.
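This relation is exactly what the regression attack exploits. A minimal sketch (hypothetical data; R drawn as a random orthonormal matrix via a QR decomposition) recovers R from a few known, matched records and then inverts the whole perturbation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 200
X = rng.uniform(1, 100, size=(d, n))       # hypothetical private data

# Distance-preserving perturbation: orthonormal R via QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = R @ X

# The attacker knows d original records; because norms and distances are
# preserved, their images in Y can be located by value comparison.  With
# the matched pairs, solve Y_known = R X_known for R.
idx = [0, 1, 2]
X_known, Y_known = X[:, idx], Y[:, idx]
R_hat = Y_known @ np.linalg.inv(X_known)   # use lstsq when more pairs are known

X_rec = R_hat.T @ Y                        # R orthonormal: inverse = transpose
print(np.max(np.abs(X_rec - X)))           # ~0: the whole data set is breached
```

With d linearly independent known records the recovery is exact up to floating-point error, which is why so few points suffice in the example below.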
Once the size of X‡ is at least rank(X), the transformation matrix R can be perfectly recovered through linear regression. We call this a linear-regression-based attack. For Example 3 in Section 4.1.1, only 2 points are needed to break the privacy of the whole data set. We may take points A and B as two customers' data known by attackers. The vector lengths and the distance between them can then be calculated: |A| = 1.8055, |B| = 2.1819 and |A - B| = 0.7179. The geometric properties preserved in the perturbed data easily help attackers locate the corresponding transformed points A′ and B′ by comparing those calculated key values. Finally, a linear regression using A, B, A′ and B′ derives the transformation matrix R in this example.

Known-Sample-Based PCA Attack

For the case where X‡ = ∅ or is too small, attackers may have little or no information about the exact data in X. However, the known sample X̃ is drawn from the same population as the original data X. By taking advantage of the distribution learned from the sample, the authors of [Liu, Giannella and Kargupta 2006] proposed a Known-Sample-Based PCA Attack. The idea is briefly given as follows. Since the known sample and the private data share the same distribution, the eigenspaces (eigenvalues) of their covariance matrices are expected to be close to each other. As we know, the transformation here is a geometric rotation which does not change the shape of distributions (i.e., the eigenvalues derived from the sample data are close to those derived from the transformed data). Hence, the rotation angles between the eigenspace derived from the known samples and the eigenspace derived from the transformed data can be easily identified. In other words, the rotation matrix R is recovered. This attack is formally supported by the following theorem from [Liu, Giannella and Kargupta 2006]; to be consistent, we use the notation defined in this chapter. Figure 4.2 gives the complete attacking procedure.
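The procedure can be sketched in code as follows. This is a toy instantiation under stated assumptions: the two attributes are given skewed (non-symmetric) distributions so that the signs in the diagonal matrix D are identifiable, and the similarity test G is implemented as a simple quantile distance, which is only one of many possible choices:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
d, n, k = 2, 5000, 300

# Correlated, skewed "population"; X is private, X_tilde is a known sample
A = np.array([[2.0, 0.4], [0.4, 1.0]])
X = A @ (rng.exponential(1.0, size=(d, n)) - 1.0)
X_tilde = A @ (rng.exponential(1.0, size=(d, k)) - 1.0)

R, _ = np.linalg.qr(rng.normal(size=(d, d)))    # orthonormal perturbation
Y = R @ X

# Steps 1-2: eigenvectors of the two covariance matrices
_, Qx = np.linalg.eigh(np.cov(X_tilde))
_, Qy = np.linalg.eigh(np.cov(Y))

# Step 3: search over diagonal D with +-1 entries; G compares per-attribute
# quantiles of the candidate reconstruction against the known sample
qs = np.linspace(0.05, 0.95, 19)
best_D, best_err = None, np.inf
for signs in product([1.0, -1.0], repeat=d):
    D = np.diag(signs)
    cand = Qx @ D @ Qy.T @ Y
    err = sum(np.abs(np.quantile(cand[i], qs) - np.quantile(X_tilde[i], qs)).sum()
              for i in range(d))
    if err < best_err:
        best_D, best_err = D, err

# Step 4: the estimate of the private data
X_hat = Qx @ best_D @ Qy.T @ Y
print(np.linalg.norm(X_hat - X) / np.linalg.norm(X))   # small relative error
```

The residual error comes only from estimating the eigenvectors from a finite sample; it shrinks as the known sample grows.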
Theorem 4.5 The eigenvalues of Σ_X and Σ_Y are the same (Λ_X = Λ_Y), and the transformation between X and Y is the transformation between the corresponding eigenvectors (R Q_X = Q_Y D), where R is the orthonormal transformation matrix applied in the perturbation, D is a diagonal matrix whose diagonal entries are either 1 or -1, and Σ_X and Σ_Y are the covariance matrices of X and Y, with eigenvalue decompositions Σ_X = Q_X Λ Q_X^T and Σ_Y = Q_Y Λ Q_Y^T respectively.

input:  Y, the perturbed data set
        X̃, a given subset of the original data
output: X̂, an estimate of the original data set
BEGIN
1  Compute the covariance matrix Σ_X̃ from X̃ and the covariance matrix Σ_Y from Y.
2  Perform eigenvalue decompositions on Σ_X̃ and Σ_Y to get their eigenvectors:
       Σ_X̃ = Q_X̃ Λ_X̃ Q_X̃^T,    Σ_Y = Q_Y Λ_Y Q_Y^T
3  Choose D = argmax G(Q_Y D Q_X̃^T X̃, Y), where G is a function that tests the
   similarity between the distributions of two data sets.
4  Estimate X as X̂ = Q_X̃ D Q_Y^T Y.
END
Figure 4.2: Known-Sample-Based PCA Attack

4.3.2 Attacks for Non-Distance-Preserving-Based Projection

The important step in the regression-based attack shown above is to locate the sample's corresponding perturbed data in Y. The geometric properties preserved by the distance-preserving projection model help attackers do so. However, for the non-distance-preserving projection model and the general model with a random transformation matrix R, those properties might not be preserved any more. In this part we present an effective attack which may be exploited by attackers. Although we cannot apply ICA directly to estimate X from the perturbed data, we will show that there exists a possible ICA-based attacking method when a subset of the original data is available.

AK-ICA Attack

In [Guo and Wu 2007], we showed that attackers can reconstruct X closely by applying a proposed A-priori-Knowledge ICA-based attack (AK-ICA) when a (even small) sample of data, X̃, is available to attackers.
Let X̃ ⊂ X be this sample data set consisting of k data records and d attributes. The core idea of AK-ICA is to apply traditional ICA to the known sample data set X̃ and the perturbed data set Y to get their mixing matrices and independent components respectively, and then to reconstruct the original data by exploiting the relationships between them. Figure 4.3 shows the procedure of this attack.

input:  Y, a given perturbed data set
        X̃, a given subset of the original data
output: X̂, a reconstructed data set
BEGIN
1  Apply ICA on X̃ and Y to get X̃ = A_x̃ S_x̃ and Y = A_y S_y.
2  Derive the transformation matrix J by comparing the distributions of S_x̃ and S_y.
3  Reconstruct X approximately as X̂ = A_x̃ J S_y.
END
Figure 4.3: AK-ICA Attack

The first step of this attack is to derive the ICA representations (A_x̃, S_x̃) and (A_y, S_y) from the a-priori known subset X̃ and the perturbed data Y respectively. Since in general we cannot find a unique representation (A, S) for a given X (recall from Section 4.2.1 that X = AS = (AΛP)(P^{-1}Λ^{-1}S) for any diagonal matrix Λ and permutation matrix P), S is usually required to have unit variance to avoid the scale issue in ICA. As a consequence, only the order and sign of the signals S might differ. In the following, proofs are given to show that there exists a transformation matrix J such that X̂ = A_x̃ J S_y is an estimate of the original data X. We also present how to identify J, with an example.

Existence of the Transformation Matrix J

To derive the transformation matrix J, let us first assume X is given. Applying independent component analysis, we get X = A_x S_x, where A_x is the mixing matrix and S_x are the independent signals.

Corollary 2 The mixing matrices A_x and A_x̃ are expected to be close to each other, and the underlying signals S_x̃ can approximately be regarded as a subset of S_x:

A_x̃ ≈ A_x Λ_1 P_1,    S_x̃ ≈ P_1^{-1} Λ_1^{-1} S̃_x    (4.3)

Proof.
Consider an element x_ij of X; it is determined by the i-th row of A_x, a_i = (a_{i1}, a_{i2}, ..., a_{id}), and the j-th signal vector, s_j = (s_{1j}, s_{2j}, ..., s_{dj})^T:

x_ij = a_{i1} s_{1j} + a_{i2} s_{2j} + ... + a_{id} s_{dj}

Let x̃_p be a column vector of X̃ that was randomly sampled from X. Assuming x̃_p = x_j, the i-th element of this vector, x̃_{ip}, can also be expressed through a_i and the corresponding signal vector s_j:

x̃_{ip} = a_{i1} s_{1j} + a_{i2} s_{2j} + ... + a_{id} s_{dj}

Thus, for a given column vector of X̃, we can always find a corresponding signal vector in S_x and reconstruct it through the mixing matrix A_x. Since S_x is a set of independent components, its sampled subset S̃_x ⊂ S_x can also be regarded as a set of independent components of X̃ when the sample size of X̃ is large. Hence there exist a diagonal matrix Λ_1 and a permutation matrix P_1 such that

X̃ = A_x̃ S_x̃ ≈ A_x S̃_x = (A_x Λ_1 P_1)(P_1^{-1} Λ_1^{-1} S̃_x)

A_x̃ ≈ A_x Λ_1 P_1,    S_x̃ ≈ P_1^{-1} Λ_1^{-1} S̃_x

Corollary 3 S_x and S_y are similar to each other; there exist a diagonal matrix Λ_2 and a permutation matrix P_2 such that S_y ≈ P_2^{-1} Λ_2^{-1} S_x.

Proof. Y = RX = R(A_x S_x) = (R A_x) S_x. Since the ICA representation may affect the order and sign of the signals S_y, we have

Y = A_y S_y ≈ (R A_x Λ_2 P_2)(P_2^{-1} Λ_2^{-1} S_x)    (4.4)

By comparing the above two equations, we get

A_y ≈ R A_x Λ_2 P_2,    S_y ≈ P_2^{-1} Λ_2^{-1} S_x

Theorem 4.6 (Existence of J) There exists a transformation matrix J such that

X̂ = A_x̃ J S_y ≈ X    (4.5)

where A_x̃ is the mixing matrix of X̃ and S_y are the independent components of the perturbed data Y.

Proof. Since S_x̃ ≈ P_1^{-1} Λ_1^{-1} S̃_x, S_y ≈ P_2^{-1} Λ_2^{-1} S_x, and S̃_x is a subset of S_x, we can find a transformation matrix J that matches the independent components of S_y to those of S_x̃. Hence

J P_2^{-1} Λ_2^{-1} = P_1^{-1} Λ_1^{-1}
J = P_1^{-1} Λ_1^{-1} Λ_2 P_2

From Equations 4.3 and 4.4 we have

X̂ = A_x̃ J S_y ≈ (A_x Λ_1 P_1)(P_1^{-1} Λ_1^{-1} Λ_2 P_2)(P_2^{-1} Λ_2^{-1} S_x) = A_x S_x = X

Determining J

The ICA model given in Definition 4.5 implies no ordering of the independent components.
The reason is that, both s and A being unknown, we can freely change the order of the terms in the sum in Definition 4.5 and call any of the independent components the first one. Formally, a permutation matrix P and its inverse can be substituted into the model to give another solution in a different order. As a consequence, in our case, the i-th component of S_y may correspond to the j-th component of S_x̃. Hence we need to figure out how to find the transformation matrix J. Since S_x̃ is a subset of S_x, each pair of corresponding components follows similar distributions. Our strategy is therefore to analyze the distributions of the two signal data sets, S_x̃ and S_y. As we discussed before, the signals derived by ICA are normalized, so the scalar for each attribute is either 1 or -1; this can also easily be read off from the distributions.

Let S_x̃^(i) and S_y^(j) denote the i-th component of S_x̃ and the j-th component of S_y, and let f_i and f′_j denote their density functions respectively. In this chapter, we use the information difference measure I to measure the similarity of two distributions [Agrawal and Agrawal 2001]:

I(f_i, f′_j) = (1/2) E[ ∫_Ω | f_i(z) − f′_j(z) | dz ]    (4.6)

The above metric equals half the expected value of the L_1-norm between the distribution of the i-th component of S_x̃ and that of the j-th component of S_y. It is also equal to 1 − α, where α is the area shared by both distributions. The smaller I(f_i, f′_j), the more similar the pair of components. The matrix J is determined so that J [f′_1, f′_2, ..., f′_d]^T ≈ [f_1, f_2, ..., f_d]^T. In the following, we illustrate how this works with an example.

Example 7 The data set X in this example contains 5 attributes and 50,000 records. From X, 1,000 records are randomly extracted, denoted X̃. The original data X is also perturbed by Y = RX. Applying ICA on X̃ and Y, we get their ICA representations (A_x̃, S_x̃) and (A_y, S_y) respectively.
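This matching step can be sketched directly. In the sketch below, the signals are simulated stand-ins for the example's actual components (three hypothetical skewed distributions whose order and signs we scramble by hand), and the density distance of Equation 4.6 is approximated by a histogram L_1 distance:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, k = 3, 20000, 2000

# Three skewed, mutually distinguishable independent signals
S = np.vstack([rng.exponential(1.0, n) - 1.0,
               rng.gamma(4.0, 1.0, n) - 4.0,
               2.0 - rng.exponential(2.0, n)])

# S_y: the components of the perturbed data -- the same signals with the
# order permuted and one sign flipped; S_xt: components of the known sample
perm, signs = [2, 0, 1], np.array([1.0, -1.0, 1.0])
S_y = signs[:, None] * S[perm]
S_xt = S[:, :k]

bins = np.linspace(-10.0, 15.0, 100)

def info_diff(a, b):
    """Half the L1 distance between normalized histograms (cf. Eq. 4.6)."""
    pa, _ = np.histogram(a, bins=bins, density=True)
    pb, _ = np.histogram(b, bins=bins, density=True)
    return 0.5 * np.sum(np.abs(pa - pb)) * (bins[1] - bins[0])

# For each component of S_xt, pick the closest component of S_y (both signs)
J = np.zeros((d, d))
for i in range(d):
    dist, j, s = min((info_diff(S_xt[i], s * S_y[j]), j, s)
                     for j in range(d) for s in (1.0, -1.0))
    J[i, j] = s

print(J)   # J undoes the permutation and sign flips applied to S_y
```

Note that the sign of a perfectly symmetric component could not be recovered this way, which is why the stand-in distributions here are all skewed.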
For each pair of components of S_x̃ and S_y, we apply the information difference measure of Equation 4.6 to compute their similarity. The derived transformation matrix J is shown below:

    |  0   1   0   0   0 |
    |  1   0   0   0   0 |
J = |  0   0  -1   0   0 |
    |  0   0   0   0   1 |
    |  0   0   0  -1   0 |

which means that the components [S_x̃^(1), S_x̃^(2), S_x̃^(3), S_x̃^(4), S_x̃^(5)] derived from X̃ correspond to [S_y^(2), S_y^(1), S_y^(3), S_y^(5), S_y^(4)] respectively, and that the pairs (S_x̃^(3), S_y^(3)) and (S_x̃^(5), S_y^(4)) differ by a scalar of -1.

[Figure 4.4: Distributions of two matched components: (a) S_x̃^(1), component 1; (b) S_y^(2), component 2]

Figure 4.4 shows the density distributions of the two matched components (S_x̃^(1), S_y^(2)). When we have X, its ICA representation (A_x, S_x) can also be derived. From Corollaries 2 and 3, we obtain P_1^{-1} Λ_1^{-1} and Λ_2 P_2 as follows:

                  |  0   1   0   0   0 |             |  1   0   0   0   0 |
                  |  1   0   0   0   0 |             |  0   1   0   0   0 |
P_1^{-1}Λ_1^{-1} =|  0   0   1   0   0 |   Λ_2 P_2 = |  0   0  -1   0   0 |
                  |  0   0   0   0  -1 |             |  0   0   0   1   0 |
                  |  0   0   0  -1   0 |             |  0   0   0   0  -1 |

We can easily check that the derived transformation matrix J equals P_1^{-1} Λ_1^{-1} Λ_2 P_2.

4.3.3 Attacks for General Projection

The general-linear-transformation-based perturbation model is an integration of the additive-noise-based and projection-based perturbation models. Therefore, it inherits properties of both models and brings more randomness to the protected data. As previous works [Huang, Du and Chen 2005; Kargupta et al. 2003] indicated, Spectral Filtering and PCA-based methods work very well against the additive-noise-based perturbation. However, it is hard to breach privacy in the projection-based perturbation this way, since spectral properties can hardly be preserved in the perturbed data, especially when the original data is projected to a lower dimensional space.
Thus, we cannot filter out the noise by extracting eigenvectors (eigenvalues) of the original X from the rotated data Y. It is also hard to derive the linear transformation matrix of the non-distance-preserving projection model by simply analyzing the spectral information. Since the noise-free ICA model can be extended to a noisy-ICA model, AK-ICA can likewise be extended to attack the general-linear-transformation-based perturbation by solving the noisy-ICA problem.

Definition 4.6 (Noisy ICA model) [Hyvarinen, Karhunen and Oja 2001] ICA of a random vector x = (x_1, ..., x_m)^T consists of estimating the following generative model for the data:

x = As + e    or    X = AS + E

where the latent variables (components) s_i in the vector s = (s_1, ..., s_n)^T are assumed independent. The matrix A is a constant m × n mixing matrix, and e is an m-dimensional random noise vector.

PCA, as well as Spectral Filtering, is a purely second-order statistical method: only covariances between the observed variables are used in the estimation. This is due to the assumption of Gaussianity of the components. The components are further assumed to be uncorrelated, which also implies independence in the case of Gaussian data. On the contrary, components in ICA are assumed to be statistically independent and non-Gaussian. The spectral filtering method [Kargupta et al. 2003] based on random matrix theory and the PCA-based reconstruction method [Huang, Du and Chen 2005] can only be used to approximately reconstruct the private data from the additive-noise-based perturbation (i.e., Y = X + E). The reason is that both methods utilize the spectral properties of the randomized data to separate the additive noise from the original data. To separate the additive noise from the perturbed data, some properties of E (e.g., its covariance matrix) must be known. Even though it is possible to remove part of the noise on some dimensions, the spectral properties are of no help against the rotation.
On the contrary, the AK-ICA attack can effectively reconstruct the data from the general linearly transformed data (i.e., Y = RX + E, where R can be any rotation matrix) when only a small sample is available to attackers. The attackers do not even need any information about E. Experiments in the next section will show that AK-ICA can even outperform the spectral-filtering-based method for the additive-noise-based perturbation when the strength of E is large. In other words, spectral filtering and PCA-based methods are usually not robust to relatively large noise. The AK-ICA method is expected to be more robust to Gaussian noise, since the objective functions used in ICA methods are higher-order statistics (e.g., kurtosis). Both ICA and PCA formulate a general objective function that defines the interestingness of a linear representation, and then maximize that function. Both are related to factor analysis, though under the contradictory assumptions of Gaussianity and non-Gaussianity, respectively. PCA uses only second-order statistics, while ICA can use both second- and fourth-order cumulants.

4.4 Evaluation

In our AK-ICA method, we applied the JADE package (http://www.tsi.enst.fr/icacentral/algos.html) implemented by Jean-Francois Cardoso to conduct the ICA analysis. JADE is a cumulant-based batch algorithm for source separation [Cardoso 1999]. Since our AK-ICA attack can reconstruct individual data in addition to its distribution, in this study we cast our accuracy analysis in terms of both matrix-norm and individual-wise errors. We measure the reconstruction errors using the following measures:

RE = (1 / (d × n)) Σ_{i=1}^{d} Σ_{j=1}^{n} | (x_ij − x̂_ij) / x_ij |

RE-R_i = (1/n) Σ_{j=1}^{n} | (x_ij − x̂_ij) / x_ij |,    i = 1, ..., d

RE-C_j = (1/d) Σ_{i=1}^{d} | (x_ij − x̂_ij) / x_ij |,    j = 1, ..., n

F-AE(X, X̂) = ||X̂ − X||_F

F-RE(X, X̂) = ||X̂ − X||_F / ||X||_F

where X and X̂ denote the original data and the estimated data respectively, and ||·||_F denotes the Frobenius norm, ||X||_F = sqrt( Σ_{i=1}^{d} Σ_{j=1}^{n} x_ij^2 ).
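These measures translate directly into code. The helper below is a straightforward sketch (attribute/record axes follow the d × n convention of this chapter, and X is assumed to have no zero entries so the relative errors are well defined):

```python
import numpy as np

def reconstruction_errors(X, X_hat):
    """Error measures of Section 4.4; X and X_hat are d x n arrays
    (d attributes, n records), with no zero entries in X."""
    rel = np.abs((X - X_hat) / X)
    f_ae = np.linalg.norm(X_hat - X, "fro")
    return {
        "RE": rel.mean(),             # average over all d*n entries
        "RE-R": rel.mean(axis=1),     # one value per attribute i
        "RE-C": rel.mean(axis=0),     # one value per record j
        "F-AE": f_ae,
        "F-RE": f_ae / np.linalg.norm(X, "fro"),
    }

# Tiny worked example (hypothetical numbers)
X = np.array([[1.0, 2.0], [4.0, 5.0]])
X_hat = np.array([[1.1, 2.0], [4.0, 5.0]])
errs = reconstruction_errors(X, X_hat)
print(errs["RE"], errs["F-AE"])       # approximately 0.025 and 0.1
```

In the worked example only x_11 is mis-estimated, by 10%, so RE averages 0.1 over the 4 entries and F-AE equals the single absolute deviation 0.1.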
All the above measures show how closely one can estimate the original data X from its perturbed data Y. Here we follow the tradition of using this difference as the measure to quantify how much privacy is preserved. Basically, RE (relative error) represents the average of the relative errors of the individual data points. RE-R_i represents the average relative error of the i-th attribute, while RE-C_j represents the average relative error of the j-th record. Since we have 50,000 records in our bank data set, we only list the minimum and maximum values of RE-C_j in our results (see Tables 4.1, 4.2 and 4.4). However, we would point out that RE-C_j is an important measure, since it shows the privacy breach for each individual customer. F-AE (F-RE) denotes the absolute (relative) error between X and its estimate X̂ in terms of the Frobenius norm, which gives the perturbation evaluation a simplicity that makes it easier to interpret.

4.4.1 Effect of Noise and the Transformation Matrix

In this experiment, we evaluate the performance of AK-ICA in the general scenario Y = RX + E. In this scenario, both R and E determine the perturbation of the original data. We examine how the reconstruction accuracy is affected by these two factors by changing the strength of the noise E in the following four cases with various transformation matrices.

• Case 1: R = I
• Case 2: R^T R = I
• Case 3: R1 is a random matrix with det(R1) = 0.444, ||R1||_F = 3.167.
• Case 4: R2 is another random matrix with det(R2) = 2.48 × 10^9, ||R2||_F = 281.8.

For Cases 3 and 4, R1 and R2 were generated randomly with significantly different determinant and Frobenius norm values. We apply the term Signal-to-Noise Ratio (SNR) to quantify the relative amount of noise added to the actual data.
SNR = 20 log_10( ||X||_F / ||E||_F )

In all four cases, we have 1000 known samples and introduce additive Gaussian noise from 20db down to -5db in terms of SNR. An SNR of 0db means that the noise strength equals that of the original data. In Figure 4.5, we plot the reconstruction error (RE) against the SNR for all four cases, and we include the complete results in terms of the various reconstruction error measures in Table 4.1. In all four cases, we can observe that the proposed AK-ICA-based reconstruction method is quite robust when small or medium noise is added. However, with large noise (SNR < 5db), the reconstruction accuracy is noticeably degraded. We will explore further how ICA methods based on higher-order statistics are affected by large Gaussian noise in future work.

Case 1 (R = I):
 SNR   F-AE    F-RE    RE      RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min / max
  20    3641   0.108   0.1362  0.072  0.102  0.127  0.331  0.049  0.053 / 8.191
  15    3387   0.100   0.124   0.067  0.101  0.112  0.285  0.055  0.034 / 6.149
  10    3390   0.100   0.118   0.071  0.093  0.095  0.256  0.075  0.037 / 1.163
   5    5285   0.156   0.127   0.152  0.086  0.129  0.158  0.113  0.034 / 2.193
   0    9283   0.274   0.230   0.265  0.122  0.232  0.290  0.242  0.0687 / 1.839
  -5   14538   0.430   0.464   0.415  0.513  0.200  0.766  0.423  0.164 / 1.934

Case 2 (R^T R = I):
 SNR   F-AE    F-RE    RE      RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min / max
  20    3493   0.103   0.133   0.068  0.099  0.118  0.327  0.055  0.043 / 6.103
  15    3371   0.099   0.134   0.065  0.088  0.097  0.348  0.074  0.045 / 3.228
  10    3775   0.112   0.151   0.075  0.070  0.079  0.413  0.116  0.037 / 1.013
   5    6232   0.184   0.235   0.152  0.084  0.107  0.621  0.209  0.060 / 8.784
   0   15808   0.468   0.318   0.433  0.243  0.470  0.243  0.199  0.078 / 22.85
  -5   19067   0.464   0.358   0.312  0.304  0.394  0.396  0.382  0.231 / 1.443

Case 3 (random R1):
 SNR   F-AE    F-RE    RE      RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min / max
  20    3415   0.101   0.125   0.067  0.104  0.120  0.291  0.044  0.052 / 1.013
  15    3787   0.112   0.143   0.076  0.097  0.124  0.352  0.064  0.049 / 1.169
  10    3576   0.106   0.128   0.075  0.093  0.098  0.292  0.082  0.043 / 1.213
   5    5448   0.161   0.188   0.133  0.089  0.115  0.443  0.162  0.053 / 1.264
   0   14018   0.415   0.281   0.418  0.166  0.406  0.202  0.213  0.066 / 1.265
  -5   12895   0.382   0.373   0.369  0.217  0.319  0.316  0.144  0.07 / 2.060

Case 4 (random R2):
 SNR   F-AE    F-RE    RE      RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min / max
  20    3453   0.102   0.129   0.068  0.103  0.118  0.311  0.045  0.045 / 1.039
  15    3495   0.103   0.131   0.069  0.102  0.112  0.312  0.057  0.048 / 1.374
  10    3574   0.106   0.127   0.073  0.089  0.104  0.289  0.079  0.046 / 0.872
   5    5438   0.161   0.189   0.132  0.093  0.114  0.442  0.168  0.053 / 1.576
   0   13249   0.392   0.273   0.426  0.161  0.382  0.190  0.204  0.062 / 1.438
  -5   12349   0.365   0.443   0.417  0.161  0.068  1.272  0.297  0.109 / 2.984

Table 4.1: Reconstruction error vs. SNR for the four cases (Y = RX + E) when k = 1000

[Figure 4.5: The effect of the noise E on RE (RE vs. SNR for the four cases)]

From Table 4.1 and Figure 4.5, we may also observe that the proposed AK-ICA reconstruction method is very stable across all four cases, which means it is insensitive to the choice of the rotation matrix R. In other words, once attackers have a sample subset X̃, they can always obtain similar estimates no matter how database owners change Y by choosing a different R. This is because S_y is stable and close to S_x, since the ICA technique itself is robust to the rotation matrix R. This is a major advantage over the spectral filtering and PCA-based reconstruction methods, which can only deal with the additive-noise case (R = I).

4.4.2 Effect of the Sample Size

In this experiment, we work on the scenario Y = RX without additive noise E. We also fix the rotation matrix R as an orthonormal matrix satisfying R^T R = I; it can be randomly generated based on the Haar distribution [Stewart 1980]. We change the sample size k of X̃ from 20 to 2000. Please note that all chosen k values are small compared with the size of the original data.
Figure 4.6 shows that the reconstruction error (in terms of F-RE and RE in Figure 4.6(a), and RE-R_i for each attribute in Figure 4.6(b)) decreases as the sample size k increases. A similar trend also holds for RE-C_j for each record, for which we show the minimum and maximum values in Table 4.2. This is because the more sample data we have, the better the match between the derived independent components. With 500 known records (which account for 1% of the original data), we can achieve a very low reconstruction error (F-RE = 0.155, RE = 0.126).

 Sample size k   F-AE    F-RE    RE     RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min / max
   20           12791   0.378   0.372  0.529  0.075  0.112  1.019  0.122  0.192 / 9.024
   50            8574   0.254   0.286  0.068  0.222  0.372  0.458  0.311  0.181 / 4.267
   80            7748   0.231   0.238  0.101  0.314  0.233  0.387  0.153  0.093 / 5.209
  100            7304   0.216   0.225  0.104  0.054  0.309  0.435  0.225  0.144 / 3.233
  200            7224   0.214   0.198  0.238  0.259  0.116  0.293  0.084  0.094 / 4.898
  300            6463   0.191   0.187  0.088  0.049  0.294  0.418  0.087  0.092 / 5.244
  400            5444   0.161   0.170  0.147  0.113  0.133  0.298  0.158  0.079 / 4.245
  500            5225   0.155   0.126  0.058  0.097  0.263  0.194  0.014  0.095 / 0.998
  800            3829   0.113   0.128  0.073  0.034  0.045  0.230  0.259  0.087 / 0.848
 1000            3458   0.102   0.128  0.068  0.105  0.125  0.301  0.038  0.059 / 0.970
 2000            2297   0.068   0.094  0.042  0.009  0.041  0.250  0.128  0.076 / 1.351

Table 4.2: Reconstruction error vs. sample size k when Y = RX

[Figure 4.6: Reconstruction error vs. the known sample size k under Y = RX: (a) F-RE and RE; (b) RE-R_i for each attribute]

[Figure 4.7: Reconstruction error for 10 random samples with fixed size k = 50: (a) F-RE and RE; (b) RE-R_i for each attribute]

When the sample size decreases, more error is introduced. However, even with only 20 known samples (which account for 0.04% of the original data), we can still achieve very close estimates for some attributes (e.g., RE-R_i = 0.075 for attribute 2).
In particular, for small k we also evaluate how different sample sets X̃ of the same size affect the AK-ICA reconstruction method. Here we randomly chose 10 different sample sets with the fixed size k = 50. Figure 4.7 shows the reconstruction errors for the 10 sample sets. The performance of our AK-ICA reconstruction method is not very stable in this small-sample case. For example, the first run achieves an F-RE of 0.1 while the third run achieves 0.44, as shown in Figure 4.7(a). The instability here is mainly caused by A_x̃, which is derived from X̃; since Y = RX + E is fixed, the derived S_y does not change. We also observed that for each particular attribute, the reconstruction accuracy across rounds is not stable either. As shown in Figure 4.7(b), attribute 5 has the largest error among all attributes in round 5, yet the smallest error in round 7. This is because the reconstruction accuracy of one attribute is mainly determined by the accuracy of the estimate of the corresponding column vector of A_x̃. This instability can also be observed in Figure 4.6(b).

4.4.3 Comparing AK-ICA and the Known-Sample-Based PCA Attack

In this experiment, we evaluate the reconstruction performance of AK-ICA and the known-sample-based PCA attack of [Liu, Giannella and Kargupta 2006]. Since the known-sample-based PCA attack cannot handle additive noise, we compare the two attacking methods in the scenario Y = RX with no noise involved. We fix the sample ratio at 1% and apply different transformation matrices. Here R is expressed as R = R1 + cR2, where R1 is a random orthonormal matrix, R2 is a random matrix with elements uniformly distributed in [-0.5, 0.5], and c is a coefficient.
Initially, c is set to 0, which guarantees the orthonormal property of R. As c increases, R gradually loses the orthonormal property and tends to an arbitrary transformation. From Figures 4.8(a) and 4.8(b) we can observe that our AK-ICA attack is robust to various transformations: the reconstruction errors do not change much when the transformation matrix R is made more non-orthonormal. On the contrary, the PCA attack only works when R is orthonormal or close to orthonormal. When the transformation becomes more non-orthonormal (with the increase of c, as shown in Table 4.3), the reconstruction accuracy of the PCA attack degrades significantly. For example, when we set c = 5, the relative reconstruction errors of the PCA attack are more than 200% (F-RE = 2.1414, RE = 2.1843), while those of the AK-ICA attack are less than 20% (F-RE = 0.1444, RE = 0.1793).

[Figure 4.8: Reconstruction error of the AK-ICA vs. PCA attacks with varying R: (a) F-RE; (b) RE]

4.4.4 Comparing AK-ICA and the Spectral-Filtering-Based Attack

To compare AK-ICA with the spectral-filtering-based method [Kargupta et al. 2003], we choose the additive-noise-based perturbation model with no projection (Y = X + E). As we introduced in Chapter 3, the spectral filtering method assumes the covariance matrix of E to be given in order to separate the principal components from the perturbed data. The reconstruction accuracy of the spectral-filtering-based method is mainly determined by how well the principal components can be separated from the perturbed data. Table 4.4 shows how the spectral filtering method performs when we vary the strength of the additive noise E from 20db to -5db: the reconstruction accuracy decreases significantly when large noise is introduced.
We plot the comparison of reconstruction accuracy between AK-ICA and spectral filtering as the strength of E increases in Figure 4.9. We assume 1,000 records are available to attackers. We can see that the spectral filtering method outperforms AK-ICA when a relatively small noise (SNR > 5db) is introduced, while AK-ICA outperforms spectral filtering when a relatively large noise (SNR ≤ 5db) is introduced. We would emphasize again that the spectral filtering (and PCA-based) method can only reconstruct data in the additive-noise case (i.e., Y = X + E) while our AK-ICA approach works robustly in all cases (i.e., Y = RX + E).

Table 4.3: Reconstruction error of AK-ICA vs. PCA attacks by varying R

  c      ||cR2||F/||R1||F   AK-ICA F-RE   AK-ICA RE   PCA F-RE   PCA RE
  0        0                  0.0824        0.1013      0.013      0.0126
  0.2      0.1299             0.1098        0.1003      0.0451     0.0448
  0.3      0.1988             0.0701        0.0618      0.1288     0.1247
  0.4      0.3121             0.1336        0.1631      0.1406     0.1305
  0.5      0.3011             0.1867        0.2436      0.1825     0.1704
  0.7      0.4847             0.1227        0.1188      0.2415     0.2351
  1        0.539              0.065         0.0606      0.35       0.334
  1.25     0.804              0.1177        0.1399      0.5565     0.5695
  1.5      0.8059             0.1533        0.169       0.3336     0.3354
  2        1.2755             0.1709        0.1523      0.7598     0.7368
  2.5      1.5148             0.0816        0.1244      0.8906     0.8946
  3        1.9321             0.1142        0.1373      0.6148     0.592
  3.5      2.1238             0.1303        0.1566      1.631      1.6596
  4        2.4728             0.1249        0.1314      1.5065     1.5148
  4.5      3.049              0.0707        0.0543      1.0045     0.9815
  5        3.4194             0.1444        0.1793      2.1414     2.1843

Figure 4.9: Reconstruction error vs. SNR for SF and AK-ICA (with fixed size k = 1000) when Y = X + E

4.5 Summary

In this chapter, we have examined the effectiveness of general projection in privacy preserving data mining. It was suggested in [Liu, Giannella and Kargupta 2006] that the non-isometric projection approach is effective in preserving privacy since it is resilient to the PCA attack, which was designed for the distance-preserving projection approach.
We proposed an AK-ICA attack, which can be exploited by attackers to breach privacy from non-isometrically transformed data. Our theoretical analysis has shown that the proposed attack poses a threat to all projection-based privacy preserving methods when a small sample data set is available to attackers. We argued that this is a genuine concern that needs to be addressed in practice.

Table 4.4: Reconstruction error vs. SNR for spectral filtering method when Y = X + E

  SNR   F-AE     F-RE    RE      RE-R1   RE-R2   RE-R3   RE-R4   RE-R5   RE-Cj min   RE-Cj max
  20    711.5    0.021   0.018   0.017   0.016   0.021   0.022   0.016   0.005       0.133
  15    813.1    0.024   0.032   0.030   0.029   0.037   0.038   0.029   0.009       0.209
  10    2371     0.070   0.081   0.052   0.052   0.066   0.182   0.051   0.016       3.026
  5     5285     0.156   0.156   0.093   0.278   0.119   0.195   0.094   0.035       2.997
  0     13213    0.390   0.457   0.173   0.463   0.2111  0.599   0.838   0.337       11.44
  -5    17719    0.523   0.611   0.296   0.678   0.372   0.793   0.917   0.535       1.470

CHAPTER 5: DISCLOSURE ANALYSIS OF THE MODEL-BASED PRIVACY PRESERVING APPROACH

The issue of confidentiality and privacy in general databases has become increasingly prominent in recent years. Disclosures that can occur as a result of inferences by snoopers fall into two classes: identity disclosure and value disclosure. Identity disclosure relates to the disclosure of the identity of an individual in the database, while value disclosure relates to the disclosure of the value of a certain confidential attribute of that individual. To prevent disclosures, various randomization-based approaches (e.g., [Adam and Wortman 1989; Agrawal and Srikant 2000; Palley and Simonoff 1987; Sarathy and Muralidhar 2002]) have been investigated. A key element in preserving privacy and confidentiality of sensitive data is the ability to evaluate the extent of all potential disclosure for such data. In other words, we need to be able to answer to what extent confidential information in a perturbed or transformed database can be compromised by attackers or snoopers.
This is a major challenge for current randomization-based approaches. To evaluate the privacy and confidentiality residing in general databases which contain both categorical and numerical attributes, the authors in [Wu, Wang and Zheng 2005] proposed a general framework for modeling general databases using the General Location model. One advantage of the general location model is that it can be used to conduct both identity disclosure and value disclosure analysis, since it integrates both categorical and numerical attributes in one model. The general location model is defined in terms of the marginal distribution of the categorical attributes and the conditional distribution of the numerical attributes given each cell determined by the categorical attributes. The former is described by a multinomial distribution on the cell counts when we summarize the categorical part as a multi-dimensional contingency table. The numerical attributes of tuples in each cell are assumed to follow a multivariate normal distribution with parameters µ, Σ, where µ is a vector of means and Σ is a covariance matrix. It is no wonder that those parameters (e.g., µ, Σ) may be used by attackers or snoopers to derive some confidential information. For example, from one distribution such as "the wages of customers from zip = 28223 and race = Asian follow a normal distribution with mean 70k and standard deviation 10k", snoopers can safely derive a 95% coverage interval of [50.4k, 89.6k]. This derived coverage interval may violate customers' privacy requirements. Continuing this line of previous work, in this dissertation we focus on value disclosure which can occur as a result of inferences by attackers or snoopers from the multivariate normal distributions. Furthermore, we will consider various factors in general databases and conduct disclosure analysis for the following scenarios.
• Basic disclosure scenario. All numerical attributes contained in the database are sensitive attributes. Various correlations exist among those attributes.

• Conditional disclosure scenario. The database contains other non-confidential numerical attributes apart from the confidential ones. Here we assume non-confidential attributes are not perturbed, as they may be retrieved accurately by snoopers from other public sources. One problem that arises is that snoopers may exploit the relationship between non-confidential and confidential attributes to predict individual values of confidential attributes.

• Linear combination scenario. The database contains many linear combinations among both confidential and non-confidential numerical attributes. The combinations here can be either known or hidden. Many organizational databases typically contain numerous attributes that could lend themselves to potentially thousands of linear combinations. In this case, the level of security provided for linear combinations of confidential attributes could be very low even if the level of security provided for a single confidential attribute is adequate.

Value disclosure represents the situation where snoopers are able to estimate or infer the value of a certain confidential numerical attribute of an entity or a group of entities with a level of accuracy greater than a pre-specified level. In our scenario, all numerical attribute values are modeled by multivariate normal distributions. Here the multivariate normal distribution itself is not considered confidential information; only the parameters µ, Σ may be considered confidential. The first issue is how to check whether a given set of µ, Σ, which are used for data generation, provides adequate security for confidential numerical attributes for an entity or a group of entities. The second issue is how to modify µ, Σ when they violate privacy and confidentiality requirements.
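The wages example mentioned above (mean 70k, standard deviation 10k) can be checked directly; a tiny sketch using the standard 1.96 normal quantile:

```python
# 95% coverage interval for a normal attribute, as a snooper could derive it.
mu, sigma = 70.0, 10.0        # wages in $k: mean 70k, std 10k (from the example)
z = 1.96                      # 97.5% quantile of the standard normal
lo, hi = mu - z * sigma, mu + z * sigma
print(f"[{lo:.1f}k, {hi:.1f}k]")   # -> [50.4k, 89.6k]
```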
5.1 The General Location Model Revisited

Let C = {C_1, C_2, · · · , C_q} denote a set of categorical attributes and Z = {Z_1, Z_2, · · · , Z_p} a set of numerical ones in a table with n entries. Suppose C_j takes possible domain values 1, 2, · · · , d_j; the categorical data C can then be summarized by a contingency table with total number of cells D = ∏_{j=1}^{q} d_j. Let y = {y_d : d = 1, 2, · · · , D} denote the number of entries in each cell. Clearly ∑_{d=1}^{D} y_d = n. The general location model [Schafer 1997] is defined in terms of the marginal distribution of C and the conditional distribution of Z given C. The former is described by a multinomial distribution on the cell counts y,

  y | π ∼ M(n, π) = (n! / (y_1! · · · y_D!)) π_1^{y_1} · · · π_D^{y_D}

where π = {π_d : d = 1, 2, · · · , D} is an array of cell probabilities corresponding to y_d. For each cell C_d, d = 1, 2, · · · , D, defined by the categorical attributes C, the numerical attributes Z are then modeled as conditionally multivariate normal:

  f(z | C_d) = (2π)^{-p/2} |Σ_d|^{-1/2} exp(-(1/2)(z - µ_d)^T Σ_d^{-1} (z - µ_d))

where the p-dimensional vector µ_d represents the expected value of the random vector z = (z_1, z_2, · · · , z_p)^T for cell C_d, and the p × p matrix Σ_d is its variance-covariance matrix. The parameters of the general location model can be written as θ_d = (π_d, µ_d, Σ_d), d = 1, 2, · · · , D. The maximum likelihood estimates of θ are as follows:

  π̂_d = y_d / n
  µ̂_d = y_d^{-1} ∑_{i=1}^{y_d} z_i
  Σ̂_d = y_d^{-1} ∑_{i=1}^{y_d} (z_i - µ̂_d)(z_i - µ̂_d)^T        (5.1)

Here we would emphasize that it is feasible to model various data using the general location model even though a group of data may follow some other distribution (e.g., Zipf, Poisson, Gamma) in practice [Schafer 1997].
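The maximum likelihood estimates of Equation 5.1 can be sketched for a toy two-cell table; the synthetic data below are assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy table: one categorical attribute (2 cells) and p = 2 numerical attributes.
n = 1000
cells = rng.integers(0, 2, size=n)                 # cell index d for each record
Z = rng.standard_normal((n, 2)) + cells[:, None] * 3.0

# Maximum likelihood estimates of theta_d = (pi_d, mu_d, Sigma_d)  (Eq. 5.1).
theta = {}
for d in range(2):
    Zd = Z[cells == d]
    y_d = len(Zd)
    pi_d = y_d / n
    mu_d = Zd.mean(axis=0)
    Sigma_d = (Zd - mu_d).T @ (Zd - mu_d) / y_d    # MLE uses 1/y_d, not 1/(y_d - 1)
    theta[d] = (pi_d, mu_d, Sigma_d)

print(theta[0][0] + theta[1][0])                   # cell probabilities sum to 1
```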
As we can see, we define a multivariate normal distribution for the data at the finest level, and data at higher levels can be taken as a mixture of multivariate normal distributions; hence we can, in theory, use a mixture of multivariate normal distributions to model other distributions. It is straightforward to see that we can easily generate a dataset when the parameters of the general location model are given. Generally, it involves two steps. First, we determine the number of tuples y_d in each cell C_d and generate y_d tuples; all y_d tuples from a cell share the same categorical attribute values, inherited from the cell's location in the contingency table. Second, we apply the multivariate normal distribution with the corresponding mean vector and covariance matrix to generate numerical attribute values for the tuples in that cell.

5.2 Disclosure Controls For Numerical Data

Value disclosure represents the situation where snoopers are able to estimate or infer the value of a certain confidential numerical attribute of an entity or a group of entities with a level of accuracy greater than a pre-specified level. Here an entity or a group of entities can be characterized by the cell they are located in. In our context, all numerical attribute values are generated from multivariate normal distributions. As we discussed before, the multivariate normal distribution itself is not considered confidential information; only the parameters µ_d, Σ_d which are used for data modeling may contain confidential information. The first issue is how to check whether a given set of µ_d, Σ_d, which are used for data generation, provides adequate security for confidential numerical attributes for an entity or a group of entities. The second issue is how to modify µ_d, Σ_d when they violate privacy and confidentiality requirements.
5.2.1 Basic Disclosure Scenario

From Result 1 in 3.6.2, we know the ellipsoid {z : (z - µ)^T Σ^{-1} (z - µ) ≤ χ²_p(α)}, traced by the paths of z values, contains a fixed percentage, (1 - α)100%, of customers. In our scenario, snoopers may use various techniques to estimate and predict the confidential values of individual customers. However, all the confidential information which snoopers can learn is the bound of this ellipsoid. Assume E is the ellipsoid from the original data z at a given confidence level 1 - α. From the perturbed data z̃, snoopers can derive the ellipsoid Ê. Equation 5.2 defines the measure of disclosure of z when z̃ is given:

  D(z | z̃) = |vol(E ∩ Ê)| / |vol(E ∪ Ê)|        (5.2)

Here compromise is said to occur when D(z | z̃) is greater than a threshold τ specified by the database owner. The greater the D(z | z̃), the closer the estimates are to the true distribution, and the higher the chance of disclosure. In other words, if the ellipsoid learned by snoopers is close enough to that specified by the database owner, we say partial disclosure occurs. To compute the volume of a density contour, we have the results shown in Proposition 3. Please note that if our interest is in just a few confidential attributes (say s attributes), we can easily project the ellipsoid in the original p-dimensional space to the lower s-dimensional space by replacing z, µ, and Σ in Proposition 3 with z_s, µ_s, and Σ_s respectively.

Proposition 3 (Volume of density contour) The volume of an ellipsoid {z : (z - µ)^T Σ^{-1} (z - µ) ≤ χ²_p(α)} is given by

  vol(E) = η (χ²_p(α))^{p/2} |Σ|^{1/2}

or

  vol(E) = η (χ²_p(α))^{p/2} ∏_{i=1}^{p} √λ_i

where η is the volume of the unit ball in R^p, and λ_i is the i-th eigenvalue of the matrix Σ.

Proof. From Result 2 in 3.6.2, we know the volume of an ellipsoid {z : (z - µ)^T A^{-1} (z - µ) ≤ 1} is given by vol(E) = η |A|^{1/2}.
We replace A with Σ/χ²_p(α), and obtain

  vol(E) = η (χ²_p(α))^{p/2} |Σ|^{1/2}

From the spectral decomposition of Σ shown in Equation 5.3,

  Σ = ∑_{i=1}^{p} λ_i e_i e_i^T = P Λ P^T        (5.3)

we get |Σ| = |P Λ P^T| = |P P^T||Λ|. As P P^T = I, we have |Σ| = |Λ|. Since |Σ|^{1/2} = |Λ|^{1/2} = ∏_{i=1}^{p} √λ_i, it follows that

  vol(E) = η (χ²_p(α))^{p/2} ∏_{i=1}^{p} √λ_i

Figure 5.1: A constant density contour for a bi-variate normal distribution

Example 8 Figure 5.1 shows one constant density contour containing 95% of the probability under the ellipse surface for a bi-variate vector z = (z_1, z_2)^T, which follows a bi-variate normal distribution N(µ, Σ) with µ = (µ_1, µ_2)^T and Σ = [σ_11 σ_12; σ_21 σ_22]. Here λ = (λ_1, λ_2) are the eigenvalues of the covariance matrix Σ, and the two axes have lengths c√λ_1 and c√λ_2 respectively, where c = √χ²_2(0.05) = √5.99 = 2.45. We can see that the major axis of the ellipse is associated with the largest eigenvalue (λ_1). The size of this ellipse is proportional to √(λ_1 λ_2), as χ²_2(0.05) = 5.99.

Figure 5.2: Confidence Intervals ((a) Roy's confidence intervals; (b) Bonferroni confidence intervals)

However, to evaluate the measure of disclosure D(z | z̃) shown in Equation 5.2, we need to compute the volume of the intersection (or union) of two ellipsoids. This problem is known to be NP-hard, and some approximation techniques were surveyed in [Henrion, Tarbouriech and Arzelier 2001]. One heuristic we apply here is to use a hyper-rectangle to approximate the ellipsoid, since computing the intersection (or union) of two hyper-rectangles in a high dimensional space is straightforward. Figure 5.2(a) shows Roy's rectangle, formed by the projection of the ellipse on z_1 and z_2, while Figure 5.2(b) shows a Bonferroni rectangle [Johnson and Wichern 1998], formed by simultaneously testing the hypotheses about z_1 and z_2 with an overall conservative significance level (α/2). We are conducting a comparison between our method and other approximation techniques.
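Proposition 3 and the measure D(z | z̃) of Equation 5.2 can be illustrated numerically; the sketch below uses two assumed bi-variate ellipsoids and a simple Monte Carlo estimate of the intersection/union ratio, as an alternative to the hyper-rectangle heuristic:

```python
import math
import numpy as np

rng = np.random.default_rng(4)
chi2 = 5.99                      # chi^2_2(0.05), so p = 2 and eta = pi

# Two illustrative 95% ellipsoids: E from the original data, E-hat as a
# snooper's estimate from perturbed data (made-up parameters for the sketch).
mu1, S1 = np.array([0.0, 0.0]), np.array([[4.0, 1.0], [1.0, 2.0]])
mu2, S2 = np.array([0.5, 0.3]), np.array([[3.5, 0.8], [0.8, 2.2]])

# Proposition 3: vol(E) = eta * chi2^(p/2) * sqrt(det(Sigma)),
# equivalently via the product of the square roots of the eigenvalues.
def volume(Sigma, p=2, eta=math.pi):
    lam = np.linalg.eigvalsh(Sigma)
    return eta * chi2 ** (p / 2) * math.prod(math.sqrt(l) for l in lam)

vol1 = volume(S1)
vol1_det = math.pi * chi2 * math.sqrt(np.linalg.det(S1))

# Monte Carlo estimate of D(z|z~) = vol(E ∩ E-hat) / vol(E ∪ E-hat)  (Eq. 5.2).
def inside(pts, mu, Sigma):
    d = pts - mu
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d) <= chi2

pts = rng.uniform(-8.0, 8.0, size=(200_000, 2))   # box covering both ellipsoids
in1, in2 = inside(pts, mu1, S1), inside(pts, mu2, S2)
D = (in1 & in2).sum() / (in1 | in2).sum()
print(f"vol(E) = {vol1:.2f}, D(z|z~) = {D:.3f}")
```

The two volume formulas of Proposition 3 agree, and D close to 1 indicates that the snooper's ellipsoid nearly coincides with the owner's.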
In many applications, the database owner usually specifies a confidential range [z_l, z_u] for a confidential numerical attribute z of an entity or a group of entities. In this case we use the projection of the ellipse on each axis to check whether disclosure occurs. Similarly, the snoopers can learn a confidence interval [ẑ_l, ẑ_u] for each numerical attribute by projecting their ellipse on each axis. If the confidence interval [ẑ_l, ẑ_u] derived by snoopers is close to the confidential range [z_l, z_u] specified by the database owner, we say value disclosure occurs. Like the measure we defined in 3.6.2, Equation 5.4 defines the measure of disclosure for one confidential attribute:

  d(z | ẑ) = |[z_l, z_u] ∩ [ẑ_l, ẑ_u]| / |[z_l, z_u] ∪ [ẑ_l, ẑ_u]|        (5.4)

To compute the projection of an ellipsoid on each axis, we have the results shown in Proposition 4.

Proposition 4 (Simultaneous Confidence Intervals) Let the vector Z be distributed as N_p(µ, Σ) with |Σ| > 0. The projection of the ellipsoid {z : (z - µ)^T Σ^{-1} (z - µ) ≤ χ²_p(α)} on the axis z_i = (0, · · · , 1, · · · , 0)^T (only the i-th element is 1; all other elements are 0) has bound

  [µ_i - √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)]

Proof. From Result 3 in 3.6.2, we know the projection of an ellipsoid {z : z^T A^{-1} z ≤ c²} on a given unit vector ℓ has length len = c √(ℓ^T A ℓ). We replace A with Σ = (σ_jk), ℓ with z_i = (0, · · · , 1, · · · , 0)^T, and c with √χ²_p(α); then the length of the projection is len = √(χ²_p(α) σ_ii). Considering the center of the ellipsoid, we have the bound [µ_i - √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)].

To check whether a given distribution of z may incur value disclosure, our strategy is to compare the disclosure measure D(z | z̃) or d(z | ẑ) with the threshold τ specified by the database owner. If disclosure occurs, we need to modify the parameters µ, Σ.
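Proposition 4 and the interval measure d(z | ẑ) of Equation 5.4 can be sketched as follows; the parameters µ, Σ and the owner-specified range are assumed for the example:

```python
import numpy as np

# Per-attribute projection interval from Proposition 4:
# [mu_i - sqrt(chi2_p(alpha)*sigma_ii), mu_i + sqrt(chi2_p(alpha)*sigma_ii)].
chi2 = 5.99                      # chi^2_2(0.05), p = 2
mu = np.array([70.0, 40.0])
Sigma = np.array([[100.0, 30.0], [30.0, 64.0]])

half = np.sqrt(chi2 * np.diag(Sigma))
intervals = np.stack([mu - half, mu + half], axis=1)

def d_measure(a, b):
    """|a ∩ b| / |a ∪ b| for two intervals a, b (0 if disjoint), as in Eq. 5.4."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union

owner_range = (50.0, 90.0)       # confidential range [z_l, z_u] for attribute 1
d = d_measure(intervals[0], owner_range)
print(intervals[0], f"d = {d:.3f}")
```

A value of d near 1 means the snooper's interval almost coincides with the owner's confidential range, i.e., value disclosure.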
As we know from Proposition 4, the mean vector µ determines the center of the ellipsoid or the center of the projection interval, while the covariance matrix Σ determines the size of the ellipsoid or the length of the projection interval. Since changing µ would significantly affect the data distribution (and hence the accuracy of subsequent analysis or mining), in the remainder of this section we focus only on how to change the covariance matrix Σ to satisfy users' security requirements. It is easy to see from Proposition 4 that the confidence interval for each attribute (obtained by projecting on each axis) depends only on µ_i and σ_ii, and is independent of the covariance values σ_ij, i ≠ j. Figure 5.3 illustrates how the shape of the ellipse changes when we vary its covariance matrix Σ, using one bi-variate normal distribution example. For example, by varying σ_12 while fixing σ_11, σ_22, as shown in Figure 5.3(a), the axes of the ellipse rotate and the ratio between the two axes also changes; however, the projections of the ellipse on the axes z_1, z_2 do not change, as they depend only on σ_11 and σ_22 respectively. Figures 5.3(b) and 5.3(c) illustrate that the projection of the ellipse changes only when the corresponding variance (σ_11 or σ_22) is changed.

From the bound [µ_i - √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)], we can adjust σ_ii to satisfy the given privacy requirement of a confidential attribute Z_i: [z_l, z_u] ⊆ [µ_i - √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)]. Since we keep the mean values unchanged, we have

  χ²_p(α) σ_ii ≥ ((z_u - z_l)/2)²,  i.e.,  σ_ii ≥ (z_u - z_l)² / (4 χ²_p(α))        (5.5)

Discussion. It is clear that the study of a few confidence intervals is no substitute for the full confidence region. However, such a confidence region can be visualized only in two or three dimensions; thus, in higher dimensions we may have to be content with confidence intervals.
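The variance adjustment of Equation 5.5 amounts to a simple lower bound on σ_ii; a small sketch with assumed numbers (the mean is taken at mid-range here so that the coverage check is exact):

```python
import numpy as np

# Enforce Eq. 5.5: sigma_ii >= (z_u - z_l)^2 / (4 * chi2_p(alpha)) so the
# projection interval covers the owner-specified confidential range.
chi2 = 5.99                       # chi^2_2(0.05) (assumed p = 2)
z_l, z_u = 50.0, 90.0             # owner-specified confidential range (assumed)

sigma_min = (z_u - z_l) ** 2 / (4 * chi2)
sigma_ii = max(36.0, sigma_min)   # current variance 36, bumped up if too small

half = np.sqrt(chi2 * sigma_ii)
mu_i = (z_l + z_u) / 2            # mean kept unchanged (mid-range in this toy)
covered = (mu_i - half <= z_l) and (mu_i + half >= z_u)
print(sigma_ii, covered)
```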
In this chapter, we conduct disclosure analysis by comparing the best confidence interval (or region) derived by snoopers with the confidence interval (or region) specified by the database owner at the same confidence level (e.g., 95%). Previous randomization-based approaches check whether the probability of a confidential attribute z ∈ [z_l, z_u] exceeds a pre-defined confidence threshold (e.g., 95%); if so, snoopers can confidently predict the confidential value z within the confidential range, which incurs value disclosure. We can see that these two strategies are equivalent.

5.2.2 Conditional Scenario

Consider a database with k numerical confidential attributes X = (X_1, X_2, · · · , X_k)^T and l non-confidential attributes S = (S_1, S_2, · · · , S_l)^T, where p = k + l. Security is measured by the degree to which a snooper can determine the values of confidential attributes in a specific record through the use of relationships between the non-confidential and confidential attributes. One question we ask here is how much information is contained in the non-confidential attributes and how it affects the variability of the confidential numerical attributes.

Proposition 5 (Conditional normal distribution) ([Johnson and Wichern 1998]) Let Z = (X^T, S^T)^T be distributed as N_p(µ, Σ) with

  µ = (µ_X, µ_S)^T,  Σ = [Σ_XX Σ_XS; Σ_SX Σ_SS]

and |Σ_SS| > 0. Then the conditional distribution of X given S = s is normal with mean µ_X + Σ_XS Σ_SS^{-1}(s - µ_S) and covariance Σ_XX - Σ_XS Σ_SS^{-1} Σ_SX.

Proposition 5 shows that the conditional distribution of X given S is also a multivariate normal distribution. Furthermore, the conditional covariance Σ_XX - Σ_XS Σ_SS^{-1} Σ_SX does not depend on the values of the conditioning variables. Hence we can simply apply the results of Propositions 3 and 4, replacing Σ with the conditional covariance Σ_XX - Σ_XS Σ_SS^{-1} Σ_SX, to conduct conditional value disclosure analysis. Let A = Σ_XX - Σ_XS Σ_SS^{-1} Σ_SX.
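Proposition 5 and the predictability eigenvalue λ discussed next can be computed directly; all matrices and the conditioning value below are assumed for illustration:

```python
import numpy as np

# Conditional distribution of X given S = s (Proposition 5), and the largest
# eigenvalue of Sxx^{-1} Sxs Sss^{-1} Ssx as a predictability measure.
mu_X, mu_S = np.array([70.0, 30.0]), np.array([40.0, 10.0])
Sxx = np.array([[100.0, 20.0], [20.0, 80.0]])
Sxs = np.array([[30.0, 10.0], [5.0, 25.0]])
Sss = np.array([[64.0, 8.0], [8.0, 36.0]])
Sss_inv = np.linalg.inv(Sss)

# Mean and covariance of X | S = s.
s = np.array([48.0, 12.0])
cond_mean = mu_X + Sxs @ Sss_inv @ (s - mu_S)
cond_cov = Sxx - Sxs @ Sss_inv @ Sxs.T

# lambda: proportion of the variance of X predictable from S.
lam = np.max(np.linalg.eigvals(np.linalg.inv(Sxx) @ Sxs @ Sss_inv @ Sxs.T).real)

# Rough 95% interval half-width for the first confidential attribute.
sigma2_x = Sxx[0, 0]
half = 1.96 * ((1 - lam) * sigma2_x) ** 0.5
print(f"lambda = {lam:.3f}, rough interval half-width = {half:.2f}")
```

Knowing S shrinks both the conditional covariance and the rough interval, which is exactly the conditional-disclosure risk the section analyzes.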
Using the same strategy (5.5) proposed in 5.2.1, we adjust the diagonal entries (variances) of A to satisfy the given privacy requirement; the adjusted matrix is denoted as Ã. The adjusted covariance matrix for the confidential attributes is then naturally derived:

  Σ̃_XX = Σ_XS Σ_SS^{-1} Σ_SX + Ã        (5.6)

We will discuss the strategy to adjust the covariance matrix Σ_XS or Σ_SX in Section 5.2.3. In general, given a confidential variable x whose variance is σ_x², the largest eigenvalue λ of Σ_XX^{-1} Σ_XS Σ_SS^{-1} Σ_SX gives the proportion of the variance (fluctuation) of the variable x that is predictable from the non-confidential attributes S. The eigenvalue is a measure of how well the non-confidential attributes can predict the confidential attribute x; e.g., λ = 0.852 means that about 85% of the total variation in x can be explained by the linear relationship between S and x, while the remaining 15% of the total variation in x is unexplained. Hence, a rough estimate of the smallest standard error can be determined as [(1 - λ)σ_x²]^{0.5}. Based on this estimate of the standard error, a rough 95% confidence interval for x is given as µ̂_x ± 1.96[(1 - λ)σ_x²]^{0.5}.

Figure 5.3: Density contour with varied covariance matrix ((a) vary σ_12; (b) vary σ_11; (c) vary σ_22)

5.2.3 Combination Scenario

Many organizational databases typically contain numerous attributes that could lend themselves to potentially thousands of linear combinations. In this case, the threat of combination disclosure can be magnified further. For example, the prediction of the linear combination Total Income = Wages + Interests + Dividends is likely to have a higher level of accuracy than that of each individual attribute.
The approach we apply here is based on Canonical Correlation Analysis (CCA) [Johnson and Wichern 1998]. CCA can be used to measure the maximum proportion of the variance that can be explained in any linear combination of the confidential attributes X using a linear combination of the known non-confidential attributes S. The main task of CCA is to summarize the associations between the X and S sets in terms of a few carefully chosen covariances rather than all the covariances in Σ_XS. We denote the respective linear combinations by u = a^T x and v = b^T s. The correlation between u and v is given by

  Corr(u, v) = a^T Σ_XS b / [(a^T Σ_XX a)(b^T Σ_SS b)]^{1/2}

Out of the infinite number of linear combinations, we find the set of linear combinations which maximizes the correlation Corr(u, v). The canonical variate pair u_i = e_i^T Σ_XX^{-1/2} x and v_i = f_i^T Σ_SS^{-1/2} s maximizes Corr(u_i, v_i) = √λ_i, where i = 1, · · · , l. Here λ_1 ≥ · · · ≥ λ_l are the eigenvalues of Σ_SS^{-1/2} Σ_SX Σ_XX^{-1} Σ_XS Σ_SS^{-1/2}, and e_1, · · · , e_l are the associated normalized eigenvectors, as shown in Equation 5.7:

  A = Σ_SS^{-1/2} Σ_SX Σ_XX^{-1} Σ_XS Σ_SS^{-1/2} = λ_1 e_1 e_1^T + · · · + λ_l e_l e_l^T        (5.7)

The largest eigenvalue λ_1 is the squared canonical correlation coefficient, which represents the most general measure of inferential value disclosure for any combination. In other words, 1 - λ_1 represents the worst-case security. When some λ_i is greater than the threshold λ* specified by the database owner, combination disclosure exists for some potential combination of confidential attributes. In this case, we need to change the parameters Σ_SX or Σ_XX so that all new eigenvalues are less than or equal to the threshold λ*. Our approach is to set those eigenvalues λ_i > λ* to λ* (hence no combination disclosure exists) and keep the other eigenvalues (λ_i ≤ λ*) and all eigenvectors unchanged. We obtain a new matrix Ã by applying the inverse of the spectral decomposition shown in Equation 5.7.
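The eigenvalue-capping step (set every λ_i > λ* to λ*, keep the eigenvectors, and invert Equation 5.7) can be sketched as follows; the matrices and the threshold λ* are assumed for the example:

```python
import numpy as np

# Cap canonical eigenvalues above the owner threshold lambda* and rebuild
# A~ from the unchanged eigenvectors, as in Eq. 5.7.
Sxx = np.array([[100.0, 20.0], [20.0, 80.0]])
Sxs = np.array([[30.0, 10.0], [5.0, 25.0]])
Sss = np.array([[64.0, 8.0], [8.0, 36.0]])
lam_star = 0.15                    # owner-specified threshold (assumed)

# Sss^{-1/2} via eigendecomposition of the symmetric matrix Sss.
w, V = np.linalg.eigh(Sss)
Sss_inv_half = V @ np.diag(w ** -0.5) @ V.T

A = Sss_inv_half @ Sxs.T @ np.linalg.inv(Sxx) @ Sxs @ Sss_inv_half
lam, E = np.linalg.eigh(A)
lam_capped = np.minimum(lam, lam_star)
A_tilde = E @ np.diag(lam_capped) @ E.T

print(np.linalg.eigvalsh(A).max(), np.linalg.eigvalsh(A_tilde).max())
```

Because only the offending eigenvalues move and all eigenvectors stay fixed, Ã is the closest such matrix to A in this spectral sense, matching the design goal stated above.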
The derived matrix à is guaranteed to satisfy users’ security requirement for all the possible combinations. Furthermore, as we keep all eigenvectors and those other eigenvalues (λi ≤ λ∗ ) unchanged in our approach, the density contour of modified distribution will be closest to that of the original one. 104 From Equation 5.7, we know Ã, which will satisfy users’ security requirements, is determined by ΣXX and ΣXS . So we can adjust either ΣXX or ΣXS to achieve Ã. Note ΣSS should be kept unchanged as we assume data of non-confidential attributes are non-perturbed. To adjust ΣXX , we simply set Σ̃XX as −1/2 −1/2 Σ̃XX = ΣXS ΣSS Ã−1 ΣSS ΣSX −1/2 (5.8) −1/2 However, there is no direct method to adjust ΣXS . From ΣSS Σ̃SX Σ−1 XX Σ̃XS ΣSS = Ã, we expand the left side of equation and get an l × l matrix. Each element of this matrix is a quadratic function fij (x11 , · · · xlk ) which equals to the corresponding ãij . Then we get l × l sub-quadratic equations with l × k variables. The problem becomes the following optimization problem: Problem 1 Minimize F (x11 , · · · , xlk ) = Σli=1 Σkj=1 (fij (x11 , · · · , xlk ) − ãij )2 subject to xij ≥ 0. 5.3 Summary Various disclosure scenarios may exist in the general databases. In order to preserve the privacy for individual users, a general location model was built for database application testing. The numerical data in the general databases are modeled with different multivariate normal distributions. In this chapter, we focused on the disclosure control for the numerical data in this model. We presented how to satisfy users’ privacy requirements by adjusting parameters of the model learned. Discussion has been made under three different scenarios: basic scenario, conditional scenario and combination scenario. In the basic scenario, disclosure is control by considering confidential attributes alone. 
In the conditional scenario, non-confidential attributes are considered and the conditional distributions of confidential attributes are adjusted to satisfy privacy concerns. In the combination scenario, potential combination disclosure is controlled using canonical correlation analysis.

CHAPTER 6: CONCLUSIONS AND FUTURE WORK

6.1 Summary

Driven by one of the major policy issues of the information era, the right to privacy, Privacy-Preserving Data Mining (PPDM) has become one of the newest trends in privacy and security research. Great interest has come from both academia and industry: a) the recent proliferation of PPDM techniques is evident; b) the interest from academia and industry has grown quickly; c) separate workshops and conferences devoted to this topic have emerged in the last few years. Privacy issues have posed new challenges for novel uses of data mining technology. Instead of releasing the original data directly for analysis, a protection process must be applied to the sensitive data before sharing the data for mining. One of the primary tools for such a data protection process is randomization. We expect that privacy and data utility can be well balanced in this context. In this study, we addressed the problem of balancing privacy and data utility by analyzing different perturbation models. In the additive-noise-based perturbation model, the spectral-filtering-based technique has recently been investigated as a major means of point-wise data reconstruction [Huang, Du and Chen 2005; Kargupta et al. 2003]. It was empirically shown that under certain conditions this technique may be exploited by attackers to breach the privacy protection offered by randomization-based privacy preserving data mining methods. We presented a theoretical study evaluating privacy breaches when the spectral-filtering-based technique is applied. We gave an explicit upper bound of reconstruction accuracy in terms of the Frobenius norm.
This upper bound may be exploited by attackers to determine how close their estimates are to the original data when the spectral-filtering-based technique is used, which poses a serious threat of privacy breaches. We also derived an explicit lower bound of reconstruction accuracy in terms of the Frobenius norm. This lower bound can help users determine how much and what kind of noise should be added when a tolerated privacy breach threshold is given. In the projective-transformation-based perturbation model, isometric projection was proven to be invariant for many popular classifiers. However, it was suggested in [Liu, Giannella and Kargupta 2006] that the non-isometric projection approach is effective in preserving privacy, since it is resilient to the PCA attack which was designed for the distance-preserving projection approach. We proposed an AK-ICA attack, which can be exploited by attackers to breach privacy from non-isometrically transformed data. Our theoretical analysis and empirical evaluations have shown that the proposed attack poses a threat to all projection-based privacy preserving methods when a small sample data set is available to attackers. We argue that this is a genuine concern that needs to be addressed in practice. Considering all the potential threats to microdata, a model-based approach was designed. Instead of controlling privacy based on the actual data, the model-based approach fits the original data into a carefully designed model and controls privacy based on the parameter space of the model. We focused on the numerical part of our model and provided disclosure control schemes for three different scenarios: the basic scenario, the conditional scenario, and the combination scenario.

6.2 Contributions

We now summarize the main contributions of this research. As parts of a novel framework for privacy preserving data mining, our contributions lie in:

1. Bound analysis of the spectral filtering technique.
In particular, we first derived an upper bound for the Frobenius norm of the reconstruction error using matrix perturbation theory. This upper bound may be exploited by attackers to determine how close their estimates are to the original data using spectral-filtering-based techniques, which poses a serious threat of privacy breaches. We also derived a lower bound for the reconstruction error, which can help data owners determine how much noise should be added to satisfy a given threshold of tolerated privacy breach. In addition, an improved data reconstruction strategy for noise filtering was given. In the context of additive-noise-based perturbation, our proposed strategy compares the benefit due to the inclusion of one component with the loss due to the additional projected noise. We showed that such a strategy is expected to give an approximately optimal reconstruction from the perturbed data.

2. An effective attacking method to break general projective-transformation-based perturbation. By combining a known small subset of the original data, which is reasonable to assume in practice, our algorithm, AK-ICA, can effectively estimate the whole original data set with high accuracy. A further advantage of this attacking method is its robustness to arbitrary projective transformations. All the previous perturbation methods in this context are vulnerable to our attack. Therefore, current projective-transformation-based privacy preserving data mining techniques may need careful scrutiny in order to prevent privacy breaches when a subset of sample data is available.

3. A measure for the disclosure of individual privacy. We proposed a way to measure how close the IQR obtained by attackers or snoopers is to an individual's privacy interval for a particular sensitive variable. We also extended this measure to the multivariate case based on the confidence region.

4. Disclosure control methods for various scenarios in model-based privacy preserving data mining.
General databases typically contain numerous attributes with different privacy concerns. To satisfy the different privacy requirements of data providers, we analyzed potential privacy disclosures in several scenarios and found ways to adjust the parameters of the model learned from the underlying data.

6.3 Future Research

Several directions can be pursued as a continuation of this research. In this section we discuss some extensions we plan to undertake and technical challenges we would like to address.

1. To explore potential attacks on additive-noise-based randomization. Spectral filtering has proven to be an effective tool for estimating original sensitive values from perturbed data. However, the noise is hard to filter out when it is correlated with the original data or when the signal-to-noise ratio is low. Improving existing data reconstruction algorithms for the additive-noise-based model and exploring other possible attacks that combine various techniques (e.g., statistical approaches, signal processing) will be an attractive direction for future research.

2. To explore how the properties of a known sample affect estimation accuracy under randomization. In the randomization approach, the original data is perturbed in various ways and only the perturbed data are provided for analysis. If a small set of the sensitive data is available in practice, this known sample may be combined with the perturbed data to threaten privacy. The known sample could be some columns (insensitive attributes), some rows (individual records), or, more generally, some cell values. Exploring how the known sample affects privacy will require combining knowledge from several domains, including randomization, approximation in linear algebra, and multivariate statistics.

3. To investigate end-user-oriented privacy preserving data mining. Each end-user may have different privacy concerns when sharing data.
For example, Alice may only allow her salary to be perturbed within an acceptable range, or her zip code to be transformed to one in another state. Previous randomization models perturb the original data set as a whole without accommodating such individual privacy concerns. Due to the complex context and the dependence between the original values and the noise (or transformation), it is challenging to build a good data mining model that extracts interesting patterns from the observed perturbed data.

4. To investigate the effect of randomization on the utility of mining tasks. An essential goal of privacy preserving data mining is to balance the privacy preserved against the utility of the data. Adding more noise may make the data more secure, but it may also sacrifice its worth to the miner. In future work, we will emphasize data utility by investigating different mining tasks based on the distribution learned from the perturbed data.

5. To apply the randomized response (RR) technique to numerical data in privacy preserving data mining. Randomized response is considered an efficient tool for protecting privacy. Early RR models were designed for categorical data, which can be naturally partitioned into mutually exclusive and exhaustive classes. Several other models were later proposed as extensions to numerical data, together with the corresponding statistical analysis [Poole 1974; Duffy and Waterton 1984; Poole and Clayton 1982]. Since numerical data is the focus of this study, in future research we would like to address mining issues in this context; in particular, we will investigate the accuracy of mining tasks performed on the scrambled responses.

6. To enhance privacy in existing systems by combining multiple security and privacy-preserving techniques, e.g., randomization, cryptography, secure multiparty computation, and access control.

REFERENCES

Adam, N. and Wortman, J. 1989 Security-control methods for statistical databases.
ACM Computing Surveys, 21, Nr. 4, 515–556
Aggarwal, C. and Yu, P. 2004 A condensation approach to privacy preserving data mining. In Proceedings of the International Conference on Extending Database Technology. Springer, Berlin/Heidelberg, 183–199
Agrawal, D. and Aggarwal, C. C. 2001 On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th ACM Symposium on Principles of Database Systems.
Agrawal, R. and Srikant, R. 2000 Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Dallas, Texas, 439–450
Ashley, P. et al. 2002 E-P3P privacy policies and privacy authorization. In Proceedings of the 2002 ACM Workshop on Privacy in the Electronic Society. New York, NY, USA: ACM Press, ISBN 1-58113-633-1, 103–109
Backes, M., Pfitzmann, B. and Schunter, M. 2003 A toolkit for managing enterprise privacy policies. In 8th European Symposium on Research in Computer Security (ESORICS 2003), 162–180
Benenson, Z., Freiling, F. and Kesdogan, D. 2005 Secure multi-party computation with security modules. In Proceedings of SICHERHEIT 2005, 41–52
Cardoso, J. 1999 High-order contrasts for independent component analysis. Neural Computation, 11, Nr. 1, 157–192
Chang, L. and Moskowitz, I. S. 2000 An integrated framework for database privacy protection. In Proceedings of the Fourteenth Annual IFIP WG 11.3 Working Conference on Database Security, 161–172
Chaudhuri, A. and Mukerjee, R. 1988 Randomized Response: Theory and Techniques. Marcel Dekker, Inc.
Chen, K. and Liu, L. 2005 Privacy preserving data classification with rotation perturbation. In Proceedings of the 5th IEEE International Conference on Data Mining. Houston, TX
Chen, K. and Liu, L. 2007 Towards attack-resilient geometric data perturbation. In Proceedings of the 7th SIAM International Conference on Data Mining. Minneapolis, Minnesota
Clifton, C. et al.
2003 Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4, Nr. 2, 28–34
European Commission 1998a Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data. URL: http://ec.europa.eu/justice_home/fsj/privacy/law/index_en.htm
U.S. Federal Trade Commission 1998b Children's Online Privacy Protection Act. URL: http://www.ftc.gov/ogc/coppa1.htm
U.S. Congress 1996 Health Insurance Portability and Accountability Act. URL: http://www.cms.hhs.gov/HIPAAGenInfo/
U.S. Congress 1999 Gramm-Leach-Bliley Act. URL: http://www.ftc.gov/privacy/privacyinitiatives/glbact.html
Conover, W. 1998 Practical Nonparametric Statistics. Wiley
Cox, L. 1980 Suppression methodology and statistical disclosure control. Journal of the American Statistical Association, 75, Nr. 370, 377–385
Dalenius, T. and Reiss, S. P. 1982 Data-swapping: a technique for disclosure control. Journal of Statistical Planning and Inference, 6, 73–85
Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977 Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, Nr. 1, 1–38
Denning, D. E., Schlörer, J. and Wehrle, E. 1982 Memoryless inference controls for statistical databases, 38–45
Domingo-Ferrer, J. and Mateo-Sanz, J. M. 2002 Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14, Nr. 1, 189–201, ISSN 1041-4347
Domingo-Ferrer, J. and Torra, V. 2003 On the connections between statistical disclosure control for microdata and some artificial intelligence tools. Information Sciences - Informatics and Computer Science: An International Journal, 151, 153–170, ISSN 0020-0255
Du, W. and Zhan, Z. 2003 Using randomized response techniques for privacy-preserving data mining. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
ACM Press New York, NY, USA, 505–510
Du, W., Han, Y. S. and Chen, S. 2004 Privacy-preserving multivariate statistical analysis: linear regression and classification. In Proceedings of the 4th SIAM International Conference on Data Mining, 222–233
Duffy, J. C. and Waterton, J. J. 1984 Randomized response models for estimating the distribution function of a quantitative character. International Statistical Review, 52, Nr. 2, 165–171
Duncan, G. T. and Mukherjee, S. 2000 Optimal disclosure limitation strategy in statistical databases: deterring tracker attacks through additive noise. Journal of the American Statistical Association, 95, 720–729
Evans, T., Zayatz, L. and Slanta, J. 1998 Using noise for disclosure limitation of establishment tabular data. Journal of Official Statistics, 14, Nr. 4, 537–551
Evfimievski, A. et al. 2002 Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Canada, 217–228
Fienberg, S. E. and McIntyre, J. 2003 Data swapping: variations on a theme by Dalenius and Reiss. National Institute of Statistical Sciences, Research Triangle Park, NC, Technical report
Fienberg, S. E., Makov, U. E. and Steele, R. J. 1998 Disclosure limitation using perturbation and related methods for categorical data. Journal of Official Statistics, 14, Nr. 4, 485–502
Fischer-Hübner, S. 2001 IT-Security and Privacy: Design and Use of Privacy-Enhancing Security Mechanisms. Lecture Notes in Computer Science 1958, ISBN 3-540-42142-4
Fisz, M. 1963 Probability Theory and Mathematical Statistics. John Wiley and Sons, Inc.
McLachlan, G. J. and Krishnan, T. 1998 The EM Algorithm and Extensions. The Statistician, 47, Nr. 3, 554–555
Gennaro, R., Rabin, M. O. and Rabin, T. 1998 Simplified VSS and fast-track multiparty computations with applications to threshold cryptography.
In PODC '98: Proceedings of the Seventeenth Annual ACM Symposium on Principles of Distributed Computing. New York, NY, USA: ACM Press, ISBN 0-89791-977-7, 101–111
Gilburd, B., Schuster, A. and Wolff, R. 2004 A new privacy model and association-rule mining algorithm for large-scale distributed environments. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press New York, NY, USA
Goldreich, O., Micali, S. and Wigderson, A. 1987 How to play any mental game. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, 218–229
Gouweleeuw, J. et al. 1998 Post randomisation for statistical disclosure control: theory and implementation. Journal of Official Statistics, 14, Nr. 4, 463–478
Grötschel, M., Lovász, L. and Schrijver, A. 1988 Geometric Algorithms and Combinatorial Optimization. Springer, New York
Guo, L., Guo, S. and Wu, X. 2007 Privacy preserving market basket data analysis. In Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases. Warsaw, Poland, 103–114
Guo, S. and Wu, X. 2006 On the use of spectral filtering for privacy preserving data mining. In Proceedings of the 21st ACM Symposium on Applied Computing. Dijon, France
Guo, S. and Wu, X. 2007 Deriving private information from arbitrarily projected data. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Nanjing, China
Guo, S., Wu, X. and Li, Y. 2006a Deriving private information from perturbed data using IQR-based approach. In 2nd International Workshop on Privacy Data Management. Atlanta, USA
Guo, S., Wu, X. and Li, Y. 2006b On the lower bound of reconstruction error for spectral filtering based privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06). Berlin, Germany
Henrion, D., Tarbouriech, S. and Arzelier, D.
2001 LMI approximations for the radius of the intersection of ellipsoids: a survey. Journal of Optimization Theory and Applications, 108, Nr. 1
Huang, Z., Du, W. and Chen, B. 2005 Deriving private information from randomized data. In Proceedings of the ACM SIGMOD Conference on Management of Data. Baltimore, MD
Hyvärinen, A., Karhunen, J. and Oja, E. 2001 Independent Component Analysis. John Wiley & Sons
Johnson, R. and Wichern, D. 1998 Applied Multivariate Statistical Analysis. Prentice Hall
Kam, J. B. and Ullman, J. D. 1977 A model of statistical databases and their security. ACM Transactions on Database Systems, 2, Nr. 1, 1–10, ISSN 0362-5915
Kargupta, H. et al. 2003 On the privacy preserving properties of random data perturbation techniques. In Proceedings of the 3rd IEEE International Conference on Data Mining, 99–106
Karjoth, G., Schunter, M. and Waidner, M. 2002 The Platform for Enterprise Privacy Practices: privacy-enabled management of customer data. In Proceedings of the 2nd Workshop on Privacy Enhancing Technologies (PET 2002). Springer-Verlag New York, Inc., 69–84
Karjoth, G. and Schunter, M. 2002 A privacy policy model for enterprises. In Proceedings of the 15th IEEE Workshop on Computer Security Foundations. Washington, DC, USA: IEEE Computer Society, ISBN 0-7695-1689-0, 271
Kozlov, M. K., Tarasov, S. P. and Khachian, L. G. 1979 Polynomial solvability of convex quadratic programming. Soviet Mathematics Doklady, 20, 1108–1111
LeFevre, K., DeWitt, D. J. and Ramakrishnan, R. 2006 Mondrian multidimensional k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering (ICDE'06). Washington, DC, USA: IEEE Computer Society, ISBN 0-7695-2570-9, 25
Li, N., Li, T. and Venkatasubramanian, S. 2007 t-Closeness: privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd International Conference on Data Engineering (ICDE'07). Washington, DC, USA: IEEE Computer Society, ISBN 1-4244-0803-2, 106–115
Lindell, Y. and Pinkas, B.
2002 Privacy preserving data mining. Journal of Cryptology, 15, Nr. 3, 177–206
Liu, K., Giannella, C. and Kargupta, H. 2006 An attacker's view of distance preserving maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06). Berlin, Germany
Liu, K., Kargupta, H. and Ryan, J. 2006 Random projection based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering, 18, Nr. 1, 92–106
Machanavajjhala, A. et al. 2006 l-Diversity: privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering. Atlanta, GA, 24
Malvestuto, F. M., Moscarini, M. and Rafanelli, M. 1991 Suppressing marginal cells to protect sensitive information in a two-dimensional statistical table (extended abstract). In PODS '91: Proceedings of the Tenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. New York, NY, USA: ACM Press, ISBN 0-89791-430-9, 252–258
Meng, D., Sivakumar, K. and Kargupta, H. 2004 Privacy-sensitive Bayesian network parameter learning. In Proceedings of the 4th IEEE International Conference on Data Mining. IEEE Computer Society, Washington, DC, USA, 487–490
Mirsky, L. 1960 Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of Mathematics, 11, Nr. 1, 50–59
Mukherjee, S. and Duncan, G. T. 1997 Disclosure limitation through additive noise data masking: analysis of skewed sensitive data. In HICSS '97: Proceedings of the 30th Hawaii International Conference on System Sciences. Washington, DC, USA: IEEE Computer Society, ISBN 0-8186-7743-0, 581
Oliveira, S. and Zaiane, O. 2004 Achieving privacy preservation when sharing data for clustering. In Proceedings of the Workshop on Secure Data Management in a Connected World. Toronto, Canada, 67–82
Özsoyoglu, G. and Chung, J.
1986 Information loss in the lattice model of summary tables due to cell suppression. In Proceedings of the Second International Conference on Data Engineering. Washington, DC, USA: IEEE Computer Society, ISBN 0-8186-0655-X, 75–83
Palley, M. A. and Simonoff, J. S. 1987 The use of regression methodology for the compromise of confidential information in statistical databases. ACM Transactions on Database Systems, 12, Nr. 4, 593–608
Papageorgiou, H. et al. 2001 A statistical metadata model for simultaneous manipulation of both data and metadata. Journal of Intelligent Information Systems, 17, Nr. 2-3, 169–192, ISSN 0925-9902
Parzen, E. 1962 On the estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076
Pinkas, B. 2002 Cryptographic techniques for privacy preserving data mining. SIGKDD Explorations Newsletter, 4, Nr. 2, 12–19
Poole, W. K. 1974 Estimation of the distribution function of a continuous type random variable through randomized response. Journal of the American Statistical Association, 69, 1002–1005
Poole, W. and Clayton, A. C. 1982 Generalizations of a contamination model for continuous type random variables. Communications in Statistics - Theory and Methods, 11, 1733–1742
Raghunathan, T., Reiter, J. and Rubin, D. 2003 Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19, Nr. 1, 1–16
Ramesh, G., Maniatty, W. and Zaki, M. 2003 Feasible itemset distributions in data mining: theory and application. In Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 284–295
Reiter, J. P. 2002 Satisfying disclosure restrictions with synthetic data sets. Journal of Official Statistics, 18, Nr. 4, 531–543
Reiter, J. P. 2003 Inference for partially synthetic, public use microdata sets. Survey Methodology, 29, Nr. 2, 181–188
Rizvi, S. and Haritsa, J. 2002 Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Data Bases.
Rubin, D. B. 1987 Multiple Imputation for Nonresponse in Surveys.
Volume 1, Wiley
Rubin, D. B. 1993 Discussion: statistical disclosure limitation. Journal of Official Statistics, 9, Nr. 2, 461–468
Samarati, P. 2001 Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13, Nr. 6, 1010–1027, ISSN 1041-4347
Samarati, P. and Sweeney, L. 1998 Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Computer Science Laboratory, SRI International, Technical report. URL: http://www.csl.sri.com/papers/sritr-98-04/
Sande, G. 1983 Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the Second International Workshop on Statistical Database Management. Berkeley, CA, US: Lawrence Berkeley Laboratory, ISBN 1-87654-234-X, 346–354
Sanil, A. P. et al. 2004 Privacy preserving regression modelling via distributed computation. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press New York, NY, USA, 677–682
Sarathy, R. and Muralidhar, K. 2002 The security of confidential numerical data in databases. Information Systems Research, 13, Nr. 4, 389–404
Schafer, J. 1997 Analysis of Incomplete Multivariate Data. Chapman & Hall
Shannon, C. E. 1948 A mathematical theory of communication. Bell System Technical Journal, 27, 379–423
Shannon, C. E. 1949 Communication theory of secrecy systems. Bell System Technical Journal, 28, Nr. 4, 656–715
Stewart, G. W. 1980 The efficient generation of random orthogonal matrices with an application to condition estimation, 403–409
Stewart, G. and Sun, J. 1990 Matrix Perturbation Theory. Academic Press
Sweeney, L. 2002 k-Anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10, Nr. 5, 557–570, ISSN 0218-4885
Tan, V. Y. F. and Ng, S.-K. 2007 Generic probability density function reconstruction for randomization in privacy-preserving data mining. In Machine Learning and Data Mining in Pattern Recognition, 5th International Conference (MLDM 2007).
Volume 4571, Springer, 76–90
Vaidya, J. and Clifton, C. 2002 Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press New York, NY, USA, 639–644
Vaidya, J. and Clifton, C. 2003 Privacy preserving k-means clustering over vertically partitioned data. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 206–215
W3C 2002 Platform for Privacy Preferences (P3P). URL: http://www.w3.org/TR/P3P/
Wang, K., Yu, P. and Chakraborty, S. 2004 Bottom-up generalization: a data mining solution to privacy protection. In Proceedings of the 4th IEEE International Conference on Data Mining. Brighton, UK
Wang, Y., Wu, X. and Zheng, Y. 2004 Privacy preserving data generation for database application performance testing. In Proceedings of the 1st International Conference on Trust and Privacy in Digital Business (TrustBus04), 142–151
Warner, S. L. 1965 Randomized response: a survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60, Nr. 309, 63–69
Weyl, H. 1911 Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen. Mathematische Annalen, 71, 441–479
Wright, R. and Yang, Z. 2004 Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press New York, NY, USA, 713–718
Wu, C. W. 2003 Privacy preserving data mining: a signal processing perspective and a simple data perturbation protocol. In IEEE ICDM Workshop on Privacy Preserving Data Mining, 10–17
Wu, X. et al. 2005a Privacy aware data generation for testing database applications. In Proceedings of the 9th International Database Engineering and Application Symposium, 317–326
Wu, X., Wang, Y. and Zheng, Y.
2003 Privacy preserving database application testing. In Proceedings of the ACM Workshop on Privacy in the Electronic Society, 118–128
Wu, X., Wang, Y. and Zheng, Y. 2005 Statistical database modeling for privacy preserving database generation. In Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems.
Wu, X. et al. 2005b Privacy-aware market basket data set generation: a feasible approach for inverse frequent set mining. In Proceedings of the 5th SIAM International Conference on Data Mining, 103–114
Yao, A. 1982 Protocols for secure computations. In Proceedings of the 23rd Annual Symposium on Foundations of Computer Science, 160–164