ANALYSIS OF AND TECHNIQUES FOR PRIVACY PRESERVING DATA MINING
by
Songtao Guo
A dissertation submitted to the faculty of
The University of North Carolina at Charlotte
in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in
Information Technology
Charlotte
2007
Approved by:
Dr. Yuliang Zheng
Dr. Xintao Wu
Dr. Zbigniew Ras
Dr. Zongwu Cai
Dr. Arun Ravindran
© 2007
Songtao Guo
ALL RIGHTS RESERVED
ABSTRACT
Songtao Guo. Analysis of and techniques for privacy preserving data mining. (Under the direction of Dr. Yuliang Zheng and Dr. Xintao Wu)
Privacy is often considered a social, moral, or legal concept. As the Internet and e-commerce have prospered, privacy has become one of the most important issues in IT and has received increasing attention from enterprises, consumers, and legislators. Although various techniques, such as randomization-based methods, cryptographic methods, and database inference control, have been developed, many key problems remain open in this area. In particular, new privacy and security issues have been identified, and the scope of privacy has expanded. An essential problem in this context is the tradeoff between data utility and disclosure risk. Since previous research conducted only empirical evaluations or limited analysis of existing randomization techniques, a more solid theoretical analysis is needed.
This dissertation investigates different perturbation models in randomization-based privacy preserving data mining. Among them, the additive-noise-based model and the projection-based model are the primary tools. For the additive-noise-based perturbation, the explicit relation between noise and mining accuracy has not been carefully studied. We first propose an improved strategy to reconstruct the data based on a representative method. Then we develop explicit bounds on the reconstruction error. Both the upper bound and the lower bound provide a guideline for balancing the privacy/accuracy tradeoff. We also discuss other potential threats to privacy based on our defined measure for quantifying privacy. For the projection-based perturbation, the properties of different models and possible disclosures within those models are analyzed in detail. In particular, we propose an A-priori Knowledge-based ICA attack (AK-ICA) which is effective against all existing projection models.
Due to the vulnerabilities in previous randomization models, a general-location-model-based approach is proposed. It first builds a statistical model to fit the real data with both categorical and numerical variables, then generates a synthetic data set for mining by tuning the parameters of the model instead of perturbing particular individual values. Since the search space of the model's parameters is much smaller than that of the data, and all the information which attackers can derive is contained in those parameters, this approach is expected to be more effective and efficient. This dissertation investigates privacy issues of the numerical data in this model, wherein disclosure is analyzed and controlled in different scenarios.
ACKNOWLEDGMENTS
Over the past years of pursuing this Ph.D. degree, so many people have contributed, in many different ways, making my success a part of their own. I am so excited at this moment, the moment I can publicly express my thanks to all of them.
First of all, I would like to thank my advisors, Dr. Yuliang Zheng and Dr. Xintao Wu. I am fortunate to have been taken on as their student. It was they who gave me great support, understanding, and encouragement during my study. They taught me to be pro-active in thinking, in learning, and in living an integrated life. I wish to express my deepest gratitude to Dr. Wu for his continuous guidance and support in my research. Without his intellectual input, I would not have completed my doctoral research.
Thanks are also due to Dr. Zbigniew Ras, Dr. Zongwu Cai and Dr. Arun Ravindran, for serving as my committee members and giving me precious advice. They were very supportive during my qualifying exam and dissertation proposal.
In addition to my committee, I am thankful to my co-author, Dr. Yingjiu Li from Singapore Management University, for the fruitful collaboration during my research. True gratitude also goes to all those at the KDD Laboratory and the Laboratory of Information and Infrastructure Security, past and present, for their friendship and camaraderie. In particular, I thank Jing Jing, Ling Guo, Hangzai Luo, Yuli Gao, Dichao Peng, Peng Tang, Yong Ye and Xiaowei Ying for the mind-sparking discussions and suggestions at different phases of this dissertation.
I thank my dear friend Gao Zhang for introducing me to this school and sharing memorable experiences with me over the years. I have been blessed with some great friends who have always been there to share in difficult and joyous occasions: Xiaobin You, Peiqin Zhang, Shan Xie, Guodong Jiao, Wujian Xue, Zixian Wang, Xiaoran Wu, Qiang Shi, Yunfeng Sui, Alex Xiao, Su Dong, and Dingxiang Liu. I also extend my thanks to Jane and Wayne, who have welcomed me into their home and treated me like their child.
I feel most indebted to my parents, who gave me the most important education in life. I thank my sister, Kelan, for her unwavering support and prayers. They always stood by me whenever I needed them the most. Without their love and support from thousands of miles away, I would not have been able to come this far. I would like to dedicate this dissertation to my family.
My research was supported by U.S. NSF Grant IIS-0546027 and NSF Grant CCR-0310974. I was also supported by the Department of Software and Information Systems.
TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . xi
LIST OF FIGURES . . . . . . . . . . xii
CHAPTER 1: INTRODUCTION . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . 1
1.2 Research Statement . . . . . . . . . . 3
1.3 Dissertation Contributions . . . . . . . . . . 3
1.4 Dissertation Organization . . . . . . . . . . 4
CHAPTER 2: BACKGROUND AND RESEARCH ISSUES . . . . . . . . . . 6
2.1 Privacy Preserving Data Mining and its Applications . . . . . . . . . . 6
2.1.1 Secure Multi-Party Computation . . . . . . . . . . 7
2.1.2 Data Randomization . . . . . . . . . . 8
Additive-Noise-Based Perturbation . . . . . . . . . . 9
Projection-Based Perturbation . . . . . . . . . . 10
Randomized Response . . . . . . . . . . 11
2.1.3 Data Imputation and Synthesis . . . . . . . . . . 12
Data Swapping . . . . . . . . . . 13
Data Suppression . . . . . . . . . . 13
Data Synthesis . . . . . . . . . . 13
2.2 Research Issues in Preserving Privacy for Numerical Data . . . . . . . . . . 14
2.2.1 Issues in Additive-Noise-Based Perturbation . . . . . . . . . . 14
2.2.2 Issues in Projective-Transformation-Based Perturbation . . . . . . . . . . 15
2.2.3 Issues in Model-Based Privacy Preserving Data Mining . . . . . . . . . . 16
2.3 Summary . . . . . . . . . . 17
CHAPTER 3: DISCLOSURE ANALYSIS OF THE ADDITIVE-NOISE-BASED PERTURBATION . . . . . . . . . . 18
3.1 Additive-Noise-Based Perturbation Model . . . . . . . . . . 18
3.2 Data Reconstruction Attacks . . . . . . . . . . 20
3.2.1 Spectral Filtering Method . . . . . . . . . . 20
3.2.2 PCA-Based Reconstruction Method . . . . . . . . . . 21
3.2.3 MLE-Based Reconstruction Method . . . . . . . . . . 22
3.2.4 Privacy Issues . . . . . . . . . . 23
3.3 An Improved Strategy for Noise Filtering . . . . . . . . . . 25
3.4 Upper Bound Analysis . . . . . . . . . . 28
3.5 Lower Bound Analysis . . . . . . . . . . 37
3.5.1 SVD-Based Reconstruction Method . . . . . . . . . . 37
3.5.2 Lower Bound . . . . . . . . . . 39
3.5.3 Equivalence of Two Reconstruction Methods . . . . . . . . . . 40
3.6 Potential Attack Based on Distribution . . . . . . . . . . 41
3.6.1 Quantification of Privacy . . . . . . . . . . 42
3.6.2 Extension to Multiple Confidential Attributes . . . . . . . . . . 44
3.7 Evaluation . . . . . . . . . . 45
3.7.1 Scenario of Adding Noise . . . . . . . . . . 45
3.7.2 Effect of Varying the Number of Principal Components . . . . . . . . . . 48
3.7.3 Effect of Varying Noise . . . . . . . . . . 50
3.7.4 Effect of Covariance Matrix of the Noise . . . . . . . . . . 51
3.7.5 Utility . . . . . . . . . . 51
3.7.6 Lower Bound vs. Privacy Threshold . . . . . . . . . . 53
3.7.7 Evaluation of IQR Attack . . . . . . . . . . 54
3.8 Summary . . . . . . . . . . 59
CHAPTER 4: DISCLOSURE ANALYSIS OF THE PROJECTION-BASED PERTURBATION . . . . . . . . . . 61
4.1 Projection-Based Perturbation Models . . . . . . . . . . 62
4.1.1 Distance-Preserving-Based Projection . . . . . . . . . . 62
4.1.2 Non-Distance-Preserving-Based Projection . . . . . . . . . . 65
4.1.3 The General-Linear-Transformation-Based Perturbation . . . . . . . . . . 67
4.2 Direct Attack . . . . . . . . . . 68
4.2.1 ICA Revisited . . . . . . . . . . 68
4.2.2 Drawbacks of Direct ICA . . . . . . . . . . 70
4.3 Sample-Based Attack . . . . . . . . . . 71
4.3.1 Attacks for Distance-Preserving-Based Projection . . . . . . . . . . 71
Known-Sample-Based Regression Attack . . . . . . . . . . 71
Known-Sample-Based PCA Attack . . . . . . . . . . 72
4.3.2 Attacks for Non-Distance-Preserving-Based Projection . . . . . . . . . . 72
AK-ICA Attack . . . . . . . . . . 73
Existence of Transformation Matrix J . . . . . . . . . . 74
Determining J . . . . . . . . . . 76
4.3.3 Attacks for General Projection . . . . . . . . . . 79
4.4 Evaluation . . . . . . . . . . 81
4.4.1 Effect of Noise and the Transformation Matrix . . . . . . . . . . 82
4.4.2 Effect of the Sample Size . . . . . . . . . . 84
4.4.3 Comparing AK-ICA and Known-Sample-Based PCA Attack . . . . . . . . . . 87
4.4.4 Comparing AK-ICA and Spectral-Filtering-Based Attack . . . . . . . . . . 88
4.5 Summary . . . . . . . . . . 90
CHAPTER 5: DISCLOSURE ANALYSIS OF THE MODEL-BASED PRIVACY PRESERVING APPROACH . . . . . . . . . . 92
5.1 The General Location Model Revisited . . . . . . . . . . 94
5.2 Disclosure Controls for Numerical Data . . . . . . . . . . 95
5.2.1 Basic Disclosure Scenario . . . . . . . . . . 96
5.2.2 Conditional Scenario . . . . . . . . . . 101
5.2.3 Combination Scenario . . . . . . . . . . 102
5.3 Summary . . . . . . . . . . 104
CHAPTER 6: CONCLUSIONS AND FUTURE WORK . . . . . . . . . . 105
6.1 Summary . . . . . . . . . . 105
6.2 Contributions . . . . . . . . . . 106
6.3 Future Research . . . . . . . . . . 107
REFERENCES . . . . . . . . . . 110
LIST OF TABLES

TABLE 1.1: Personal information of n customers . . . . . . . . . . 2
TABLE 3.1: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the PATTERNS data set. The values with ∗ denote the results following Strategy 2, while the values with † denote the results following Strategy 1. The bold values indicate the best estimations achieved by the Spectral Filtering technique. . . . . . . . . . . 46
TABLE 3.2: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the ADULT data set . . . . . . . . . . 47
TABLE 3.3: Utility of Reconstructed Adult Data with Type 1 Noise . . . . . . . . . . 53
TABLE 3.4: Stock/bonds from Bank data set with Uniform noise [-125,125], disclosure with 95% IQR, information loss for AS is 14.6% . . . . . . . . . . 55
TABLE 3.5: Sinusoidal with Gaussian noise (0,8) using AS and SF methods . . . . . . . . . . 57
TABLE 4.1: Reconstruction error vs. SNR for four cases when k = 1000 . . . . . . . . . . 83
TABLE 4.2: Reconstruction error vs. sample size (k) when Y = RX . . . . . . . . . . 85
TABLE 4.3: Reconstruction error of AK-ICA vs. PCA attacks by varying R . . . . . . . . . . 89
TABLE 4.4: Reconstruction error vs. SNR for spectral filtering method when Y = X + E . . . . . . . . . . 91
LIST OF FIGURES

FIGURE 3.1: Distribution Reconstruction Algorithm . . . . . . . . . . 19
FIGURE 3.2: Spectral Filtering Process . . . . . . . . . . 21
FIGURE 3.3: PCA-Based Reconstruction Method . . . . . . . . . . 22
FIGURE 3.4: MLE-Based Reconstruction Method . . . . . . . . . . 23
FIGURE 3.5: SVD-Based Reconstruction Algorithm . . . . . . . . . . 38
FIGURE 3.6: Reconstruction accuracy (data distribution for attribute 2) vs. varying k with σ² = 0.5 . . . . . . . . . . 49
FIGURE 3.7: Reconstruction accuracy (point-wise data distribution for attribute 2) with best k vs. varying noise magnitude . . . . . . . . . . 51
FIGURE 3.8: Reconstruction accuracy (data distribution for attribute 2) with ‖E‖ = 323 under three cases . . . . . . . . . . 52
FIGURE 3.9: Utility vs. varying noise with type 1 . . . . . . . . . . 53
FIGURE 3.10: Utility vs. varying noises of three types . . . . . . . . . . 54
FIGURE 3.11: Achieved reconstruction accuracy vs. varying privacy threshold τ . . . . . . . . . . 55
FIGURE 3.12: Reconstructed stock/bonds from Bank data set using AS algorithm. The noise is Uniform distribution [-125,125]. . . . . . . . . . . 56
FIGURE 3.13: Disclosure of Bank distribution with Uniform noise (AS algorithm) . . . . . . . . . . 57
FIGURE 3.14: Disclosure analysis on Sinusoidal with Gaussian noise (0,8) using AS and SF methods . . . . . . . . . . 58
FIGURE 3.15: Stock/Bonds of Bank data set perturbed using Uniform distribution . . . . . . . . . . 59
FIGURE 4.1: Example of rotation-based perturbation . . . . . . . . . . 64
FIGURE 4.2: Known-Sample-Based PCA Attack . . . . . . . . . . 73
FIGURE 4.3: AK-ICA Attack . . . . . . . . . . 74
FIGURE 4.4: Distribution of component . . . . . . . . . . 78
FIGURE 4.5: The effect of noise E for RE . . . . . . . . . . 84
FIGURE 4.6: Reconstruction error vs. varying known sample size k under Y = RX . . . . . . . . . . 86
FIGURE 4.7: Reconstruction error vs. random samples with fixed size k = 50 . . . . . . . . . . 86
FIGURE 4.8: Reconstruction error of AK-ICA vs. PCA attacks by varying R . . . . . . . . . . 88
FIGURE 4.9: Reconstruction error vs. SNR for SF and AK-ICA (with fixed size k = 1000) when Y = X + E . . . . . . . . . . 90
FIGURE 5.1: A constant density contour for a bi-variate normal distribution . . . . . . . . . . 97
FIGURE 5.2: Confidence Intervals . . . . . . . . . . 98
FIGURE 5.3: Density contour with varied covariance matrix . . . . . . . . . . 102
CHAPTER 1: INTRODUCTION
1.1 Motivation
With the advance of the information age, data collection and data analysis have exploded in both size and complexity. The attempt to extract important patterns and trends from vast data sets has led to a challenging field called Data Mining. When a complete data set is available, various statistical, machine learning, and data mining techniques can be applied to analyze the data.
Sensitive data usually includes information regarding individuals' physical or mental health, financial privacy, etc. In the third-party context, a single party (the data holder) holds a collection of original individual data with privacy concerns. The data holder can utilize the data or release it to a third party for analysis; however, it is required not to disclose any private information. For example, a company collects its employees' personal data (e.g., income, age, etc.) and needs to release this data set to a third party for analysis. Since each employee is concerned about the privacy of his or her personal data, the company must find ways to release the data while guaranteeing that no private individual information can be derived by attackers or snoopers.
Another context involves end data providers as clients and a data collector as the server. The end data providers would like to share their data for analysis; however, preserving their privacy is equally important. The server mainly aims to extract patterns from the data or from the distribution of the data. The aggregate information learned from the collection might be rich enough for its mining tasks. During the process of data collection, the individual data is randomized before it reaches the server. By combining the parameters of the randomization with the perturbed data, aggregate statistical properties of the original data can be derived to support the mining tasks.
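The aggregate-recovery idea in this collection model can be sketched as follows. The data, noise level, and distributions below are illustrative assumptions of our choosing, not values taken from this dissertation; the point is only that a server knowing the randomization parameters can undo the noise in aggregate without seeing any individual value:

```python
import numpy as np

rng = np.random.default_rng(7)

# Original sensitive values held by n clients (e.g., incomes in $1,000);
# these are never sent to the server directly.
n = 100_000
x = rng.normal(loc=60.0, scale=10.0, size=n)

# Each client adds independent noise drawn from a publicly known distribution
# N(0, sigma_e^2) before sending its value to the server.
sigma_e = 25.0
y = x + rng.normal(0.0, sigma_e, size=n)

# The server sees only y plus the randomization parameter sigma_e. Aggregate
# statistics of x remain recoverable: E[x] = E[y], Var(x) = Var(y) - sigma_e^2.
est_mean = y.mean()                 # close to the true mean 60
est_var = y.var() - sigma_e**2      # close to the true variance 10^2 = 100
assert abs(est_mean - 60.0) < 1.0
assert abs(est_var - 100.0) < 15.0
```

Note that although each individual value is heavily masked (the noise standard deviation exceeds that of the data), the aggregate estimates stay accurate because the noise averages out over many records.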
In the third context, data are distributed across different sites. Traditionally, data warehousing approaches can be used to mine distributed databases. This requires that data from all the participating sites be collected at a centralized warehouse. However, many data owners may be reluctant to share their data with others due to privacy and confidentiality concerns. This is a serious impediment to performing mutually beneficial data mining tasks.
Privacy-Preserving Data Mining (PPDM) has emerged to address the issues arising in the above contexts. Research in PPDM is aimed at bridging the gap between collaborative data mining and data confidentiality. It involves many areas such as statistics, computer science, and the social sciences. It is of fundamental importance to homeland security, modern science, and our society in general.
Table 1.1: Personal information of n customers

ID | SSN | Name | Zip   | Race  | ··· | Age | Gender | Balance ($1,000) | Income ($1,000) | ··· | Interest Paid ($1,000)
1  | *** | ***  | 28223 | Asian | ··· | 20  | M      | 10               | 85              | ··· | 2
2  | *** | ***  | 28223 | Asian | ··· | 30  | F      | 15               | 70              | ··· | 18
3  | *** | ***  | 28262 | Black | ··· | 20  | M      | 50               | 120             | ··· | 35
4  | *** | ***  | 28261 | White | ··· | 26  | M      | 45               | 23              | ··· | 134
·· | ··· | ···  | ···   | ···   | ··· | ··· | ···    | ···              | ···             | ··· | ···
n  | *** | ***  | 28223 | Asian | ··· | 20  | M      | 80               | 110             | ··· | 15
Table 1.1 provides an example of n customers' original personal information, which includes various attributes. Disclosures that can occur as a result of inferences by snoopers fall into two classes: identity disclosure and value disclosure. Identity disclosure relates to the disclosure of the identities of individuals in the database, while value disclosure relates to the disclosure of the value of a certain confidential attribute of those individuals. There is no doubt that identity attributes, such as SSN and Name, should be masked to protect privacy before the data is released. However, some categorical attributes, such as Zip, Race, and Gender, can also be used to identify individuals by linking them to some publicly available data set. Those attributes are hence called Quasi-Identifiers [Samarati 2001]. There has been much research on how to prevent identity disclosure, such as the well-known statistical disclosure control (SDC) methods [Adam and Wortman 1989; Malvestuto, Moscarini and Rafanelli 1991; Domingo-Ferrer and Mateo-Sanz 2002; Domingo-Ferrer and Torra 2003], k-anonymity [Samarati and Sweeney 1998; Samarati 2001; Sweeney 2002; LeFevre, DeWitt and Ramakrishnan 2006], ℓ-diversity [Machanavajjhala et al. 2006] and t-closeness [Li, Li and Venkatasubramanian 2007]. To prevent value disclosures, various randomization-based approaches (e.g., [Agrawal and Srikant 2000; Palley and Simonoff 1987; Sarathy and Muralidhar 2002; Rizvi and Haritsa 2002; Du and Zhan 2003; Oliveira and Zaiane 2004; Chen and Liu 2005; Liu, Kargupta and Ryan 2006]) have been investigated.
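The quasi-identifier linkage risk can be made concrete with a toy sketch. All records below are invented; the join key (Zip, Age, Gender) mirrors the quasi-identifiers discussed above:

```python
# Released table: explicit identifiers removed, quasi-identifiers kept.
released = [
    {"zip": "28223", "age": 28, "gender": "F", "income": 85},
    {"zip": "28262", "age": 41, "gender": "M", "income": 120},
]

# Publicly available list with identities (e.g., a voter-roll-style data set).
public = [
    {"name": "Alice", "zip": "28223", "age": 28, "gender": "F"},
    {"name": "Bob",   "zip": "28205", "age": 41, "gender": "M"},
    {"name": "Craig", "zip": "28262", "age": 41, "gender": "M"},
]

def reidentify(released, public):
    """Join on the quasi-identifier (zip, age, gender): any unique match
    re-identifies the individual and discloses the confidential attribute."""
    matches = []
    for r in released:
        key = (r["zip"], r["age"], r["gender"])
        hits = [p["name"] for p in public
                if (p["zip"], p["age"], p["gender"]) == key]
        if len(hits) == 1:           # a unique match means identity disclosure
            matches.append((hits[0], r["income"]))
    return matches

print(reidentify(released, public))  # → [('Alice', 85), ('Craig', 120)]
```

Both "anonymous" records are re-identified here, which is exactly why masking SSN and Name alone is insufficient and techniques such as k-anonymity are needed.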
1.2 Research Statement
After data miners collect large amounts of private data from data providers, the data might be perturbed in different ways in order to avoid privacy disclosure while keeping useful patterns for further data mining. The focus of this dissertation is to utilize formal methods to analyze various perturbation models and to explore the balance between data utility and disclosure risk in privacy preserving data mining. Specifically, my dissertation aims to
1. analyze the accuracy of estimations for various perturbation models;
2. explore potential attacks for existing models and evaluate their performance;
3. design models to control privacy and data utility for privacy preserving applications.
1.3 Dissertation Contributions
As parts of a novel framework for privacy preserving data mining, the main contributions of my research are summarized as follows:
1. Bound analysis of the accuracy of value reconstruction techniques. In particular, we first derive an upper bound for the Frobenius norm of the reconstruction error using matrix perturbation theory. This upper bound may be exploited by attackers to determine how close their spectral-filtering-based estimates are to the original data, which poses a serious threat of privacy breaches. We then derive a lower bound for the reconstruction error, which can help data owners determine how much noise should be added to satisfy a given threshold of tolerated privacy breach. In addition, an improved data reconstruction strategy for noise filtering is given. In the context of the additive-noise-based perturbation, we develop a new strategy that compares the benefit due to the inclusion of one component with the loss due to the additional projected noise. We show that this strategy is expected to give an approximately optimal reconstruction from the perturbed data.
2. An effective attacking method to break projection-based perturbation. By exploiting a known small subset of the original data, which is a reasonable assumption in practice, our algorithm, AK-ICA, can effectively estimate the whole original data set with high accuracy. This attacking method is also robust to arbitrary projection-based perturbation: all previous perturbation methods in this context are vulnerable to our attack. Therefore, current projection-based privacy preserving data mining techniques may need careful scrutiny in order to prevent privacy breaches when a subset of sample data is available.
3. A measure to quantify individual privacy disclosure. We propose a way to measure how close the inter-quantile range obtained by attackers or snoopers is to an individual's privacy interval for a particular sensitive variable. We also extend this measure to the multivariate case based on the confidential region.
4. Disclosure control methods for various scenarios in model-based privacy preserving data mining. General databases typically contain numerous attributes with different privacy concerns. To satisfy different privacy requirements from data providers, we analyze potential privacy disclosures in several scenarios and find ways to adjust the parameters of the model learned from the underlying data.
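Contribution 3 can be sketched in simplified form. The quantile choices and the interval rule below are illustrative assumptions of our own, not the precise measure defined in Chapter 3; the sketch only conveys the idea that an attacker's inter-quantile range landing inside an individual's privacy interval constitutes a disclosure:

```python
def interval(values, lo_q=0.025, hi_q=0.975):
    """Inter-quantile range of an attacker's estimates for one confidential value
    (nearest-rank quantiles, for simplicity)."""
    vs = sorted(values)
    lo = vs[int(lo_q * (len(vs) - 1))]
    hi = vs[int(hi_q * (len(vs) - 1))]
    return lo, hi

def discloses(attacker_iv, privacy_iv):
    """Simplified rule: if the attacker's interval pins the value down inside the
    individual's tolerated privacy interval, count it as a disclosure."""
    (a_lo, a_hi), (p_lo, p_hi) = attacker_iv, privacy_iv
    return a_lo >= p_lo and a_hi <= p_hi

# True salary 100 (in $1,000); the attacker's reconstructed estimates cluster
# tightly around it, while the individual tolerates only the interval [90, 110].
estimates = [97, 98, 99, 100, 100, 101, 102, 103]
iv = interval(estimates)
print(iv, discloses(iv, (90, 110)))  # → (97, 102) True
```

A wide, uninformative attacker interval such as (50, 150) would not count as a disclosure under this rule, which matches the intuition that privacy is breached only when the attacker's estimate is both accurate and confident.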
1.4 Dissertation Organization
The rest of this dissertation is organized as follows:
In Chapter 2, current research on privacy preserving data mining is briefly reviewed. Various models in randomization-based privacy preserving data mining, including additive-noise-based perturbation, projection-based perturbation, randomized response, and model-based perturbation, are introduced, and the research issues within those models are outlined.
In Chapter 3, the additive-noise-based randomization techniques are analyzed. To preserve individual privacy, this model perturbs the data by introducing additive noise. The chapter first reviews how distributions can be learned from the randomized data, followed by the data mining algorithms used to reconstruct individual data, which therefore act as potential threats to privacy. It then answers three important questions by carefully analyzing a representative data reconstruction method, the spectral-filtering-based method: What is the best strategy to reconstruct the original data based on this method? What is the upper bound of the reconstruction error? What is the lower bound of the reconstruction error? As a further potential threat to privacy in this model, another attacking method, the IQR-based attack, is also proposed in this chapter.
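The spectral filtering idea underlying this analysis can be sketched roughly as follows. This is a simplified illustration on synthetic data with an ad-hoc eigenvalue threshold of 2σ², not the exact procedure or the bounds derived in Chapter 3: the attacker eigendecomposes the covariance of the perturbed data Y = X + E and projects Y onto the eigenvectors whose eigenvalues stand out above the known noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated original data X (n records, m attributes) and additive noise E.
n, m, sigma = 5000, 4, 0.5
base = rng.normal(size=(n, 1))
X = base @ np.ones((1, m)) + 0.1 * rng.normal(size=(n, m))  # strongly correlated
Y = X + rng.normal(0.0, sigma, size=(n, m))                 # perturbed release

# Spectral filtering sketch: eigendecompose the sample covariance of Y and keep
# only components whose eigenvalues clearly exceed the noise variance sigma^2
# (threshold 2*sigma^2 is an arbitrary choice for this illustration).
Yc = Y - Y.mean(axis=0)
vals, vecs = np.linalg.eigh(np.cov(Yc, rowvar=False))
signal = vecs[:, vals > 2 * sigma**2]       # eigenvectors spanning the signal space
X_hat = Yc @ signal @ signal.T + Y.mean(axis=0)

# The filtered estimate is much closer to X than the raw perturbed data,
# illustrating why additive noise alone can fail to protect individual values.
re_raw = np.linalg.norm(Y - X) / np.linalg.norm(X)
re_hat = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
assert re_hat < re_raw
```

The attack works precisely because real data are correlated: the signal concentrates in a few principal directions while isotropic noise spreads evenly, so discarding the noise-dominated directions removes most of the noise but little of the data.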
In Chapter 4, the projection-based randomization technique is analyzed. Different from additive-noise-based randomization, the projection-based approach randomizes the original data through a linear transformation, in the form of a projection matrix applied to the original data matrix. Two typical types of projections and their properties are introduced at the beginning of the chapter: distance-preserving-based and non-distance-preserving-based projections. Vulnerabilities of the different projection models are then discussed and evaluated, with sections on the direct attack and sample-based attacks. In particular, this chapter offers an attacking method, called A-priori-Knowledge-ICA (AK-ICA), which is effective against all projection-based randomization models.
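The distinction between the two projection types can be illustrated with a small sketch (synthetic data of our choosing, using the Y = RX convention with records as columns): an orthogonal R preserves all pairwise distances between records, while a general R does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Original data: m attributes (rows) by n records (columns), perturbed as Y = R X.
m, n = 3, 6
X = rng.normal(size=(m, n))

# A random orthogonal R (via QR decomposition) is a distance-preserving projection.
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
Y = Q @ X

# Pairwise distances between records are exactly preserved, which keeps many
# mining results (e.g., clustering) intact but also aids known-sample attacks.
d_before = np.linalg.norm(X[:, 0] - X[:, 1])
d_after = np.linalg.norm(Y[:, 0] - Y[:, 1])
assert abs(d_before - d_after) < 1e-9

# A general non-orthogonal R (non-distance-preserving) breaks this equality.
R = rng.normal(size=(m, m))
Z = R @ X
assert abs(np.linalg.norm(Z[:, 0] - Z[:, 1]) - d_before) > 1e-6
```

This tradeoff is the theme of Chapter 4: distance preservation buys utility but leaks geometric structure, while non-distance-preserving projections hide that structure yet, as the AK-ICA attack shows, still remain vulnerable when a sample of the original data is known.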
In Chapter 5, a general-location-model-based approach to privacy preserving data mining is proposed. The General Location Model acts as an efficient tool to model real-life databases for privacy preserving applications. For the numerical data in this model, how to analyze and control privacy is discussed in detail under three different scenarios.
Chapter 6 concludes this dissertation with a brief summary of the research presented and offers future directions.
CHAPTER 2: BACKGROUND AND RESEARCH ISSUES
Perfect privacy can be achieved without sharing any data, but it offers no utility; perfect utility can be provided by publishing the exact data collected from our lives, but it sacrifices privacy. The "inevitable conflict between the individual's right to privacy and the society's need to know and process information" has been addressed since the 1970s in the database and statistics communities [Chang and Moskowitz 2000; Duncan and Mukherjee 2000; Evans, Zayatz and Slanta 1998; Fienberg, Makov and Steele 1998; Gouweleeuw et al. 1998; Mukherjee and Duncan 1997]. In recent decades, significant efforts have been spent on regulation [Congress 1999; Commission 1998b; Congress 1996; Commission 1998a], privacy policy description [Karjoth and Schunter 2002; Fischer-Hübner 2001; Backes, Pfitzmann and Schunter 2003; W3C 2002] and implementation [Ashley et al. 2002; Karjoth, Schunter and Waidner 2002]. An increasing number of enterprises make privacy promises to meet customer demand or to implement privacy regulations.
Privacy issues in data mining began to be investigated in the late 1990s. Over the past several years, a growing number of successful techniques have been proposed in the literature to obtain valid data mining results while preserving privacy at different levels. This chapter reviews the existing privacy preserving data mining techniques and outlines the important research issues addressed in this dissertation.
2.1 Privacy Preserving Data Mining and its Applications
We classify representative privacy preserving data mining techniques into several categories. Generally, there are three approaches: Secure Multi-party Computation (SMC), Data Randomization, and Data Imputation and Synthesis.
2.1.1 Secure Multi-Party Computation
Secure Multiparty Computation (SMC) is a technique addressing the problem of computing a joint function over multiple inputs. Each party in a distributed environment keeps one part of the private inputs. SMC ensures that no more information is disclosed to any party than the output of the joint function and its own share of the input. The problem of SMC was first formulated by Yao [Yao 1982] and extended by Goldreich et al. [Goldreich, Micali and Wigderson 1987], and by many others.
In the ideal model, all parties send their inputs to a trusted third party (TTP), who then performs the computations and delivers only the results to the other parties. In the semi-honest model, the adversary correctly follows the protocol, with the exception that it attempts to learn additional information by analyzing all the intermediate computations. In the malicious model, the adversary may arbitrarily deviate from the protocol specification (e.g., aborting or suspending the computation).
Assume that F is a well-known deterministic function, and denote by F-result a result computed by F, r = F(x1, x2, ..., xn). According to [Benenson, Freiling and Kesdogan 2005], a protocol solves secure multiparty computation (SMC) if it has the following properties:
1. (SMC-Validity) If a process receives an F-result, then F was computed with at least the inputs of all correct processes.
2. (SMC-Agreement) If some process pi receives F-result ri and some process pj receives F-result rj, then ri = rj.
3. (SMC-Termination) Every correct process eventually receives an F-result.
4. (SMC-Privacy) Faulty processes learn nothing about the input values of correct processes (apart from what is given away by the result r and the input values of all faulty processes).
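As a minimal illustration of these properties, the classic secure-sum protocol (an example of our own choosing, not one of the schemes cited below) lets k parties compute r = F(x1, ..., xk) = x1 + ... + xk without any party revealing its input: each party splits its value into random additive shares, one per party, so no single message discloses an individual value.

```python
import random

def secure_sum(inputs, modulus=2**31, seed=42):
    """Additive secret sharing among len(inputs) parties. Party i splits x_i into
    random shares that sum to x_i mod M, sending one share to each party; each
    party then publishes only the sum of the shares it received. Individual
    inputs stay hidden, yet the grand total is computed exactly."""
    rng = random.Random(seed)       # stands in for each party's local randomness
    k = len(inputs)
    received = [0] * k
    for x in inputs:
        shares = [rng.randrange(modulus) for _ in range(k - 1)]
        shares.append((x - sum(shares)) % modulus)   # shares sum to x (mod M)
        for j, s in enumerate(shares):
            received[j] = (received[j] + s) % modulus
    # Summing the published partial sums recovers F(x_1, ..., x_k).
    return sum(received) % modulus

print(secure_sum([10, 20, 30]))  # → 60
```

Each published partial sum is statistically independent of any individual input (SMC-Privacy), every party obtains the same total (SMC-Agreement), and the total equals F over all inputs (SMC-Validity).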
Several SMC-based privacy-preserving data mining schemes have been proposed [Lindell and Pinkas 2002; Pinkas 2002; Vaidya and Clifton 2002; Vaidya and Clifton 2003; Clifton et al. 2003]. Lindell and Pinkas [Lindell and Pinkas 2002] introduced SMC for classification over horizontally partitioned data using the ID3 algorithm. Vaidya and Clifton proposed solutions to the clustering problem [Vaidya and Clifton 2003] and the association rule mining problem [Vaidya and Clifton 2002] for vertically partitioned data. Several SMC tools and fundamental techniques have also been proposed in the literature [Pinkas 2002; Clifton et al. 2003]. More schemes were presented in recent conferences, as follows. Wright et al. [Wright and Yang 2004] and Meng et al. [Meng, Sivakumar and Kargupta 2004] used SMC to solve privacy-preserving Bayesian network problems. Gilburd et al. proposed a new privacy model, k-privacy, for real-world large-scale distributed systems [Gilburd, Schuster and Wolff 2004]. Sanil et al. described a privacy-preserving algorithm for computing regression coefficients [Sanil et al. 2004]. Du et al. developed building blocks to solve secure two-party multivariate linear regression and classification problems [Du, Han and Chen 2004]. Wang et al. used an iterative bottom-up generalization to generate data which remains useful for classification but makes it difficult to disclose private sources [Wang, Yu and Chakraborty 2004]. SMC provides a good research framework for conducting computations among multiple parties while maintaining the privacy of each party's input. However, all known methods for secure multiparty computation rely on the use of a circuit to simulate the particular function, which becomes the efficiency bottleneck. Even with some improvements [Gennaro, Rabin and Rabin 1998], the computational costs for problems of interest remain high, and the impact on real-world applications has been negligible.
2.1.2 Data Randomization
In the randomization approach, random noise is added or transformations are applied to the original data, and only the disguised data are shared [Agrawal and Srikant 2000; Agrawal and Agrawal 2001; Rizvi and Haritsa 2002; Evfimievski et al. 2002; Du and Zhan 2003]. Representative randomization methods include additive-noise-based perturbation, projection-based perturbation, and the Randomized Response scheme.
Additive-Noise-Based Perturbation
Agrawal and Srikant proposed a scheme for privacy-preserving data mining using random perturbation [Agrawal and Srikant 2000]. In their randomization scheme, a random
number is added to the value of a sensitive attribute. For example, if xi is the value of a sensitive attribute, xi + ri, rather than xi, will appear in the database, where ri is a random value drawn from some distribution. It is shown that, given the distribution of the random noise, recovering the distribution of the original data is possible. The authors in [Agrawal and Agrawal 2001] solved the above problem with an Expectation Maximization (EM) estimation algorithm [Dempster, Laird and Rubin 1977; G. J. McLachlan 1998], which has better convergence properties. They showed that their estimation is able to converge
to the maximum likelihood estimate (MLE). Wu further optimized the computation of
the reconstruction algorithm by using a signal processing approach [Wu 2003]. Tan et al. [Tan and Ng 2007] proposed a two-step non-iterative distribution reconstruction algorithm based on Parzen-window reconstruction [Parzen 1962] and Quadratic Programming over a convex set [Kozlov, Tarasov and Khachian 1979]. The algorithm avoids the cost of the iterations in EM, and it is also proven to be generic for many randomization models that satisfy a given form. The randomization techniques have been used for a
variety of privacy preserving data mining tasks [Agrawal and Agrawal 2001; Rizvi and
Haritsa 2002; Evfimievski et al. 2002; Du and Zhan 2003]. Under this scheme, Evfimievski et al. proposed an approach to conduct privacy-preserving association rule mining [Evfimievski et al. 2002].
Kargupta et al. challenged the randomization schemes, and pointed out that randomization might not be secure [Kargupta et al. 2003]. They also proposed a random
matrix-based Spectral Filtering (SF) technique to recover the original data from the perturbed data. Huang et al. further proposed two other data reconstruction methods:
PCA-DR and MLE-DR in [Huang, Du and Chen 2005]. The former one is based on the
Principal Component Analysis (PCA), while the latter one uses Maximum Likelihood
Estimation (MLE). Their results showed that the recovered data can be reasonably close
to the original data. However, the best strategy an attacker might choose, and how close the recovered data might be to the original, are not addressed in their work. Motivated by the spectral-filtering-based method, Guo et al. [Guo, Wu and Li 2006b] improved the spectrum selection strategy to achieve optimal performance.
In addition, they theoretically analyzed the spectral-filtering-based method and bounded the reconstruction error, which is meaningful to both data miners and attackers [Guo and Wu 2006; Guo, Wu and Li 2006b]. Guo et al. also challenged the additive-noise-based model by proposing an IQR attack in [Guo, Wu and Li 2006a]. According to their privacy quantification, individual privacy may be threatened by the estimated distribution. All the above results indicate that, for certain types of data, additive-noise-based randomization might not preserve privacy as much as we expected.
Projection-Based Perturbation
The projection-based perturbation model can be described by

Y = RX                                  (2.1)

where X ∈ Rp×n is the original data set consisting of n data records and p attributes, Y ∈ Rq×n is the transformed data set consisting of n data records and q attributes, and R is a q × p transformation matrix. In this study, we shall assume q = p = d for convenience.
In [Chen and Liu 2005], the authors defined a rotation-based perturbation method, where the transformation matrix R is a d × d orthonormal matrix satisfying R^T R = RR^T = I. The key feature of the rotation transformation is that it preserves vector lengths, Euclidean distances, and inner products between any pair of points. Intuitively, rotation preserves geometric shapes such as hyperplanes and hyper-curved surfaces in the multidimensional space. It was proved in [Chen and Liu 2005] that three popular classifiers (the kernel method, SVM, and hyperplane-based classifiers) are invariant to the rotation-based perturbation.
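To make the invariance concrete, here is a minimal NumPy sketch (our own illustration, not code from [Chen and Liu 2005]); a random orthonormal R is drawn via QR decomposition and the preserved quantities are checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5
X = rng.normal(size=(d, n))                   # d attributes, n records (columns)

# Draw a random d x d orthonormal matrix via QR decomposition
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = Q @ X                                     # rotation perturbation Y = RX

# Vector lengths and inner products (hence distances) are preserved exactly
assert np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0))
assert np.allclose(Y.T @ Y, X.T @ X)
```

Since Y^T Y = X^T Q^T Q X = X^T X, any classifier that depends on the data only through inner products is unaffected by the rotation.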
Previously, the authors in [Oliveira and Zaiane 2004] defined another rotation-based data perturbation function that distorts the attribute values of a given data matrix to preserve the privacy of individuals. The perturbation matrix R they used is an orthonormal matrix when there is an even number of attributes. If there is an odd number of attributes, according to their scheme, the remaining one is distorted along with any previously distorted attribute, as long as some condition is satisfied.
Observing vulnerabilities of the above distance-preserving projection, Liu et al. [Liu, Giannella and Kargupta 2006] discussed possible attacks, including a known-input attack based on linear regression and a known-sample attack based on principal component analysis. Liu et al. [Liu, Kargupta and Ryan 2006] further proposed a random-projection-based multiplicative perturbation scheme and applied it to privacy preserving distributed data mining. The random matrix Rk×m is generated such that each entry ri,j of R is independently and identically chosen from some normal distribution with mean zero and variance σr². Thus, the following properties of the projection matrix are achieved:

E[R^T R] = kσr² I
E[RR^T] = mσr² I
If two data sets X1 and X2 are perturbed as Y1 = (1/(√k σr))RX1 and Y2 = (1/(√k σr))RX2 respectively, then the inner product of the original data sets is preserved from the statistical point of view:

E[Y1^T Y2] = X1^T X2
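The expectation above can be checked numerically; the following NumPy sketch (our own illustration, with arbitrary dimensions and parameters) averages Y1^T Y2 over many independently drawn projection matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 4, 3, 2                  # original dims, projected dims, records per set
sigma_r = 2.0
X1 = rng.normal(size=(m, n))
X2 = rng.normal(size=(m, n))

# Draw many independent projection matrices R and average Y1^T Y2;
# the average should converge to the original inner product X1^T X2
trials = 40000
R = rng.normal(0.0, sigma_r, size=(trials, k, m))
Y1 = R @ X1 / (np.sqrt(k) * sigma_r)       # shape (trials, k, n)
Y2 = R @ X2 / (np.sqrt(k) * sigma_r)
est = np.einsum('tki,tkj->ij', Y1, Y2) / trials

assert np.allclose(est, X1.T @ X2, atol=0.15)
```

Note that any single draw of R only preserves inner products in expectation; the variance of the estimate shrinks as k grows.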
Randomized Response
Randomized Response (RR) is a technique originally developed in the statistics community to collect sensitive information from individuals in such a way that survey interviewers, and those who process the data, do not know which of two alternative questions the respondent has answered. Instead of asking an interviewee whether he or she belongs to a sensitive category A, the interviewer asks each interviewee two mutually exclusive questions:
1. Do you belong to the category A?
2. Do you belong to the category Ā?
The question to be answered is determined by a randomizing device. The probability of choosing the first question is θ, and the probability of choosing the opposite one is 1 − θ.
Without knowing which question is answered, the interviewer shall have no idea about the meaning of the collected response, e.g., "yes" or "no". Since no one but the respondent knows to which question the answer pertains, the technique provides response confidentiality and increases respondents' willingness to answer sensitive questions. Assuming all interviewees tell the truth, we have

P("yes") = P(A) · θ + P(Ā) · (1 − θ)
         = P(A) · θ + (1 − P(A)) · (1 − θ)                    (2.2)

If θ ≠ 1/2, without accessing the exact private information, the proportion of interviewees who actually belong to the category A can be estimated:
P(A) = (θ − 1)/(2θ − 1) + P("yes")/(2θ − 1)
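This estimator can be exercised with a small simulation; the sketch below (our own illustration, with hypothetical parameters θ = 0.7 and P(A) = 0.3) simulates truthful respondents and inverts Equation 2.2:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, p_true = 0.7, 0.3              # P(choose question 1), true P(A)
n = 200000

# Each respondent truthfully answers the randomly chosen question:
# "yes" to Q1 iff in A, "yes" to Q2 iff not in A
in_A = rng.random(n) < p_true
ask_q1 = rng.random(n) < theta
yes = np.where(ask_q1, in_A, ~in_A)

# Invert P("yes") = theta*P(A) + (1-theta)*(1-P(A)) to estimate P(A)
p_yes = yes.mean()
p_hat = (theta - 1.0) / (2.0 * theta - 1.0) + p_yes / (2.0 * theta - 1.0)
assert abs(p_hat - p_true) < 0.02
```

The interviewer never learns which question any individual answered, yet the aggregate proportion P(A) is recovered accurately.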
The Randomized Response (RR) technique was first proposed by Warner in 1965 [Warner 1965]. It is mainly used to deal with categorical data and can be extended to estimate the distribution of numerical data. Other models and corresponding discussions for categorical and numerical data can be found in [Chaudhuri and Mukerjee 1988; Poole 1974; Duffy and Waterton 1984; Poole and Clayton 1982]. In the data mining community, Rizvi and Haritsa presented a scheme called MASK to mine associations with secrecy constraints [Rizvi and Haritsa 2002]. Du and Zhan proposed an approach to conduct privacy-preserving decision tree building [Du and Zhan 2003]. Guo et al. addressed the issue of providing accuracy in terms of various reconstructed measures (e.g., support, confidence, correlation, lift, etc.) in privacy preserving market basket data analysis [Guo, Guo and Wu 2007]. More specifically, they presented a general method based on the Taylor series to approximate the mean and variance of estimated variables from the randomized data. They also showed that the derived confidence ranges and the monotonic property of some measures are critical for rule selection.
2.1.3 Data Imputation and Synthesis
In general databases, particular individual values or even the whole data set may be sensitive and might lead to identification disclosure. Anonymization can be achieved by suppressing individual values, swapping sensitive values, replacing certain attribute values with a general one, or even replacing the whole data set with a synthetic data set.
Data Swapping
This technique transforms the database by switching a subset of attributes between
selected pairs of records so that statistical properties such as marginal distributions of
individual attributes are preserved while data confidentiality is achieved. This technique
was first proposed in [Dalenius and Reiss 1982]. A variety of refinements and
applications [Fienberg and McIntyre 2003] of data swapping have been addressed since
its initial appearance.
Data Suppression
As a technique applied in Statistical Databases (SDBs), suppression removes those cells in released tables that might directly or indirectly disclose confidential information. Its early application by census bureaus for data publishing was studied in [Cox 1980; Sande 1983]. Thorough studies on this technique can be found in [Denning, Schlörer and Wehrle 1982; Özsoyoglu and Chung 1986]. Due to the high information loss caused by this technique and the complexity of queries in practice, its application to real-world databases has inevitable limitations.
Data Synthesis
In [Rubin 1993], the author first suggested developing techniques for publishing synthetic data without releasing any actual individual value. Based on the success of multiple imputation [Rubin 1987], the published synthetic data set could be derived from distributions learned from the actual data. Many difficult and complex modeling issues have been addressed [Kam and Ullman 1977; Papageorgiou et al. 2001; Reiter 2002; Reiter 2003; Raghunathan, Reiter and Rubin 2003].
In [Ramesh, Maniatty and Zaki 2003], the authors proposed a method to generate a market basket data set for benchmarking when the length distributions of frequent and maximal frequent itemset collections are available. Wu et al. in [Wu, Wang and Zheng 2003; Wang, Wu and Zheng 2004; Wu et al. 2005a; Wu, Wang and Zheng 2005; Wu et al. 2005b] proposed a general framework for privacy preserving database application testing by generating synthetic data sets based on some a priori knowledge about the production databases. General a priori knowledge such as statistics and rules can also be taken as constraints on the underlying data records.
In [Aggarwal and Yu 2004], the authors proposed a condensation approach which aims at preserving the covariance matrix for multiple columns. Different from the randomization approach, it perturbs multiple columns as a whole to generate the perturbed dataset. The authors argued that since the perturbed dataset preserves the covariance matrix, most existing data mining algorithms can be applied directly to the perturbed dataset without redeveloping new ones.
2.2 Research Issues in Preserving Privacy for Numerical Data
This section highlights the main research issues which will be addressed in the remainder of this dissertation. These issues arise from the different privacy preserving models introduced in the previous section, with a focus on numerical data.
2.2.1 Issues in Additive-Noise-Based Perturbation
Consider a data set X with m records of n attributes and a noise data set E with the same dimensions as X. The random value perturbation techniques generate a perturbed data matrix Y = X + E. Let X̂ denote the estimate which the users (or attackers) can achieve. To preserve utility, certain aggregate characteristics (i.e., mean and covariance matrices for numerical data, or marginal totals in a contingency table for categorical data) of X should remain basically unchanged in the perturbed data Y or should be restorable from the estimated data X̂. In other words, distributions of X can be approximately reconstructed from the perturbed data Y when some a priori knowledge (e.g., distribution, statistics, etc.) about the noise E is available, using distribution reconstruction approaches (e.g., [Agrawal and Agrawal 2001], [Agrawal and Srikant 2000]). To preserve privacy, not only the difference between Y and X but also that between X̂ and X should be greater than some tolerated threshold. Here we follow the tradition of using this difference as the measure to quantify how much privacy is preserved.
A key element in preserving the privacy and confidentiality of sensitive data is the ability to evaluate the extent of all potential disclosure for released data. In other words, we need to answer to what extent confidential information in the perturbed data can be compromised by attackers. Hence, we should consider not only the perturbed data, Y, which is released directly, but also the estimated data, X̂, which attackers may obtain by exploiting various reconstruction methods. The methods investigated in [Agrawal and Agrawal 2001; Agrawal and Srikant 2000] only focused on how to reconstruct the distribution of the original data from the perturbed data; they did not consider the issue that attackers may reconstruct the individual values through various means. The previous work in [Huang, Du and Chen 2005; Kargupta et al. 2003] exploited spectral properties of the data and showed that the noise may be separated from the perturbed data under some conditions, and as a result privacy could be seriously compromised. Although they empirically assessed the effects of perturbation on the accuracy of the estimated individual values, one major question is what explicit relation may exist between reconstruction accuracy and the noise added. In other words, what bounds of reconstruction accuracy can be achieved by this spectral filtering technique? Other research issues include, but are not restricted to, developing the best strategy to reconstruct the data, quantifying privacy disclosure, and exploring other possible threats to the data owner's privacy.
2.2.2 Issues in Projective-Transformation-Based Perturbation
Distance-preserving projection has gained much attention in privacy-preserving data mining in recent years since it mitigates the privacy/accuracy tradeoff by achieving perfect data mining accuracy. Meanwhile, its vulnerabilities are still of great interest to data owners and attackers.
A general projection-based perturbation can be expressed as

Y = RX

where R is a transformation matrix and X, Y are the input and output respectively.
For the case where R^T R = RR^T = I, it seems that privacy is well preserved after rotation; however, a small known sample may be exploited by attackers to breach privacy
completely. We assume that a small data sample from the same population as X is available to attackers, denoted X̃. When X ∩ X̃ = X‡ ≠ ∅, since many geometric properties (e.g., vector length, distance and inner product) are preserved, attackers can easily locate X‡'s corresponding part, Y‡, in the perturbed data set by comparing those values. From Y = RX, we know the same linear transformation holds between X‡ and Y‡: Y‡ = RX‡. Once the size of X‡ is at least rank(X) + 1, the transformation matrix R can easily be derived through linear regression.
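This known-input attack can be sketched in a few lines of NumPy (our own illustration; we assume the correspondence between X‡ and Y‡ has already been established by distance comparison):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, n_known = 3, 50, 5
X = rng.normal(size=(d, n))                   # original data, columns = records
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # secret orthonormal R
Y = Q @ X                                     # released perturbed data

# The attacker knows a few original records and has located their images in Y
X_known, Y_known = X[:, :n_known], Y[:, :n_known]

# Solve Y_known = R_hat X_known in the least-squares sense, then invert
R_hat = Y_known @ np.linalg.pinv(X_known)
X_rec = R_hat.T @ Y                           # R_hat is (near-)orthonormal

assert np.allclose(R_hat, Q)
assert np.allclose(X_rec, X)
```

With more known records than rank(X), the least-squares solution recovers R exactly (up to numerical precision), and with it the entire original data set.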
For the case where X‡ = ∅ or is too small, the authors in [Liu, Giannella and Kargupta 2006] proposed a Principal Component Analysis (PCA) based attack. The idea is briefly given as follows. Since the known sample and the private data share the same distribution, the eigenspaces (eigenvalues) of their covariance matrices are expected to be close to each other. As noted above, the transformation here is a geometric rotation which does not change the shape of distributions (i.e., the eigenvalues derived from the sample data are close to those derived from the transformed data). Hence, the rotation angles between the eigenspace derived from the known samples and that derived from the transformed data can be easily identified. In other words, the rotation matrix R is recovered.
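A minimal NumPy sketch of this idea follows (our own illustration; note that each eigenvector is only identified up to sign, so the attack recovers R up to per-eigenvector sign flips, which we resolve here by enumeration):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
d = 2
C = np.array([[4.0, 1.5], [1.5, 1.0]])        # shared population covariance
L = np.linalg.cholesky(C)
X = L @ rng.normal(size=(d, 100000))           # private data
X_s = L @ rng.normal(size=(d, 2000))           # attacker's known sample
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # secret rotation matrix
Y = Q @ X                                      # released data

# Eigenvectors of cov(Y) are the rotated eigenvectors of cov(X); aligning
# them with the eigenvectors of the sample covariance recovers the rotation,
# up to a sign flip per eigenvector (2^d candidates)
_, Vy = np.linalg.eigh(np.cov(Y))
_, Vs = np.linalg.eigh(np.cov(X_s))
errs = [np.linalg.norm(Vy @ np.diag(s) @ Vs.T - Q)
        for s in product([1.0, -1.0], repeat=d)]
assert min(errs) < 0.1     # one sign choice recovers R closely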
We note that all the above attacks apply only to the case in which the transformation matrix is orthonormal. In our general setting, the transformation matrix R can be any matrix (e.g., shrinking, stretching, dimension reduction) rather than a simple orthonormal rotation matrix. When we try to apply the PCA attack to the non-isometric projection scenario, the eigenvalues derived from the sample data are not the same as those derived from the transformed data. Hence, we cannot derive the transformation matrix R from spectral analysis. As a result, the previous PCA-based attack no longer works. Is individual privacy well protected by such a transformation? Is this kind of perturbation vulnerable to any other attacks? A more thorough disclosure analysis for this general scenario will be presented in this dissertation.
2.2.3 Issues in Model-Based Privacy Preserving Data Mining
The issue of confidentiality and privacy in general databases has become increasingly prominent in recent years. A key element in preserving the privacy and confidentiality of sensitive data is the ability to evaluate the extent of all potential disclosure for such data. In other words, we need to be able to answer to what extent confidential information in a perturbed or transformed database can be compromised by attackers or snoopers. This is a major challenge for current randomization-based approaches.
To evaluate the privacy and confidentiality residing in general databases which contain both categorical and numerical attributes, the authors in [Wu, Wang and Zheng 2005] proposed a general framework for modeling general databases using the General Location model. One advantage of the general location model is that it can be used to conduct both identity disclosure analysis and value disclosure analysis, since it integrates both categorical and numerical attributes in one model. Our research will focus on value disclosure for numerical attributes and give solutions to control the privacy by tuning corresponding parameters of the model.
2.3 Summary
Due to the increasing ability to trace, collect and analyze large amounts of personal or sensitive data, privacy has become an important issue in various domains. In this chapter, we provided an overview of the PPDM techniques in the literature, which can be classified into Secure Multi-party Computation, Data Randomization, and Data Imputation and Synthesis. Research issues related to Data Randomization were outlined and split into three particular areas: data reconstruction in the additive-noise-based perturbation model, data reconstruction in the projection-based perturbation model, and disclosure control in a general-location-model-based privacy preserving application.
CHAPTER 3: DISCLOSURE ANALYSIS OF THE ADDITIVE-NOISE-BASED
PERTURBATION
3.1 Additive-Noise-Based Perturbation Model
In [Agrawal and Srikant 2000], Agrawal and Srikant first proposed the additive-noise-based perturbation method for building decision-tree classifiers. To hide the original n values x1, · · · , xn, n independent random noises e1, · · · , en are added, and the perturbed data y1, · · · , yn are released for data mining, where yi = xi + ei. This process can be illustrated by the following example.
Example 1 In Table 1.1, the numerical attributes (i.e., Balance, Income, Interest Paid, etc.) can be expressed by a matrix X where each row represents the record of one customer. By adding a random noise matrix E with the same dimensions, we get the perturbed data Y as follows.
Y = X + E

| 17.334  88.759 ...   2.099 |   | 10  85 ...   2 |   | 7.334 3.759 ... 0.099 |
| 19.199  77.537 ...  25.939 |   | 15  70 ...  18 |   | 4.199 7.537 ... 7.939 |
| 59.199 128.447 ...  38.678 | = | 50 120 ...  35 | + | 9.199 8.447 ... 3.678 |
| 51.208  30.313 ... 135.939 |   | 45  23 ... 134 |   | 6.208 7.313 ... 1.939 |
|   ...     ...  ...    ...  |   | ... ... ... ...|   |  ...   ...  ...  ...  |
| 89.048 115.692 ...  21.318 |   | 80 110 ...  15 |   | 9.048 5.692 ... 6.318 |
The perturbed data are quite different from the original ones, and the distributions of the data also change considerably. Therefore, the privacy of the data providers is supposed to be well protected when the perturbed data are released instead. To be consistent in this chapter,
we use X to denote the original data, E to denote the additive noise, and Y to represent the perturbed data. Formally, we have:

Y = X + E                                  (3.1)
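As a quick numerical illustration of the model (our own sketch, with arbitrary parameters), additive perturbation leaves the mean essentially unchanged and inflates each attribute's variance by σ²:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(50.0, 20.0, size=(1000, 4))    # original data, rows = records
sigma = 5.0
E = rng.normal(0.0, sigma, size=X.shape)      # i.i.d. zero-mean Gaussian noise
Y = X + E                                     # released perturbed data

# Aggregate characteristics survive: the mean is (approximately) unchanged
# and each attribute's variance grows by sigma^2
assert np.allclose(Y.mean(axis=0), X.mean(axis=0), atol=1.0)
assert np.allclose(Y.var(axis=0), X.var(axis=0) + sigma**2, rtol=0.15)
```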
Agrawal and Srikant also showed that the original density distribution of X can be
reconstructed effectively given the perturbed data and the noise’s distribution, fE . Based
on the reconstructed distribution, decision-tree classifiers can be built with the accuracy
comparable to the accuracy of classifiers built with the original data. Their reconstruction
algorithm is sketched in Figure 3.1.
input    Y , a given perturbed data set
         fE , the distribution function of the noise
output   f̂X , an estimate of the distribution of the original variable
BEGIN
1    Initialize f⁰X as a uniform distribution.
2    j := 0
     REPEAT
3        f^(j+1)_X(a) = (1/n) Σ_{i=1}^{n} fE(yi − a) f^j_X(a) / ∫_{−∞}^{∞} fE(yi − z) f^j_X(z) dz
4        j := j + 1
     UNTIL (stopping criterion met)
5    f̂X = f^j_X (the last computed estimate)
END

Figure 3.1: Distribution Reconstruction Algorithm
The posterior distribution function for X is estimated as the average of the n posterior distribution functions FXi|Yi=yi for the i.i.d. variables Xi, i = 1, · · · , n. Each FXi|Yi=yi is estimated using Bayes' rule [Fisz 1963]:

FXi|Yi=yi(a) = ∫_{−∞}^{a} fXi|Yi=yi(x) dx
             = ∫_{−∞}^{a} fXi(x) fYi|Xi=x(yi) / ( ∫_{−∞}^{∞} fXi(x′) fYi|Xi=x′(yi) dx′ ) dx
             = ∫_{−∞}^{a} fE(yi − x) fX(x) dx / ∫_{−∞}^{∞} fE(yi − x) fX(x) dx
The above bootstrap process stops when the difference between two successive estimates becomes small, and the process can also be extended to multivariate data.
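A discretized version of the iteration in Figure 3.1 can be sketched as follows (our own illustration, assuming Gaussian noise and a fixed number of iterations in place of a convergence test):

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 5000, 1.0
x = rng.normal(3.0, 1.0, size=n)              # hidden original values
y = x + rng.normal(0.0, sigma, size=n)        # released perturbed values

def noise_pdf(e):                             # Gaussian noise density f_E
    return np.exp(-e**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Discretize the support and iterate the Bayes update a fixed number of times
grid = np.linspace(-2.0, 8.0, 101)
dz = grid[1] - grid[0]
f = np.full_like(grid, 1.0 / (grid[-1] - grid[0]))   # uniform initial guess

for _ in range(10):
    w = noise_pdf(y[:, None] - grid[None, :]) * f[None, :]   # n x |grid|
    w /= w.sum(axis=1, keepdims=True) * dz    # normalize each posterior
    f = w.mean(axis=0)                        # average the posterior densities

# The reconstructed density should peak near the true mean of X (3.0)
assert abs(grid[np.argmax(f)] - 3.0) < 0.5
```

Only the distribution of X is recovered here; no individual xi is revealed by this procedure alone.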
It seems that individual privacy is well protected in this model. However, intensive studies of this model [Kargupta et al. 2003; Huang, Du and Chen 2005; Guo and Wu 2006; Guo, Wu and Li 2006b; Guo, Wu and Li 2006a] have indicated that privacy can still be threatened. The following section gives an introduction to those potential attacks and representative reconstruction algorithms. As part of the contributions of this research, an in-depth discussion of those reconstruction algorithms, especially the estimation strategy and estimation accuracy, will be given in the remaining sections of this chapter.
3.2 Data Reconstruction Attacks
The security of the additive-noise-based approach was first questioned by Kargupta et al. in [Kargupta et al. 2003]. They showed that attackers may exploit a spectral-filtering-based attack to derive individual estimates of the original values from the perturbed data. Huang et al. further proposed two other reconstruction algorithms which are efficient when the noise is independent of the original data. Similar to the Spectral Filtering technique, one is based on Principal Component Analysis (PCA); the other chooses Maximum Likelihood Estimation (MLE) as the estimator. This section offers an overview of these reconstruction methods and presents the privacy issues we address based on them.
3.2.1 Spectral Filtering Method
Consider a noise matrix E with the same dimensions as the original data X. The entries of the noise are i.i.d. random variables with zero mean and variance σ². The random value perturbation technique generates a perturbed data matrix Y = X + E. The objective of the spectral-filtering-based approach is to derive the estimate X̂ of X from the perturbed data Y based on random matrix theory. The authors in [Kargupta et al. 2003] provided an explicit filtering procedure, shown in Figure 3.2. They focused on the scenario where only a small number of instances exists in the data set. In this case, we have λEmin = σ²(1 − 1/√q)²
input    Y , m × n matrix, a given perturbed data set
         σ², variance of the i.i.d. noise with zero mean
output   X̂, an estimate of the original data set
BEGIN
1    Compute the covariance matrix ΣY by ΣY = Y^T Y.
2    Perform an eigenvalue decomposition of ΣY :
         ΣY = QY ΛY QY^T
     where ΛY = diag(λY1, λY2, · · · , λYm) is a diagonal matrix with the eigenvalues
     on its diagonal (λY1 ≥ λY2 ≥ · · · ≥ λYm), and QY = (eY1, eY2, · · · , eYm) is an
     orthogonal matrix whose column vectors are the corresponding eigenvectors of ΣY .
3    Since the noise matrix E is generated using an i.i.d. distribution with zero
     mean and known variance, the eigenvalues of its covariance matrix are
     bounded by λEmin and λEmax according to random matrix theory. This pair of
     bounds is calculated as
         λEmin = σ²(1 − 1/√q)²
         λEmax = σ²(1 + 1/√q)²
     where q is linear in the ratio between the number of records and the
     number of attributes.
4    Extract the components of ΣY which are related to the original data. The
     noise-related eigenvalues satisfy λEmax ≥ λEi ≥ λEi+1 ≥ · · · ≥ λEj ≥ λEmin.
     The remaining k eigenvalues are related to the original data. Let the
     corresponding eigenvectors QYk form an orthonormal basis of a subspace χ̃.
     The orthogonal projection onto χ̃ is calculated as
         Pχ̃ = QYk QYk^T
5    Obtain the estimated data set as X̂ = Y Pχ̃.
END

Figure 3.2: Spectral Filtering Process
and λEmax = σ²(1 + 1/√q)², where q is linear in the ratio between the number of records and the number of attributes. The authors developed a method for filtering out the k principal eigenvalues. Since in most data mining applications the number of records far exceeds that of attributes (hence q is large), we can see that λEmin ≈ λEmax ≈ σ².
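The procedure of Figure 3.2 can be sketched as follows (our own NumPy illustration; the covariance is normalized by n so that the noise-related eigenvalues concentrate near σ², and a small safety margin of 2σ² is used as the cutoff for finite samples):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, sigma = 5000, 2, 0.5
t = rng.normal(size=n)
X = np.column_stack([t, 2.0 * t])             # perfectly correlated attributes
Y = X + rng.normal(0.0, sigma, size=(n, m))   # additive i.i.d. noise

# Eigen-decompose the normalized covariance of Y; noise-related eigenvalues
# concentrate near sigma^2, so keep only clearly larger components
lam, Q = np.linalg.eigh(Y.T @ Y / n)          # eigenvalues in ascending order
keep = Q[:, lam > 2 * sigma**2]               # signal-related eigenvectors
X_hat = Y @ keep @ keep.T                     # project Y onto the signal subspace

# The filtered estimate is closer to X than the raw perturbed data
assert np.linalg.norm(X_hat - X) < np.linalg.norm(Y - X)
```

The filtering works here because the highly correlated data concentrate in a low-dimensional subspace, while the i.i.d. noise spreads over all dimensions.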
3.2.2 PCA-Based Reconstruction Method
Huang et al. in [Huang, Du and Chen 2005] argued that when the original data is highly correlated, the information loss can be more clearly quantified by their proposed PCA-based method. They also pointed out that the correlation among the
input    Y , m × n matrix, a given perturbed data set
         σ², variance of the i.i.d. noise with zero mean
output   X̂, an estimate of the original data set
BEGIN
1    Compute the covariance matrix ΣY .
2    Conduct PCA on ΣY to get its eigenvalues and eigenvectors:
         ΣY = QY ΛY QY^T
     where ΛY = diag(λY1, λY2, · · · , λYm) is a diagonal matrix with the eigenvalues
     on its diagonal (λY1 ≥ λY2 ≥ · · · ≥ λYm), and QY = (eY1, eY2, · · · , eYm) is an
     orthogonal matrix whose column vectors are the corresponding eigenvectors of ΣY .
3    Derive an approximated covariance matrix of the original data:
         Σ̂X = Q̂X Λ̂X Q̂X^T
     where Λ̂X = diag(λY1 − σ², λY2 − σ², · · · , λYm − σ²) and Q̂X = QY .
4    Extract the k principal components of Σ̂X. Let Q̂Xk contain the
     corresponding eigenvectors of the principal components.
5    Obtain the estimated data set as X̂ = Y Q̂Xk Q̂Xk^T.
END

Figure 3.3: PCA-Based Reconstruction Method
original data and the correlation between the original data and the noise are key factors affecting the privacy of a randomized data set. Their theoretical discussion and empirical evaluation indicate that adding correlated noise with a correlation matrix close to that of the original data might better preserve privacy.
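The procedure of Figure 3.3 can be sketched as follows (our own illustration; k is set by hand here, since the method leaves the choice of k unspecified):

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, sigma = 5000, 3, 0.5
t = rng.normal(size=n)
X = np.column_stack([t, 2.0 * t, -t])         # rank-1 original data
Y = X + rng.normal(0.0, sigma, size=(n, m))

lam, Q = np.linalg.eigh(Y.T @ Y / n)          # eigenvalues in ascending order
lam_x = lam - sigma**2                        # estimated eigenvalues of cov(X)
k = 1                                         # number of principal components
Qk = Q[:, np.argsort(lam_x)[::-1][:k]]        # top-k estimated eigenvectors
X_hat = Y @ Qk @ Qk.T                         # project onto principal subspace

assert np.linalg.norm(X_hat - X) < np.linalg.norm(Y - X)
```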
3.2.3 MLE-Based Reconstruction Method
Huang et al. in [Huang, Du and Chen 2005] also proposed another reconstruction method based on MLE (Maximum Likelihood Estimation), which can be conducted following the procedure in Figure 3.4.
When the noise has a correlation structure similar to that of the original data, the authors in [Huang, Du and Chen 2005] also modified their estimate as

x̂i = (Σ̂X⁻¹ + ΣE⁻¹)⁻¹ (Σ̂X⁻¹ μ̂X − ΣE⁻¹ μE + ΣE⁻¹ yi)
input    Y , m × n matrix, a given perturbed data set
         σ², variance of the i.i.d. noise with zero mean
output   X̂, an estimate of the original data set
BEGIN
1    Compute the covariance matrix ΣY .
2    Conduct PCA on ΣY to get its eigenvalues and eigenvectors:
         ΣY = QY ΛY QY^T
     where ΛY = diag(λY1, λY2, · · · , λYm) is a diagonal matrix with the eigenvalues
     on its diagonal (λY1 ≥ λY2 ≥ · · · ≥ λYm), and QY = (eY1, eY2, · · · , eYm) is an
     orthogonal matrix whose column vectors are the corresponding eigenvectors of ΣY .
3    Derive an approximated covariance matrix of the original data:
         Σ̂X = Q̂X Λ̂X Q̂X^T
     where Λ̂X = diag(λY1 − σ², λY2 − σ², · · · , λYm − σ²) and Q̂X = QY .
4    Estimate the mean vector of the original data from the perturbed data:
         μ̂X = μY
5    Obtain the estimated data set X̂ with its row vectors given by
         x̂i = (Σ̂X⁻¹ + (1/σ²) · I)⁻¹ (Σ̂X⁻¹ μ̂X + yi/σ²)
END

Figure 3.4: MLE-Based Reconstruction Method
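The procedure of Figure 3.4 can be sketched as follows (our own illustration; the original data are drawn from a full-rank Gaussian so that Σ̂X is invertible):

```python
import numpy as np

rng = np.random.default_rng(9)
n, m, sigma = 5000, 2, 1.0
mu = np.array([10.0, 5.0])
C = np.array([[4.0, 1.5], [1.5, 1.0]])        # full-rank original covariance
X = rng.multivariate_normal(mu, C, size=n)
Y = X + rng.normal(0.0, sigma, size=(n, m))

# Steps 1-4: estimate the mean and covariance of X from the perturbed data
mu_hat = Y.mean(axis=0)
Yc = Y - mu_hat
lam, Q = np.linalg.eigh(Yc.T @ Yc / n)
S = Q @ np.diag(lam - sigma**2) @ Q.T         # estimated cov(X)

# Step 5: x_i = (S^-1 + I/sigma^2)^-1 (S^-1 mu_hat + y_i / sigma^2)
S_inv = np.linalg.inv(S)
A = np.linalg.inv(S_inv + np.eye(m) / sigma**2)
X_hat = (A @ (S_inv @ mu_hat[:, None] + Y.T / sigma**2)).T

assert np.linalg.norm(X_hat - X) < np.linalg.norm(Y - X)
```

The estimate is the Gaussian posterior mean: each record is shrunk from yi toward μ̂X in proportion to how noisy each direction is.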
3.2.4 Privacy Issues
Since the original data might be highly correlated and the noise is usually assumed to be Gaussian with zero mean, the above techniques have proved able to keep the principal components of the original data and filter out the noise. One common idea behind the Spectral Filtering technique and the PCA-based reconstruction method is that they both estimate the individual values by projecting the perturbed data onto a space spanned by some representative eigenvectors. The information in the original data is expected to be largely preserved by such a projection, while the noise is expected to be reduced to a great extent. When we consider the scenario with a large number of instances in the data sets, λEmin ≈ λEmax ≈ σ²; therefore, these two methods are essentially the same. In this research, we focus on the representative Spectral Filtering technique under this scenario.
By adding one dimension to the projective space χ, more information will be preserved, but more noise will also be introduced. So should we add one more dimension? Previous works provided different strategies to reconstruct the data. The spectral-filtering-
based method keeps the principal components of the perturbed data with eigenvalues no less than the variance of the noise, σ². However, this strategy was not proven to be optimal. The PCA-based method requires k principal components of the estimated covariance matrix of the original data, but it did not give an explicit strategy to determine the number of principal components (i.e., k). In this research, we propose an optimized strategy following the essential idea of the Spectral Filtering technique. We also note that previous works in [Huang, Du and Chen 2005; Kargupta et al. 2003] only empirically assessed the effects of perturbation on the accuracy of the estimated individual values. In this research, we explore the explicit relation between the estimation error (X̂ − X) and the noise E and give clear bounds on ‖X̂ − X‖F. The upper and lower bounds of the estimation error are significant for both data miners and attackers. As we introduced, it is possible to reconstruct the distribution of the original data given the distribution function of the noise [Agrawal and Srikant 2000]. Another challenge here is whether the reconstructed distribution can be exploited by attackers or snoopers to threaten sensitive individual privacy. We present one simple attack using the Inter-Quantile Range (IQR) of the reconstructed distribution and show the disclosure of individual privacy from such aggregate information.
Definition 3.1 Let A ∈ Cm×n. The Frobenius norm of A is the number

‖A‖F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |aij|² )^{1/2}

The 2-norm of A is

‖A‖2 = max_{x≠0} ‖Ax‖2 / ‖x‖2

where ‖x‖2 denotes the 2-norm (Euclidean norm) of a vector.
Definition 3.1 gives the mathematical form of the Frobenius norm and the 2-norm, which
will be used in the following parts. In this study, we cast much of our analysis in terms
of absolute and relative errors in matrix norms, instead of component-wise bounds.
Basically, the Frobenius norm is used to measure the total magnitude of the data, while
the 2-norm is used to denote the largest eigenvalue of a covariance matrix.
We list below some properties of matrix norms which will be used in our proofs. Refer to
linear algebra books (e.g., [Stewart and Sun 1990]) for more details.

1. ||AB||_F ≤ ||A||_F ||B||_F and ||AB||_2 ≤ ||A||_2 ||B||_2, when B ∈ C^{n×q}.
2. ||A||_2 ≤ ||A||_F ≤ √n ||A||_2.
3. ||A||_2 = √(λ_max(A^T A)), the square root of the largest eigenvalue of A^T A.
4. If A is symmetric, then ||A||_2 = λ_max(A), the largest eigenvalue of A.
Definition 3.2 Let X be a subspace of C^n and let the columns of Q_X form an orthonormal
basis for X. The matrix P_X = Q_X Q_X^T is called the orthogonal projection onto X.
3.3 An Improved Strategy for Noise Filtering
The original Spectral Filtering algorithm applied the following strategy to determine
the first k eigen components.

Strategy 1: k = max{i | λ_{Y_i} ≥ λ_{E_max}}. When the data set is large,
λ_{E_max} ≈ λ_{E_min} ≈ λ_E, and the strategy becomes k = max{i | λ_{Y_i} ≥ λ_E}.
We point out that Strategy 1 applied in [Kargupta et al. 2003] will in general not give
the optimal reconstruction. The reason is that it aims to include all significant eigen
components (with λ_{Y_i} > 0) in the projection space for reconstruction. However, since
the inclusion of an eigen component also brings some additional noise projected on that
eigenvector, the benefit of including one insignificant eigen component may be diminished
by the side effect of the additional noise projected on it. In this research, we propose
a new strategy (Strategy 2) which compares the benefit of including one component with
the loss due to the additional projected noise. We show that Strategy 2 is expected to
give an approximately optimal reconstruction. This strategy is also used in our bound
analysis.
Strategy 2: The estimate X̂ = Y P_χ̃ = Y Q_{Y_k} Q_{Y_k}^T is approximately optimal when
k = max{i | λ_{Y_i} ≥ 2λ_E}.
Proof. In the Spectral Filtering method, when we select the first k components, the error
matrix can be expressed as

    f(k) = X̂ − X
         = (X + E) Q_{Y_k} Q_{Y_k}^T − X
         = (X + E) Q_Y [I_k 0; 0 0] Q_Y^T − X
         = E Q_Y [I_k 0; 0 0] Q_Y^T − X [Q_Y I Q_Y^T − Q_Y [I_k 0; 0 0] Q_Y^T]
         = E Q_Y [I_k 0; 0 0] Q_Y^T − X Q_Y [0 0; 0 I_{n−k}] Q_Y^T                 (3.2)
Similarly, when we select the first k + 1 components, the error matrix becomes

    f(k+1) = E Q_Y [I_{k+1} 0; 0 0] Q_Y^T − X Q_Y [0 0; 0 I_{n−k−1}] Q_Y^T
           = E [Q_Y [I_k 0; 0 0] Q_Y^T + e_{Y_{k+1}} e_{Y_{k+1}}^T]
             − X [Q_Y [0 0; 0 I_{n−k}] Q_Y^T − e_{Y_{k+1}} e_{Y_{k+1}}^T]
           = (E Q_Y [I_k 0; 0 0] Q_Y^T − X Q_Y [0 0; 0 I_{n−k}] Q_Y^T)
             + E e_{Y_{k+1}} e_{Y_{k+1}}^T + X e_{Y_{k+1}} e_{Y_{k+1}}^T
           = f(k) + E e_{Y_{k+1}} e_{Y_{k+1}}^T + X e_{Y_{k+1}} e_{Y_{k+1}}^T      (3.3)
The last two terms in Equation 3.3 are the projections of the noise and the data on the
(k+1)-th eigenvector. Assuming e_{Y_i} ≈ e_{X_i}, the strength of the data projection can
be approximated as

    ||X e_{Y_{k+1}} e_{Y_{k+1}}^T||_F²
      ≈ ||X e_{X_{k+1}} e_{X_{k+1}}^T||_F²
      = Tr[(X e_{X_{k+1}} e_{X_{k+1}}^T)^T (X e_{X_{k+1}} e_{X_{k+1}}^T)]
      = Tr(e_{X_{k+1}} e_{X_{k+1}}^T X^T X e_{X_{k+1}} e_{X_{k+1}}^T)
      = Tr[e_{X_{k+1}} e_{X_{k+1}}^T (Σ_{i=1}^{n} λ_{X_i} e_{X_i} e_{X_i}^T) e_{X_{k+1}} e_{X_{k+1}}^T]
      = Tr(λ_{X_{k+1}} e_{X_{k+1}} e_{X_{k+1}}^T)
      = λ_{X_{k+1}}
For i.i.d. noise, the effect of the projection on any vector should be the same. Thus,

    ||E e_{Y_{k+1}} e_{Y_{k+1}}^T||_F² ≈ λ_E

Hence, we include the i-th component only when the following condition is satisfied:

    λ_{X_i} ≥ λ_E                                                                  (3.4)

that is, when the benefit due to the inclusion of the i-th eigen component is larger than
the loss due to the noise projected along that eigenvector.
Consider data variables x_i, x_j and zero-mean noise variables e_i, e_j, where the noise
is independent of the data. From the definitions of covariance and variance, it is easy
to derive

    Cov(x_i + e_i, x_j + e_j) = <(x_i + e_i)(x_j + e_j)> − <x_i + e_i><x_j + e_j>
                              = <x_i x_j> + <e_i x_j> + <x_i e_j> + <e_i e_j> − <x_i><x_j>
                              = <x_i x_j> + <e_i e_j> − <x_i><x_j>
                              = Cov(x_i, x_j) + Cov(e_i, e_j)

Therefore,

    Var(x_i + e_i) = Var(x_i) + Var(e_i)                                           (3.5)
Considering condition (3.4) together with Equation (3.5), we have
λ_{Y_i} = λ_{X_i} + λ_E ≥ 2λ_E. Hence

    k = max{i | λ_{Y_i} ≥ 2λ_E}
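The reconstruction implied by Strategy 2 takes only a few lines of linear algebra. The following is a minimal sketch (not the exact experimental code of this dissertation); the toy rank-2 data set, the noise level, and the helper name `spectral_filter` are assumptions for the example:

```python
import numpy as np

def spectral_filter(Y, lam_E):
    """Spectral filtering with Strategy 2 (a sketch).

    Y     : m x n perturbed data matrix (Y = X + E)
    lam_E : (approximate) eigenvalue of E^T E for i.i.d. noise
    Keeps eigen components of Y^T Y with lambda_{Y_i} >= 2 * lam_E.
    """
    lam, Q = np.linalg.eigh(Y.T @ Y)        # eigh returns ascending eigenvalues
    lam, Q = lam[::-1], Q[:, ::-1]          # sort descending
    k = int(np.sum(lam >= 2.0 * lam_E))     # k = max{i | lambda_{Y_i} >= 2 lambda_E}
    Qk = Q[:, :k]
    return Y @ Qk @ Qk.T                    # X_hat = Y P_chi_tilde

# Toy example: hypothetical rank-2 (highly correlated) data plus i.i.d. noise.
rng = np.random.default_rng(0)
m, n, sigma = 2000, 5, 0.5
X = rng.normal(size=(m, 2)) @ rng.normal(size=(2, n))
E = rng.normal(scale=sigma, size=(m, n))
Y = X + E
lam_E = sigma ** 2 * (m - 1)                # Var(E) = lam_E / (m - 1)
X_hat = spectral_filter(Y, lam_E)
# Filtering should bring the estimate closer to X than the raw perturbed data.
assert np.linalg.norm(X_hat - X) < np.linalg.norm(Y - X)
```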
3.4 Upper Bound Analysis
The traditional matrix perturbation theory [Stewart and Sun 1990] focuses on how a
perturbation B¹ affects the matrix A. Specifically, it provides precise upper bounds on
the eigenvalues, the angles between eigenvectors, or the invariant subspaces of a matrix
A and those of its perturbation à = A + B, in terms of the norms of the perturbation
matrix B.
In our scenario,

    A = X^T X
    Ã = Y^T Y = (X + E)^T (X + E) = X^T X + E^T X + X^T E + E^T E
    B = Ã − A = E^T X + X^T E + E^T E

B can be interpreted as a perturbation on the covariance matrix caused by the additive
noise E. The primary perturbation Y, which is obtained from X by the addition of an
explicit perturbation E, is more meaningful to users than the derived perturbation B.
Hence, it is more significant to consider how the primary perturbation E affects the data
matrix X rather than how the derived perturbation B affects the covariance matrix A. Note
that E^T X ≈ X^T E ≈ 0 when the data and noise are uncorrelated, so the above can be
simplified as Y^T Y = (X + E)^T (X + E) ≈ X^T X + E^T E.
Since

    X̂ − X ≈ Y P_χ̃ − X P_χ = Y P_χ̃ − (Y − E) P_χ = Y (P_χ̃ − P_χ) + E P_χ

we have

    ||X̂ − X||_F ≈ ||Y (P_χ̃ − P_χ) + E P_χ||_F
                ≤ ||Y (P_χ̃ − P_χ)||_F + ||E P_χ||_F
                ≤ ||Y||_F ||P_χ̃ − P_χ||_F + ||E P_χ||_F                           (3.6)

¹ In [Stewart and Sun 1990], the author uses E to denote such a perturbation.
From Equation 3.6, we can see that the difference between the estimated data set and the
original one is determined by the invariant subspaces χ (χ̃) of A (Ã); it is therefore
natural to assess the bias between these subspaces.
Proposition 1 Let A ∈ R^{n×n} be a symmetric positive definite matrix, let λ_1 ≥ λ_2 ≥
· · · ≥ λ_n be its eigenvalues, and let e_1, e_2, · · · , e_n be the corresponding
eigenvectors. Define X = [e_1 e_2 · · · e_k] ∈ R^{n×k} and Y = [e_{k+1} · · · e_n] ∈
R^{n×(n−k)}, so that the matrix [X Y] ∈ R^{n×n} is orthogonal. Given a perturbation B,
let à = A + B and ε = ||B||_F. Let χ and χ̃ be the invariant subspaces of A and Ã
respectively, where χ is spanned by X, and let P_χ and P_χ̃ be the corresponding
orthogonal projections onto these invariant subspaces. Define the eigengap
δ = λ_k − λ_{k+1}. There exists a matrix P satisfying

    ||P||_F ≤ √2 ε / (δ − √2 ε)

so that the columns of X̃ = (X + Y P)(I + P^T P)^{−1/2} form an orthonormal basis for the
subspace spanned by the first k eigenvectors of Ã, and

    ||P_χ̃ − P_χ||_F ≤ 2ε / (δ − √2 ε)                                             ¦
Before we prove this proposition, let us introduce a lemma first.

Lemma 3.1 Let A ∈ R^{n×n} be a symmetric positive definite matrix, let λ_1 ≥ λ_2 ≥ · · ·
≥ λ_n be its eigenvalues, and let e_1, e_2, · · · , e_n be the corresponding
eigenvectors. Let X = [e_1 e_2 · · · e_k] and Y = [e_{k+1} · · · e_n] so that the matrix
[X Y] ∈ R^{n×n} is orthogonal. Given a perturbation B, let à = A + B, let ε = ||B||_F,
and define δ = λ_k − λ_{k+1}. If δ > 2√2 ε, then there is a matrix P satisfying

    ||P||_F ≤ √2 ε / (δ − √2 ε)

so that the columns of X̃ = X + Y P span the subspace spanned by the first k eigenvectors
ẽ_1, ẽ_2, · · · , ẽ_k of Ã.
Proof. Since A is a symmetric positive definite matrix, we can apply the spectral
decomposition to A:

    [X Y]^T A [X Y] = [L_1 0; 0 L_2]                                               (3.7)

where L_1 = diag(λ_1, · · · , λ_k) and L_2 = diag(λ_{k+1}, · · · , λ_n). Also, let

    B̃ = [X Y]^T B [X Y] = [F_11 F_12; F_21 F_22]                                  (3.8)

From Theorem V.2.8 of [Stewart and Sun 1990], there exists a matrix P satisfying

    ||P|| ≤ 2 ||F_21|| / (δ − ||F_11|| − ||F_22||)                                 (3.9)
Since [X Y] is unitary and ε = ||B||_F, it holds that

    ||B̃||_F = ||[X Y]^T B [X Y]||_F = ||B||_F = ε

Moreover, since ||F_11||_F² + ||F_12||_F² + ||F_21||_F² + ||F_22||_F² = ||B̃||_F² and B̃
is symmetric, we have

    ||F_21||_F = ||F_12||_F                                                        (3.10)

    ||F_21||_F² = ||F_12||_F² ≤ (1/2) ||B̃||_F²,  so  ||F_21||_F ≤ ε/√2            (3.11)

    (||F_11||_F + ||F_22||_F)² ≤ 2(||F_11||_F² + ||F_22||_F²) ≤ 2||B̃||_F² = 2||B||_F²,
    so  ||F_11||_F + ||F_22||_F ≤ √2 ||B||_F                                       (3.12)

    δ − ||F_11||_F − ||F_22||_F ≥ δ − √2 ε                                         (3.13)

Hence,

    ||P||_F ≤ √2 ε / (δ − √2 ε)                                                    (3.14)
so that the columns of X̃ = X + Y P span the subspace spanned by the first k eigenvectors
of Ã. The representation of à with respect to X̃ is

    L̃_1 = L_1 + F_11 + F_12 P                                                     (3.15)

The eigenvalues associated with these k eigenvectors are the eigenvalues of L̃_1, and the
eigenvalues associated with the rest of Ã's eigenvectors are the eigenvalues of

    L̃_2 = L_2 + F_22 − P F_12                                                     (3.16)

Thus, to complete the proof of the lemma, it suffices to verify that the eigenvalues of
L̃_1 are all (strictly) larger than the eigenvalues of L̃_2.
Since δ > 2√2 ε, we have

    ||P||_F ≤ √2 ε / (δ − √2 ε) < 1                                                (3.17)

Similarly, we can derive

    ||F_11||_F + ||F_12||_F ≤ √2 ||B||_F                                           (3.18)

Then we have

    ||F_11 + F_12 P||_F ≤ ||F_11||_F + ||F_12 P||_F
                        ≤ ||F_11||_F + ||F_12||_F ||P||_F
                        ≤ ||F_11||_F + ||F_12||_F
                        ≤ √2 ||B||_F = √2 ε                                        (3.19)

By the same argument, we also have

    ||F_22 − P F_12||_F ≤ √2 ε                                                     (3.20)
Since the Frobenius norm upper-bounds the spectral norm, this also shows

    ||F_11 + F_12 P||_2 ≤ √2 ε                                                     (3.21)

    ||F_22 − P F_12||_2 ≤ √2 ε                                                     (3.22)
Let the eigenvalues of L̃_1 be λ̃_1, λ̃_2, · · · , λ̃_k, and those of L̃_2 be
λ̃_{k+1}, · · · , λ̃_n. The spectral variation of L̃_1 with respect to L_1 is

    sv_{L_1}(L̃_1) = max_{i=1..k} min_{j=1..k} |λ̃_i − λ_j|                        (3.23)

The spectral variation of L̃_2 with respect to L_2 is

    sv_{L_2}(L̃_2) = max_{i=k+1..n} min_{j=k+1..n} |λ̃_i − λ_j|                    (3.24)

From Corollary IV.3.4 of [Stewart and Sun 1990]:

    sv_{L_1}(L̃_1) ≤ ||F_11 + F_12 P||_2 ≤ √2 ε                                    (3.25)

    sv_{L_2}(L̃_2) ≤ ||F_22 − P F_12||_2 ≤ √2 ε                                    (3.26)
The above conditions ensure that the eigenvalues of L̃_1 lie in the interval
[λ_k − √2 ε, λ_1 + √2 ε], and that those of L̃_2 lie in the interval
[λ_n − √2 ε, λ_{k+1} + √2 ε]. As we know,

    λ_k − λ_{k+1} = δ > 2√2 ε                                                      (3.27)

so we have

    λ_k − √2 ε > λ_{k+1} + √2 ε                                                    (3.28)

which implies that all of L̃_1's eigenvalues are strictly larger than all of L̃_2's
eigenvalues.                                                                        ¦
Proof of Proposition 1. We can find an invariant subspace of Ã, and its corresponding
orthogonal projection (P_X̃ = X̃ X̃^T). Our aim is to bound ||X̃ − X||_F, as well as
||P_X − P_X̃||_F.
Let M = P^T P; then ||M||_F ≤ ||P||_F² ≤ 2ε²/δ̃² < 1, where δ̃ = δ − √2 ε. Noting that
||X||_F = √k and ||Y P||_F = ||P||_F (since X and Y have orthonormal columns), we have

    ||X̃ − X||_F = ||(X + Y P)(I + P^T P)^{−1/2} − X||_F
                = ||(X + Y P)(I − I + (I + P^T P)^{−1/2}) − X||_F
                = ||X + Y P − (X + Y P)(I − (I + M)^{−1/2}) − X||_F
                ≤ ||Y P||_F + ||(X + Y P)(I − (I + M)^{−1/2})||_F
                = ||P||_F + ||(X + Y P)(I − (I + M)^{−1/2})||_F
                ≤ ||P||_F + (||X||_F + ||Y P||_F) ||I − (I + M)^{−1/2}||_F
                ≤ ||P||_F + (||X||_F + ||P||_F)(2ε²/δ̃²)
                ≤ √2 ε/δ̃ + (√k + √2 ε/δ̃)(2ε²/δ̃²)
                < √2 ε/δ̃ + (√k + 1)(√2 ε/δ̃)
                = (√k + 2) √2 ε/δ̃

According to (pp. 232 of [Stewart and Sun 1990]), we can derive:

    ||P_X − P_X̃||_F ≤ 2√2 ||F_12||_F / (δ − ||F_11||_F − ||F_22||_F)
                    ≤ 2√2 (ε/√2) / (δ − √2 ε)
                    = 2ε / (δ − √2 ε)                                              (3.29)
Proposition 2 Given a symmetric matrix A ∈ R^{n×n} and a symmetric perturbation B, let
à = A + B. Let the eigenvalues of B be ε_1 ≥ ε_2 ≥ · · · ≥ ε_n, and let λ_k and λ̃_k
(k = 1, · · · , n) be the eigenvalues of A and à respectively. Let δ = λ_k − λ_{k+1},
δ̃ = λ̃_k − λ̃_{k+1}, and δ_E = ε_1 − ε_n. Then

    δ ∈ [δ̃ − δ_E, δ̃ + δ_E]                                                       ¦
Proof. From Corollary 4.9 in [Stewart and Sun 1990], we have:

    λ_k ∈ [λ̃_k − ε_1, λ̃_k − ε_n]
    λ_{k+1} ∈ [λ̃_{k+1} − ε_1, λ̃_{k+1} − ε_n]                                     (3.30)

Since δ = λ_k − λ_{k+1},

    δ ≥ (λ̃_k − ε_1) − (λ̃_{k+1} − ε_n) = (λ̃_k − λ̃_{k+1}) − (ε_1 − ε_n) = δ̃ − δ_E
    δ ≤ (λ̃_k − ε_n) − (λ̃_{k+1} − ε_1) = (λ̃_k − λ̃_{k+1}) + (ε_1 − ε_n) = δ̃ + δ_E
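Proposition 2 is easy to check numerically; the sketch below verifies the eigengap interval on a random symmetric matrix and perturbation (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 6, 2
# Random symmetric A and a smaller symmetric perturbation B.
A = rng.normal(size=(n, n)); A = (A + A.T) / 2
B = rng.normal(size=(n, n)) * 0.1; B = (B + B.T) / 2

lam = np.sort(np.linalg.eigvalsh(A))[::-1]          # lambda_1 >= ... >= lambda_n
lam_t = np.sort(np.linalg.eigvalsh(A + B))[::-1]    # eigenvalues of A_tilde
eps = np.sort(np.linalg.eigvalsh(B))[::-1]          # eps_1 >= ... >= eps_n

delta = lam[k - 1] - lam[k]                         # delta = lambda_k - lambda_{k+1}
delta_t = lam_t[k - 1] - lam_t[k]                   # delta_tilde
delta_E = eps[0] - eps[-1]                          # delta_E = eps_1 - eps_n

# Proposition 2: delta lies in [delta_tilde - delta_E, delta_tilde + delta_E].
assert delta_t - delta_E <= delta <= delta_t + delta_E
```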
Theorem 3.1 Given a data set X ∈ R^{m×n} and a perturbation noise set E ∈ R^{m×n}, let
Y = X + E and let X̂ denote the estimate obtained from the spectral-filtering-based
method. We have

    ||X̂ − X||_F ≤ ||Y||_F · 2||B||_F / ((λ̃_k − ||B||_2) − √2 ||B||_F) + ||E P_χ||_F   (3.31)

where B = E^T X + X^T E + E^T E is the derived perturbation on the covariance matrix
A = X^T X.                                                                          ¦
Proof. From Proposition 1, ||P_χ̃ − P_χ||_F ≤ 2ε/(δ − √2 ε), so we have

    ||X̂ − X||_F ≤ ||Y||_F ||P_χ̃ − P_χ||_F + ||E P_χ||_F
                ≤ ||Y||_F · 2ε/(δ − √2 ε) + ||E P_χ||_F

Since the original data are correlated, the rest of the eigenvalues,
λ_{X_{k+1}}, · · · , λ_{X_n}, are close to 0. Therefore δ ≈ λ_{X_k}. From Proposition 2,
we know λ_{X_k} ∈ [λ_{Y_k} − ε_1, λ_{Y_k} − ε_n]. As ||B||_2 = ε_1 (Property 4), we have

    λ_{X_k} ≥ λ_{Y_k} − ε_1 = λ_{Y_k} − ||B||_2
Hence,

    ||X̂ − X||_F ≤ ||Y||_F · 2||B||_F / (δ − √2 ||B||_F) + ||E P_χ||_F
                ≤ ||Y||_F · 2||B||_F / ((λ̃_k − ||B||_2) − √2 ||B||_F) + ||E P_χ||_F
Corollary 1 If the noise is generated by an i.i.d. Gaussian distribution with zero mean
and known variance σ², the upper bound of the reconstruction error can be expressed as

    ||X̂ − X||_F ≤ 2||Y||_F ||E||_F² / ((λ_{Y_k} − ||B||_2) − √2 ||E||_F²)
                  + √(k/n) ||E||_F                               (Strategy 1)      (3.32)

    ||X̂ − X||_F ≤ 2||Y||_F ||E||_F² / (δ̃ − √2 ||E||_F²)
                  + √(k/n) ||E||_F                               (Strategy 2)      (3.33)
Proof. For Strategy 1, k always equals the number of principal components in the original
data set. If the original data are highly correlated, the rest of the eigenvalues,
λ_{X_{k+1}}, · · · , λ_{X_n}, are close to 0. Therefore δ ≈ λ_{X_k}. From Proposition 2,
the eigengap for this strategy can be bounded as

    δ ≈ λ_{X_k} ≥ λ_{Y_k} − ε_1

Hence, as ||B||_2 = ε_1, (3.31) becomes

    ||X̂ − X||_F ≤ ||Y||_F · 2||B||_F / ((λ̃_k − ||B||_2) − √2 ||B||_F) + ||E P_χ||_F

In general, the derived perturbation can be expressed as B = E^T X + X^T E + E^T E. When
the noise and signal are completely independent, this simplifies to B = E^T E. In terms
of the Frobenius norm, we have ||B||_F = ||E^T E||_F ≤ ||E||_F².
When the noise matrix is generated by an i.i.d. Gaussian distribution with zero mean and
known variance σ², the mean squared error of each entry of E P_χ is σ²k/n [Huang, Du and
Chen 2005], and ||E||_F = √(σ² mn). We thus have

    ||E P_χ||_F = √((σ²k/n) mn) = √((k/n) σ² mn) = √((k/n) ||E||_F²) = √(k/n) ||E||_F

Then Equation 3.31 becomes

    ||X̂ − X||_F ≤ 2||Y||_F ||E||_F² / ((λ̃_k − ||B||_2) − √2 ||E||_F²) + √(k/n) ||E||_F

For Strategy 2, when the noise is i.i.d. Gaussian, ε_1 ≈ ε_n for a large population. In
other words, δ_E is close to zero. Hence (3.31) becomes

    ||X̂ − X||_F ≤ 2||Y||_F ||E||_F² / (δ̃ − √2 ||E||_F²) + √(k/n) ||E||_F

When the noise is completely correlated with the data, ||E P_χ||_F ≈ ||E||_F as k
represents the number of principal components (and similarly ||X P_χ||_F ≈ ||X||_F).
Then Equation 3.31 becomes

    ||X̂ − X||_F ≤ 2||Y||_F ||E||_F² / ((λ_{Y_k} − ||B||_2) − √2 ||E||_F²) + ||E||_F   (3.34)
The upper bound given in Theorem 3.1 determines how close the estimate achieved by
attackers can be to the original data when the spectral-filtering-based method is
exploited. This represents a serious threat of privacy breach, as attackers know exactly
how close their estimates are. Please note that ||E||_F and ||B||_2 are assumed to be
available to attackers, as they can easily be computed from the published information
about the noise distribution.
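The bound of Theorem 3.1 can be evaluated numerically alongside the actual reconstruction error. The sketch below uses a hypothetical rank-2 data set with a fixed loading matrix (all names and sizes are illustrative assumptions), computes the right-hand side of Equation (3.31), and checks that it dominates the spectral-filtering error:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma, k = 2000, 5, 0.1, 2
# A fixed loading matrix keeps the eigengap comfortably large in this toy example.
W = np.array([[1.0, 1.0, 0.0,  0.5, 0.2],
              [0.0, 1.0, 1.0, -0.5, 0.3]])
X = rng.normal(size=(m, 2)) @ W               # rank-2 original data
E = rng.normal(scale=sigma, size=(m, n))      # i.i.d. Gaussian noise
Y = X + E

B = E.T @ X + X.T @ E + E.T @ E               # derived perturbation on A = X^T X
lamY = np.sort(np.linalg.eigvalsh(Y.T @ Y))[::-1]

# P_chi: projection onto the top-k eigenvectors of X^T X.
_, QX = np.linalg.eigh(X.T @ X)
P = QX[:, -k:] @ QX[:, -k:].T

normB_F = np.linalg.norm(B, 'fro')
normB_2 = np.linalg.norm(B, 2)
denom = (lamY[k - 1] - normB_2) - np.sqrt(2) * normB_F
bound = np.linalg.norm(Y, 'fro') * 2 * normB_F / denom + np.linalg.norm(E @ P, 'fro')

# Spectral-filtering estimate with the top-k eigenvectors of Y^T Y.
_, QY = np.linalg.eigh(Y.T @ Y)
Qk = QY[:, -k:]
X_hat = Y @ Qk @ Qk.T

assert denom > 0
assert np.linalg.norm(X_hat - X, 'fro') <= bound
```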
3.5 Lower Bound Analysis
Let Y = X + E be a perturbation of X, and let

    L_Y^T Y R_Y = [Σ_Y; 0]

be the Singular Value Decomposition (SVD) of Y. Weyl and Mirsky give basic perturbation
bounds for the singular values of such a matrix.
Theorem 3.2 (Weyl) [Weyl 1911]

    |σ_{Y_i} − σ_{X_i}| ≤ ||E||_2,    i = 1, · · · , n

Theorem 3.3 (Mirsky) [Mirsky 1960]

    ( Σ_i (σ_{Y_i} − σ_{X_i})² )^{1/2} ≤ ||E||_F,    i = 1, · · · , n
Let B be any matrix of rank not greater than k, and denote its singular values by
ψ_1 ≥ · · · ≥ ψ_n. Based on Mirsky's theorem, the sum of squares of the n − k smallest
singular values of X is not greater than ||B − X||_F². This conclusion can be expressed
as:

    ||B − X||_F² ≥ Σ_{i=1}^{n} |ψ_i − σ_{X_i}|² ≥ σ_{X_{k+1}}² + · · · + σ_{X_n}²
                 ≥ ||X_k − X||_F²
Enlightened by these perturbation bounds for the singular values of a matrix, we propose
an SVD-based reconstruction method whose error can be lower bounded. In this research, we
further analyze our SVD-based reconstruction method as well as the Spectral Filtering
technique and prove their equivalence. The lower bound of the reconstruction error for
the Spectral Filtering technique is then derived.
3.5.1 SVD-based Reconstruction Method
Singular Value Decomposition (SVD) decomposes a matrix X ∈ R^{m×n} (say m ≥ n) into the
product of two unitary matrices, L_X ∈ R^{m×m} and R_X ∈ R^{n×n}, and a pseudo-diagonal
matrix D_X = diag(σ_{X_1}, · · · , σ_{X_ρ}) ∈ R^{m×n}, such that

    X = L_X D_X R_X^T    or    X = Σ_{i=1}^{n} σ_{X_i} l_{X_i} r_{X_i}^T

The diagonal elements σ_{X_i} of D_X are referred to as singular values, which are, by
convention, sorted in descending order: σ_{X_1} ≥ σ_{X_2} ≥ · · · ≥ σ_{X_n} ≥ 0. The
columns l_{X_i} and r_{X_i} of L_X and R_X are respectively called the left and right
singular vectors of X. Similarly, let Y = X + E be a perturbation of X, and let
Y = L_Y D_Y R_Y^T be the SVD of Y.

    input:  Y, a given perturbed data set
            E, a noise data set
    output: X̂, a reconstructed data set
    BEGIN
    1  Apply SVD on Y to get Y = L_Y D_Y R_Y^T
    2  Apply SVD on E and assume σ_{E_max} ≈ σ_{E_min} ≈ σ_E
    3  Determine the first k components of Y by k = max{i | σ_{Y_i} ≥ √2 σ_E}.
       Assume σ_{Y_1} ≥ σ_{Y_2} ≥ · · · ≥ σ_{Y_k}, and let l_{Y_i}, r_{Y_i} be the
       corresponding left and right singular vectors
    4  Reconstruct X approximately as X̂ = Y_k = Σ_{i=1}^{k} σ_{Y_i} l_{Y_i} r_{Y_i}^T  (k ≤ ρ)
    END

    Figure 3.5: SVD Based Reconstruction Algorithm
Figure 3.5 shows our SVD-based reconstruction method. Please note that the strategy used
for the SVD-based reconstruction is k = max{i | σ_{Y_i} ≥ √2 σ_E}, where the largest
singular value of the added noise is calculated as σ_E ≈ ||E||_F/√n (step 2 in the
algorithm). For i.i.d. noise, we have ||E||_F = √(mn σ²), where the variance of the noise
is σ² = λ(E^T E)/(m − 1) = ||E||_2²/(m − 1). As we know, the largest singular value of
the noise is ||E||_2. Hence, we have

    σ_E = ||E||_2 = ||E||_F √(m − 1) / √(mn) ≈ ||E||_F / √n
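The algorithm of Figure 3.5 can be sketched as follows, with σ_E estimated from ||E||_F as above. The toy data set and the helper name `svd_reconstruct` are assumptions for illustration, not the dissertation's experimental code:

```python
import numpy as np

def svd_reconstruct(Y, sigma_E):
    """SVD-based reconstruction (a sketch of Figure 3.5).

    Y       : m x n perturbed data (Y = X + E)
    sigma_E : largest singular value of the noise, approximated
              as ||E||_F / sqrt(n) for i.i.d. noise
    """
    L, s, Rt = np.linalg.svd(Y, full_matrices=False)
    k = int(np.sum(s >= np.sqrt(2.0) * sigma_E))   # k = max{i | sigma_{Y_i} >= sqrt(2) sigma_E}
    return (L[:, :k] * s[:k]) @ Rt[:k, :]          # X_hat = sum_{i<=k} sigma_i l_i r_i^T

# Toy example: hypothetical rank-2 data plus i.i.d. Gaussian noise.
rng = np.random.default_rng(2)
m, n, sigma = 2000, 5, 0.5
X = rng.normal(size=(m, 2)) @ rng.normal(size=(2, n))
E = rng.normal(scale=sigma, size=(m, n))
Y = X + E
sigma_E = np.linalg.norm(E, 'fro') / np.sqrt(n)    # sigma_E ~ ||E||_F / sqrt(n)
X_hat = svd_reconstruct(Y, sigma_E)
assert np.linalg.norm(X_hat - X) < np.linalg.norm(E)
```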
3.5.2 Lower Bound
Consider X̂ = Y_k = L_{Y_k} D_{Y_k} R_{Y_k}^T as the estimate of the original data set X.
The estimation error between X̂ and X has the lower bound

    ||X̂ − X||_F ≥ ||X_k − X||_F

where k = max{i | σ_{Y_i} ≥ √2 σ_E}.
The relationship between the reconstruction bias and the perturbation (especially the
lower bound) will, in turn, guide us in adding noise to the original data set. The lower
bound gives data owners a worst-case security assurance, since for any matrix B (with
singular values ψ_i) of rank not greater than k derived by attackers, we have

    ||B − X||_F² ≥ Σ_{i=1}^{n} |ψ_i − σ_{X_i}|² ≥ σ_{X_{k+1}}² + · · · + σ_{X_n}²
                 ≥ ||X_k − X||_F²
In order to preserve privacy, data owners need to make sure that ||X̂ − X||_F/||X||_F is
greater than the privacy threshold τ specified by users. In the following, we answer how
to determine the magnitude of the noise to satisfy a given privacy threshold. Based on
the derived lower bound,

    τ ||X||_F ≤ ||X_k − X||_F = (σ_{X_{k+1}}² + · · · + σ_{X_n}²)^{1/2}

Hence the k which might be chosen by attackers can be determined by

    k = max{i | τ ≤ (σ_{X_{i+1}}² + · · · + σ_{X_n}²)^{1/2} / ||X||_F}             (3.35)

Based on our approximately optimal strategy, λ_{X_i} ≥ λ_E, the data owner should add an
i.i.d. noise E whose eigenvalue λ_E of E^T E satisfies

    λ_{X_{k+1}} < λ_E ≤ λ_{X_k}                                                    (3.36)

Since λ_E is the eigenvalue of E^T E, the variance of the noise can be derived as
Var(E) = λ_E/(m − 1), where m is the number of rows in E.
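A data owner could apply Equations (3.35) and (3.36) roughly as follows. This is only a sketch: the helper name `calibrate_noise` and the toy data are hypothetical, and the choice λ_E = λ_{X_k} is simply one admissible value in (3.36):

```python
import numpy as np

def calibrate_noise(X, tau):
    """Pick k and a noise variance so the lower bound meets threshold tau (a sketch)."""
    m, n = X.shape
    s = np.linalg.svd(X, compute_uv=False)         # sigma_{X_1} >= ... >= sigma_{X_n}
    lam = s ** 2                                   # eigenvalues of X^T X
    tail = np.sqrt(np.cumsum(lam[::-1])[::-1])     # tail[i] = sqrt(sum_{j>=i} sigma_j^2)
    normX = np.linalg.norm(X, 'fro')
    # Equation (3.35): k = max{i | tau <= sqrt(sigma_{i+1}^2 + ... + sigma_n^2)/||X||_F}
    ok = np.where(tau <= tail[1:] / normX)[0]
    k = int(ok[-1]) + 1 if ok.size else 0
    # Equation (3.36): take lambda_E = lambda_{X_k} (one admissible choice).
    lam_E = lam[k - 1] if k >= 1 else lam[0]
    return k, lam_E / (m - 1)                      # Var(E) = lambda_E / (m - 1)

# Toy example: rank-2 data with a small full-rank component.
rng = np.random.default_rng(7)
W = np.array([[1.0, 1.0, 0.0,  0.5, 0.2],
              [0.0, 1.0, 1.0, -0.5, 0.3]])
X = rng.normal(size=(1000, 2)) @ W + 0.01 * rng.normal(size=(1000, 5))
k, noise_var = calibrate_noise(X, 0.1)
assert 1 <= k <= X.shape[1]
assert noise_var > 0
```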
3.5.3 Equivalence of Two Reconstruction Methods
SVD explicitly constructs orthonormal bases for the nullspace and range of a matrix,
X = L_X D_X R_X^T. The non-zero singular values of X are precisely the square roots of
the non-zero eigenvalues of the positive semi-definite matrix X X^T, and likewise the
square roots of the non-zero eigenvalues of X^T X. Furthermore, the columns of L_X are
eigenvectors of X X^T and the columns of R_X are eigenvectors of X^T X.
Theorem 3.4 The reconstructed data from Spectral Filtering is

    X̂_SF = Y P_χ̃ = Y Q_{Y_k} Q_{Y_k}^T    where k = max{i | λ_{Y_i} ≥ 2λ_E}

while the reconstructed data from SVD is

    X̂_SVD = L_{Y_k} D_{Y_k} R_{Y_k}^T    where k = max{i | σ_{Y_i} ≥ √2 σ_E}

We have X̂_SF = X̂_SVD, and the k determined by k = max{i | λ_{Y_i} ≥ 2λ_E} and the k
determined by k = max{i | σ_{Y_i} ≥ √2 σ_E} are exactly the same.

Proof. We first prove that the two methods are equivalent. Since R_{Y_k} = R_Y [I_k; 0]:

    Y R_{Y_k} = Y R_Y [I_k; 0] = (L_Y D_Y R_Y^T) R_Y [I_k; 0] = L_Y D_Y [I_k; 0]
              = L_{Y_k} D_{Y_k}

Since the columns of the right singular vectors (R_Y) are the eigenvectors of Y^T Y, we
have Q_Y = R_Y. Then

    X̂_SF = Y R_{Y_k} R_{Y_k}^T = L_{Y_k} D_{Y_k} R_{Y_k}^T = X̂_SVD
We then prove the equivalence of determining k. Based on the fact that the singular
values of X are the square roots of the eigenvalues of X^T X or X X^T, we have:

    σ_{Y_i} = (λ_i(Y^T Y))^{1/2} = (λ_{Y_i})^{1/2}
    √2 σ_E = (2λ(E^T E))^{1/2} = (2λ_E)^{1/2}

so

    σ_{Y_i} < √2 σ_E ⇐⇒ λ_{Y_i} < 2λ_E

Hence

    max{i | σ_{Y_i} ≥ √2 σ_E} = max{i | λ_{Y_i} ≥ 2λ_E}

3.6 Potential Attack Based on Distribution
In the previous sections, we discussed one kind of approach which generally attempts to
hide the sensitive data by randomly modifying the data values using some additive noise,
while aiming to closely reconstruct the original distribution at an aggregate level. The
aggregate privacy preserved in this model was investigated by exploring the upper and
lower bounds of the estimation bias. However, another challenge is whether the
reconstructed distribution can be exploited by attackers or snoopers to derive sensitive
individual data. This section presents one simple attack using the Inter-Quantile Range
(IQR) on the reconstructed distribution and shows the disclosure of individual privacy
from such aggregate information.
Let us consider a scenario where we have a set of n original data values x_1, · · · , x_n.
Each x_i is associated with one privacy interval [w_i^l, w_i^u], where x_i ∈ [w_i^l, w_i^u].
The privacy interval [w_i^l, w_i^u] represents the privacy requirement of the sensitive
data x_i pre-defined by its owner. Please note that this privacy interval is specified by
the data owner, and the data holder is required to satisfy all individuals' privacy
concerns, although different data owners may have different (possibly even unreasonably
restrictive) privacy concerns about their individual data. In other words, the data owner
disallows attackers or snoopers from deriving or estimating this sensitive data to within
its privacy interval.
Most existing randomization-based approaches add a random number e_i, drawn from some
known distribution, to x_i, the value of a sensitive attribute. The randomized value
y_i = x_i + e_i is then released. To preserve privacy, the variance of the noise
distribution is expected to be large enough that all y_i (or a large fraction of y_i in
practice) satisfy y_i ∉ [w_i^l, w_i^u].
However, the reconstructed distribution F̂_X also provides a certain level of knowledge
which can be exploited by attackers or snoopers to estimate individual values with a
higher level of accuracy. For example, from the reconstructed distribution, snoopers may
learn aggregate information such as "95% of customers from the 28223 zip code with Asian
background have wages in [70k, 80k]"; they can then safely conclude, with a 95%
confidence level, that a customer's wage lies in [70k, 80k] once that customer is
determined to be from this class. If [70k, 80k] happens to lie completely within this
customer's privacy interval, we say disclosure happens.
3.6.1 Quantification of Privacy
The authors of [Agrawal and Agrawal 2001] proposed a metric to measure privacy based on
Shannon's information theory [Shannon 1948; Shannon 1949]. As another approach to
measuring privacy, the theory of coalitional games has been applied to determine the cost
of each piece of information. In [Agrawal and Srikant 2000], privacy is measured in terms
of confidence intervals, and [Rizvi and Haritsa 2002] also suggested its own way of
measuring privacy.
As pointed out earlier, the reconstruction of the data distribution provides a certain
level of knowledge which can be used by attackers or snoopers to estimate a data value
to a higher level of accuracy. In this part, we propose a new measure to quantify privacy
which will also be used to control the disclosure in Chapter 5.
Definition 3.3 Quantile [Conover 1998]. A random variable X is defined by a distribution
function F_X (or a probability density function f_X). The number x_p, for a given value
of p between 0 and 1, is called the p-th quantile of the random variable X if
P(X < x_p) ≤ p and P(X > x_p) ≤ 1 − p, where P(X < x) = F_X(x) = ∫_{−∞}^{x} f_X(ε) dε.
Definition 3.4 Inter-Quantile Range (IQR) [Conover 1998]². The Inter-Quantile Range
[x_{α_1}, x_{α_2}] is defined by P(x_{α_1} ≤ x ≤ x_{α_2}) ≥ c%, where c = α_2 − α_1
denotes the confidence.

² The Inter-Quantile Range is a general case of the Inter-Quartile Range, which only
considers the 1/4, 1/2, 3/4 and 1 quantile points.
The IQR [xα1 , xα2 ] is used to measure the amount of spread and variability of the
random variable. Hence, it can be used by attackers or snoopers to estimate the range
of each individual data, xi , with confidence c = α2 − α1 . For a given c (e.g., 95%),
the range [α1 , α2 ] is not unique. We use [(1 − c)/2, (1 + c)/2] in this study whereas the
corresponding IQR range is [x(1−c)/2 , x(1+c)/2 ].
The authors of [Agrawal and Srikant 2000] use a similar measure that defines privacy as
follows: if the original value can be estimated with confidence c to lie in the interval
[x_α, x_β], then the interval width x_β − x_α defines the amount of privacy at confidence
level c.
Please note that the confidence interval here is different from the classic one defined in
statistics where a confidence interval gives an estimated range of values which is likely to
include an unknown population parameter (e.g., mean value), the estimated range being
calculated from a given set of sample data. If independent samples are taken repeatedly
from the same population, and a confidence interval calculated for each sample, then a
certain percentage of the intervals will include the unknown population parameter. In
our scenario, the confidence interval means coverage range. In other words, if we know
that [xα1 , xα2 ] covers c percentage of data, we can say [xα1 , xα2 ] covers a given data with
c confidence.
If the estimated range [x_{(1−c)/2}, x_{(1+c)/2}] contains the individual's private data
x_i and falls fully within the individual's privacy interval [w_i^l, w_i^u], we say that
this individual's data is fully disclosed by the IQR attack. In general, we use

    d_i = |[w_i^l, w_i^u] ∩ [x_{(1−c)/2}, x_{(1+c)/2}]| / |[w_i^l, w_i^u] ∪ [x_{(1−c)/2}, x_{(1+c)/2}]|   (3.37)

where |·| denotes the length of an interval, to measure how close the IQR obtained by
attackers or snoopers is to the individual's privacy interval. In our experiments, we
compute both the number of individuals with full disclosure and the average disclosure

    D = (Σ_{i=1}^{n} d_i) / n                                                      (3.38)
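The IQR attack and the disclosure measures of Equations (3.37) and (3.38) can be sketched as follows. The wage distribution, the interval widths, and the helper name `iqr_attack` are hypothetical choices for illustration:

```python
import numpy as np

def iqr_attack(samples, privacy_intervals, c=0.95):
    """IQR-based disclosure measure (a sketch).

    samples           : values drawn from the (reconstructed) distribution F_X
    privacy_intervals : list of (w_l, w_u) pairs, one per individual
    Returns the estimated range [x_{(1-c)/2}, x_{(1+c)/2}] and the average
    disclosure D from Equation (3.38), with d_i as in Equation (3.37).
    """
    lo, hi = np.quantile(samples, [(1 - c) / 2, (1 + c) / 2])
    d = []
    for wl, wu in privacy_intervals:
        inter = max(0.0, min(wu, hi) - max(wl, lo))   # length of the intersection
        union = max(wu, hi) - min(wl, lo)             # length of the union
        d.append(inter / union if union > 0 else 0.0)
    return (lo, hi), float(np.mean(d))

# Toy example: wages roughly N(75, 2^2) (in thousands) with narrow privacy intervals.
rng = np.random.default_rng(4)
wages = rng.normal(75.0, 2.0, size=5000)
intervals = [(w - 5.0, w + 5.0) for w in wages[:100]]
(lo, hi), D = iqr_attack(wages, intervals, c=0.95)
assert lo < hi
assert 0.0 <= D <= 1.0
```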
3.6.2 Extension to Multiple Confidential Attributes

When multiple confidential attributes exist, we can extend the IQR for each single
numerical attribute to a confidence region for all attributes together. In practice, the
distribution of multiple numerical attributes is often modeled by one multi-variate
normal distribution, N(µ, Σ), where µ denotes a vector of means and Σ denotes a
covariance matrix. In the p-dimensional space, the confidence region is an ellipsoidal
region given by a probability density contour.
In this part we present some known results from statistics about the density contours of
the multi-variate normal distribution.

Result 1 (Constant probability density contour) ([Johnson and Wichern 1998], page 134)
Let Z be distributed as N_p(µ, Σ) with |Σ| > 0. Then the N_p(µ, Σ) distribution assigns
probability 1 − α to the solid ellipsoid {z : (z − µ)^T Σ^{−1} (z − µ) ≤ χ_p²(α)}, where
χ_p²(α) denotes the upper (100α)-th percentile of the χ² distribution with p degrees of
freedom. The ellipsoid is centered at µ and has axes ±c √λ_i e_i, where c² = χ_p²(α) and
Σ e_i = λ_i e_i, i = 1, · · · , p.
The multi-variate normal density is constant on surfaces where the squared distance
(z − µ)T Σ−1 (z − µ) is constant c2 . The chi-square distribution determines the variability
of the sample variance. Probabilities are represented by volumes under the surface over
regions defined by intervals of the zi values. The axes of each ellipsoid of constant density
are in the direction of the eigenvectors of Σ−1 and their lengths are proportional to the
square roots of the eigenvalues (λi ) of Σ.
Result 2 (Volume of Ellipsoid) ([Grotschel, Lovasz and Schrijver 1988]) The volume of an
ellipsoid {z : (z − µ)^T A^{−1} (z − µ) ≤ 1} determined by a positive definite p × p
matrix A is given by vol(E) = η |A^{1/2}|, where η is the volume of the unit ball in R^p.
Result 3 shows the general result concerning the projection of an ellipsoid onto a line
in a p-dimensional space.
Result 3 (Projection of Ellipsoid) ([Johnson and Wichern 1998], page 203) For a given
vector ℓ ≠ 0 and z belonging to the ellipsoid {z : z^T A^{−1} z ≤ c²} determined by a
positive definite p × p matrix A, the projection (shadow) of {z^T A^{−1} z ≤ c²} on ℓ is

    (c √(ℓ^T A ℓ) / ℓ^T ℓ) ℓ

which extends from 0 along ℓ with length c √(ℓ^T A ℓ / ℓ^T ℓ). When ℓ is a unit vector,
the shadow extends c √(ℓ^T A ℓ) units, so |z^T ℓ| ≤ c √(ℓ^T A ℓ).
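Result 3 can be checked numerically by sampling points on the boundary of an ellipsoid. The matrix A and the confidence constant below are illustrative assumptions (c² ≈ χ_3²(0.05) ≈ 7.81 gives a 95% region):

```python
import numpy as np

rng = np.random.default_rng(5)
p = 3
c = np.sqrt(7.81)                      # c^2 ~ chi^2_3(0.05) for a 95% region
M = rng.normal(size=(p, p))
A = M @ M.T + np.eye(p)                # a positive definite matrix
Ainv = np.linalg.inv(A)

l = rng.normal(size=p)
l /= np.linalg.norm(l)                 # a unit direction vector
bound = c * np.sqrt(l @ A @ l)         # shadow half-length per Result 3

# Boundary points: z = c * A^{1/2} u with ||u|| = 1 satisfy z^T A^{-1} z = c^2.
w, V = np.linalg.eigh(A)
Asqrt = V @ np.diag(np.sqrt(w)) @ V.T
for _ in range(1000):
    u = rng.normal(size=p)
    u /= np.linalg.norm(u)
    z = c * Asqrt @ u
    assert abs(z @ Ainv @ z - c ** 2) < 1e-8   # on the ellipsoid boundary
    assert abs(z @ l) <= bound + 1e-9          # inside the projected shadow
```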
3.7 Evaluation

In our experiments, we use ae(X, X̂) = ||X̂ − X||_F, the absolute error, and
re(X, X̂) = ||X̂ − X||_F/||X||_F, the relative error of X̂ regarded as an approximation
to X.
3.7.1 Scenario of Adding Noise
In [Kargupta et al. 2003], the noise is assumed to follow an i.i.d. Gaussian distribution
with mean zero and known variance (hence the noise is completely uncorrelated with the
data). In this section, we consider different scenarios of noise addition.
• Case 1. E is additive noise following an i.i.d. Gaussian distribution N(0, Σ_E), where
  the covariance matrix is Σ_E = diag(σ², · · · , σ²) (the same as in [Kargupta et al.
  2003]).
• Case 2. E is additive noise following a Gaussian distribution N(0, Σ_E), where the
  covariance matrix is Σ_E = c × diag(σ_1², σ_2², · · · , σ_n²). Here each feature is
  perturbed with a separate Gaussian distribution whose variance is linear in the
  variance of the original data.
• Case 3. E is additive noise following a Gaussian distribution N(0, Σ_E), where the
  covariance matrix is Σ_E = c × Σ_X, and Σ_X is the covariance matrix of the original
  data set. Here the covariance matrix of the noise is linear in that of the original
  data; in other words, the noise is completely correlated with the data.
Case 1 represents the scenario where the noise is completely independent of the original
data. One example of this scenario is the online collection of customers' individual data
(as other customers' data are unknown during data collection). Case 2 represents the
scenario where the variance of the original data is known a priori, while Case 3
represents the scenario where the whole covariance matrix of the original data is used
for noise generation. Note that in all three cases above, we
noise                      E1      E2      E3      E4      E5      E6      E7      E8      E9
||E||F/||X||F            0.628   0.786   0.954   1.178   1.366   1.677   1.944   2.121   2.985
variance                 0.213   0.333   0.491   0.750   1.007   1.524   2.040   2.430   4.814
Type 1 re(X,X̂)  k=1     0.821   0.825   0.830   0.839   0.847   0.863   0.877   0.890   0.960
                k=2     0.649   0.659   0.671   0.692   0.711   0.750   0.783   0.810  *0.956
                k=3     0.440   0.461   0.488   0.529   0.565   0.636   0.694  *0.739   0.964
                k=4     0.297   0.337  *0.383  *0.450  *0.506  *0.607  *0.687   0.748   1.032
                k=5     0.271  *0.324   0.383   0.465   0.532   0.651   0.745   0.816   1.141
                k=6    *†0.260 †0.325  †0.395  †0.489  †0.567  †0.699  †0.805  †0.883  †1.245
                k=7     0.282   0.353   0.428   0.530   0.614   0.757   0.873   0.956   1.348
c                       0.402   0.630   0.927   1.415   1.903   2.864   3.850   4.583   9.080
Type 2 re(X,X̂)  k=1     0.826   0.832   0.841   0.854   0.868   0.897   0.926   0.945  †1.072
                k=2     0.654   0.667   0.684   0.709   0.748   0.819   0.876   0.911   1.125
                k=3     0.452   0.479   0.513   0.564   0.613   0.697  †0.778  †0.830   1.120
                k=4     0.309   0.353  †0.405  †0.479  †0.544  †0.652   0.900   0.967   1.317
                k=5     0.279  †0.345   0.462   0.552   0.631   0.761   1.008   1.085   1.487
                k=6    †0.255   0.391   0.512   0.616   0.706   0.856   1.103   1.190   1.634
                k=7     0.294   0.431   0.558   0.673   0.774   0.939   1.190   1.286   1.777
c                       0.402   0.630   0.927   1.415   1.903   2.864   3.850   4.583   9.080
Type 3 re(X,X̂)  k=1     0.893   0.935   0.989   1.067   1.140   1.276  †1.398  †1.485  †1.926
                k=2     0.800   0.879   0.977   1.117  †1.240  †1.455   1.644   1.769   2.420
                k=3     0.702  †0.824  †0.964  †1.156   1.318   1.593   1.830   1.981   2.779
                k=4    †0.650   0.797   0.961   1.177   1.358   1.659   1.918   2.083   2.943
                k=5     0.636   0.788   0.956   1.177   1.361   1.667   1.930   2.097   2.968
                k=6     0.627   0.783   0.955   1.179   1.366   1.675   1.940   2.109   2.987
                k=7     0.627   0.783   0.955   1.179   1.366   1.675   1.940   2.109   2.987

Table 3.1: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2,
and 3) for the PATTERNS data set. The values with * denote the results following
Strategy 2, while the values with † denote the results following Strategy 1. The bold
values indicate the best estimations achieved by the Spectral Filtering technique.
noise                     E1      E2      E3      E4      E5      E6      E7      E8      E9
||E||F/||X||F           0.172   0.176   0.188   0.218   0.231   0.243   0.266   0.297   0.326
variance                0.005   0.0052  0.006   0.008   0.009   0.01    0.012   0.015   0.018
Type 1      k=1         0.267   0.268   0.269   0.273   0.275   0.276   0.280   0.285   0.290
re(X, X̂)   k=2         0.212   0.213   0.217   0.226   0.230   0.234   0.243   0.254  *0.266
            k=3         0.185   0.186   0.192   0.207  *0.214  *0.221  *0.234  *0.254   0.269
            k=4         0.176   0.178  *0.186  *0.206   0.216   0.225   0.241   0.265   0.286
            k=5         0.173  *0.176   0.186   0.211   0.223   0.233   0.253   0.281   0.306
            k=6       *†0.172  †0.176  †0.188  †0.218  †0.231  †0.243  †0.266  †0.297  †0.326
c                       0.302   0.312   0.362   0.604   1.200   3.028   6.037   12.15   30.26
Type 2      k=1         0.274   0.275   0.276   0.283   0.286   0.289   0.294   0.303   0.312
re(X, X̂)   k=2         0.231   0.233   0.238   0.254   0.260   0.267   0.281   0.299  †0.317
            k=3         0.207   0.210  †0.218  †0.240  †0.249  †0.258  †0.276  †0.300   0.324
            k=4        †0.193  †0.196   0.205   0.231   0.241   0.252   0.272   0.299   0.325
            k=5         0.182   0.186   0.196   0.224   0.235   0.246   0.269   0.297   0.325
            k=6         0.172   0.176   0.187   0.218   0.231   0.242   0.266   0.297   0.326
c                       0.302   0.312   0.362   0.604   1.200   3.028   6.037   12.15   30.26
Type 3      k=1         0.276   0.276   0.279   0.286   0.289   0.292   0.298   0.308   0.317
re(X, X̂)   k=2         0.233   0.235   0.240   0.256   0.263   0.271   0.284   0.302  †0.321
            k=3         0.208   0.210  †0.218  †0.239  †0.249  †0.259  †0.276  †0.300   0.323
            k=4        †0.193  †0.196   0.206   0.230   0.242   0.253   0.272   0.298   0.325
            k=5         0.183   0.186   0.196   0.223   0.236   0.248   0.269   0.297   0.325
            k=6         0.172   0.176   0.188   0.217   0.231   0.243   0.266   0.296   0.326

Table 3.2: The relative error re(X, X̂) vs. varying E under three scenarios (Type 1, 2, and 3) for the ADULT data set.
assume the noise is generated from a Gaussian distribution with zero mean vector. This
assumption is generally true in privacy preserving data mining applications, as a change
of the mean would significantly affect the accuracy of data mining results.
From the discussion in Section 3.4, for case 1 we have ||E||F ≈ √(σ²mn), while for both
case 2 and case 3

    ||E||F ≈ √( c(σ1² + σ2² + · · · + σn²) m )

Hence, we can derive

    c ≈ ||E||²F / ( (σ1² + σ2² + · · · + σn²) m )        (3.39)
In the following experiments, we perturb the original data with different levels of noise,
generated by varying the covariance matrix ΣE, for all three cases (for case 2 and case 3,
we derive the corresponding c from the given kEkF using Equation 3.39). For each perturbed
data set, we use our spectral filtering technique to reconstruct the point-wise data. We
also show how the reconstruction accuracy is affected by varying k. Table 3.1 shows all
experimental results.
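As a rough sketch of this setup (a hypothetical helper on toy data, not the actual PATTERNS/ADULT experiment), the three noise cases can be generated for a target Frobenius norm, with c for cases 2 and 3 derived via Equation 3.39:

```python
import numpy as np

def gen_noise(X, target_fro, case, rng):
    """Generate zero-mean Gaussian noise E (same shape as X, m rows = records)
    with Frobenius norm approximately target_fro, under the three cases."""
    m, n = X.shape
    var_x = X.var(axis=0)                      # per-attribute variances sigma_i^2
    if case == 1:                              # i.i.d. noise: Sigma_E = sigma^2 I
        sigma2 = target_fro**2 / (m * n)       # since ||E||_F ~ sqrt(sigma^2 m n)
        return rng.normal(0.0, np.sqrt(sigma2), size=(m, n))
    c = target_fro**2 / (var_x.sum() * m)      # Equation (3.39)
    if case == 2:                              # diagonal: Var(e_i) = c sigma_i^2
        return rng.normal(0.0, np.sqrt(c * var_x), size=(m, n))
    # case 3: Sigma_E = c * Sigma_X (noise completely correlated with the data)
    cov = np.cov(X, rowvar=False)
    return rng.multivariate_normal(np.zeros(n), c * cov, size=m)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6)) @ rng.standard_normal((6, 6))  # toy correlated data
for case in (1, 2, 3):
    E = gen_noise(X, target_fro=0.3 * np.linalg.norm(X, "fro"), case=case, rng=rng)
    print(case, np.linalg.norm(E, "fro") / np.linalg.norm(X, "fro"))  # each near 0.3
```

The achieved noise-to-signal ratio fluctuates slightly around the target because the noise is random.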
3.7.2
Effect of Varying the Number of Principal Components
In Section 3.3, we presented one heuristic for determining k by examining the eigenvalues
of the covariance matrix of the perturbed data and the eigenvalues of the covariance
matrix of the noise. It is easy to see that different values of k lead to different
reconstruction errors (measured by ||X − X̂||F /||X||F ). From Table 3.1, we can see that
our spectral filtering method can achieve optimal results for relatively small
perturbations under both Case 1 and Case 2 (we will explain why case 3 is different in
Section 3.7.4). Note that the values in bold font highlight the results achieved by our
algorithm while the values with * denote the optimal results.
When we examine the original data, there exist 4 principal components, as the data is
highly correlated among its 35 features. Hence, for relatively small perturbations, the
effects on the remaining 31 components are safely filtered in both case 1 and case 2.
However, when we increase the noise level (i.e., kEkF increases), the noise will tend
to affect the determination of k. This is because the gain from correctly including some
(not very significant) principal component is diminished by the loss due to the noise
included along with that component. Figure 3.6 shows the reconstruction of the original
sinusoidal data (attribute 2) with varying k when σ² = 0.5 under case 1. When we choose
k = 4, the filtering provides an accurate estimate of the individual data, while the
reconstruction accuracy is poor when we choose k = 1.
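The reconstruction step itself can be sketched as follows (a simplified PCA-style approximation of the spectral filtering technique, applied to toy sinusoidal data; the dissertation's exact eigenvalue correction is omitted):

```python
import numpy as np

def spectral_filter(Y, k):
    """Estimate X from perturbed data Y (m records x n attributes) by projecting
    the centered data onto the top-k principal components of Y's covariance."""
    mu = Y.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Y, rowvar=False))  # ascending eigenvalues
    V = vecs[:, -k:]                                      # top-k eigenvectors
    return (Y - mu) @ V @ V.T + mu

def rel_err(X, X_hat):
    return np.linalg.norm(X - X_hat, "fro") / np.linalg.norm(X, "fro")

rng = np.random.default_rng(1)
t = np.linspace(0, 8 * np.pi, 300)
X = np.column_stack([np.sin(t + p) for p in np.linspace(0, 1, 6)])  # rank-2 signal
Y = X + rng.normal(0, 0.5, X.shape)
errs = {k: rel_err(X, spectral_filter(Y, k)) for k in range(1, 7)}
print(errs)  # error is not monotone in k: too few or too many components hurt
```

Here the signal spans a two-dimensional subspace, so k = 2 filters the noise on the remaining components while keeping the signal.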
[Figure: three panels — (a) k = 1, (b) k = 4, (c) k = 5 — each plotting the value of attribute V2 against the instance number (0–300) for the original, perturbed, and estimated data.]
Figure 3.6: Reconstruction accuracy (data distribution for attribute 2) vs. varying k with σ² = 0.5
Table 3.2 shows our experimental results on the relative error re(X, X̂) under the three
scenarios (Type 1, 2, and 3 noises) for the Adult data set. We have observations similar
to those on the Patterns data set. For example, Strategy 2 can always achieve optimal
estimates for the i.i.d. noise (Type 1), while Strategy 1 usually incurs more inaccuracy
since it tends to include all major components (6 in this data set) without considering
the side effect incurred by the inclusion of noise. We can also observe that, in general,
the more noise we add, the greater the reconstruction error. This observation holds
across all three types of noise.
3.7.3
Effect of Varying Noise
In the next experiments, we vary the variance of the added noise from 0.213 (E1) to
4.814 (E9), as shown in Table 3.1. We denote the values with ∗ as the results following
Strategy 2 and the values with † as the results following Strategy 1. For each noise data
set, we also show all the relative reconstruction errors obtained by varying the value of
k. The values in bold font highlight the best results achieved by varying k.
From Table 3.1, we can see that Strategy 2 matches the best results for all perturbations
from E1 to E9, while Strategy 1 suffers when relatively large perturbations are
introduced. The Spectral Filtering with Strategy 1 always includes all 6 principal
components in the projection space across all 9 noise data sets. In contrast, Strategy 2
compares the magnitude of the principal components with the magnitude of the added noise
to determine k. For example, the best k value for noise E4 is 4, as shown in Table 3.1.
The reason is that the magnitude of the last two principal components is not as
significant as that of the noise projected along the corresponding components. Hence, the
gain from including the last two (not very significant) principal components is
diminished by the loss due to the noise projected on those components.
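One simplified reading of Strategy 2 — keep a component only while its eigenvalue in the perturbed covariance clearly exceeds the noise level — can be sketched as follows (the factor-of-two threshold is illustrative, not the dissertation's exact rule):

```python
import numpy as np

def choose_k_strategy2(Y, noise_var):
    """Pick k by comparing each eigenvalue of the perturbed covariance with the
    noise level (i.i.d. noise adds roughly noise_var to every eigenvalue)."""
    vals = np.linalg.eigvalsh(np.cov(Y, rowvar=False))[::-1]  # descending
    keep = vals > 2.0 * noise_var   # keep components well above the noise floor
    return max(int(keep.sum()), 1)

rng = np.random.default_rng(2)
t = np.linspace(0, 8 * np.pi, 300)
X = np.column_stack([np.sin(t + p) for p in np.linspace(0, 1, 6)])  # rank-2 signal
sigma2 = 0.25
Y = X + rng.normal(0, np.sqrt(sigma2), X.shape)
print(choose_k_strategy2(Y, sigma2))  # keeps the two signal components
```

With this spectrum, the two signal eigenvalues sit well above the noise floor and the remaining four sit near it, so the heuristic returns k = 2.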
The quality of the data reconstruction depends upon the relative noise contained in the
perturbed data. As the noise added to the actual values increases, the reconstruction
accuracy decreases. Figure 3.7 shows the point-wise data distributions of the
reconstruction for feature 2 (a sample of 300 data records) as we vary the noise level.
We can see that when the noise-to-signal ratio kEkF /kXkF is 0.628 (the corresponding
variance is σ² = 0.213), the Spectral Filtering technique can achieve a relatively
accurate estimation because the effects due to the noise projection on the remaining 29
components are safely filtered. When we increase the noise-to-signal ratio to 1.366 (the
corresponding noise variance is σ² = 1.007), the reconstruction accuracy decreases, as
shown in Figure 3.7(b).
[Figure: two panels — (a) kEkF /kXkF = 0.628, (b) kEkF /kXkF = 1.366 — each plotting the value of D2 against the instance number (0–300) for the original, perturbed, and estimated data.]
Figure 3.7: Reconstruction accuracy (point-wise data distribution for attribute 2) with best k vs. varying noise magnitude
The reasons are two-fold. First, much larger noise exists in the projection space. Second,
information contained in those principal components excluded from the projection space
is lost, since the large noise tends to affect the determination of k.
3.7.4
Effect of Covariance Matrix of the Noise
From Table 3.1, we can see that the spectral filtering method generally cannot achieve
good results for case 3, where the noise covariance matrix is proportional to the data
covariance matrix. As the noise is not randomly generated, the spectral filtering
technique, which is based on random matrix theory, cannot satisfactorily separate the
noise from the data, since they share the same distribution pattern. Figure 3.8 compares
the reconstruction accuracy for attribute 2 with the same kEkF = 323 under the three
cases. We can see that spectral filtering performs best for the completely random
perturbation (case 1) and worst for the completely correlated perturbation (case 3). We
would point out that noise which cannot be separated from the original data may also
significantly affect the accuracy of data mining.
[Figure: three panels — (a) Case 1, (b) Case 2, (c) Case 3 — each plotting the value of D2 against the instance number (0–300) for the original, perturbed, and estimated data.]
Figure 3.8: Reconstruction accuracy (data distribution for attribute 2) with kEk = 323 under the three cases
3.7.5
Utility
To measure the utility, we apply the universal information loss I(fX , fˆX ). To derive
the density distribution fX (x) of the original data and fˆX (x) of the corresponding
reconstructed data, we equally divide each dimension into 5 bins and compare the
multidimensional histograms based on the frequency information contained in those 5⁶
six-dimensional bins. Table 3.3 shows our results on the utility loss of the
reconstructed Adult data with different levels of Type 1 noise (E1 to E9). From Table 3.3
and Figure 3.9 (where we increase the magnitude of the noise), we can observe that
Strategy 2 always outperforms Strategy 1 in terms of preserving utility. Another
observation is that, in general, the greater the magnitude of the noise, the less utility
we can preserve.
To evaluate how the different types of noise (Type 1, 2, and 3) affect the utility of the
reconstruction, Figure 3.10 shows one result on the relationship between the utility and
the three types of noise. We can observe that the spectral-based reconstruction method
best preserves the utility with Type 1 noise (i.i.d.) while it incurs the largest utility
loss with Type 3 noise (completely correlated). This is because the completely correlated noise
Table 3.3: Utility of Reconstructed Adult Data with Type 1 Noise.

noise                    E1      E2      E3      E4      E5      E6      E7      E8      E9
||E||F/||X||F          0.172   0.176   0.188   0.218   0.231   0.243   0.266   0.297   0.326
Utility loss  k=1      0.137   0.137   0.152   0.309   0.089   0.137   0.306   0.152   0.088
I(fX , fˆX )  k=2      0.292   0.232   0.233   0.078   0.011   0.292   0.152   0.297  *0.156
              k=3      0.261   0.270   0.084   0.051  *0.221  *0.196  *0.143  *0.289   0.226
              k=4      0.259   0.213  *0.082  *0.050   0.125   0.170   0.141   0.125   0.213
              k=5      0.254  *0.208   0.077   0.045   0.116   0.164   0.137   0.119   0.086
              k=6    *†0.515  †0.501  †0.521  †0.512  †0.499  †0.466  †0.466  †0.462  †0.479
[Figure: information loss for Strategy 1 and Strategy 2 vs. the strength of the noise ||E||F/||X||F.]
Figure 3.9: Utility vs. varying noise with Type 1
cannot be well filtered out by the Spectral based reconstruction method although some
statistical properties (e.g., the covariance matrix) can be fully preserved.
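The histogram comparison described above can be sketched as follows; since the exact definition of the universal information loss I(fX , fˆX ) appears earlier in the dissertation, half the L1 distance between the two binned densities is used here as an illustrative stand-in:

```python
import numpy as np

def info_loss(X, X_hat, bins=5):
    """Bin each dimension into `bins` equal-width bins over the range of the
    original data, then compare the multidimensional (bins^n) histograms."""
    edges = [np.linspace(X[:, j].min(), X[:, j].max(), bins + 1)
             for j in range(X.shape[1])]
    lows = [e[0] for e in edges]
    highs = [e[-1] for e in edges]
    f, _ = np.histogramdd(X, bins=edges)
    f_hat, _ = np.histogramdd(np.clip(X_hat, lows, highs), bins=edges)
    f = f / f.sum()
    f_hat = f_hat / f_hat.sum()
    return 0.5 * np.abs(f - f_hat).sum()       # value in [0, 1]

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 6))
print(info_loss(X, X))                                 # identical data -> 0.0
print(info_loss(X, X + rng.normal(0, 0.5, X.shape)))   # positive loss
```

With 5 bins per dimension and 6 attributes this compares 5⁶ bins, as in the experiment above.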
3.7.6
Lower Bound vs. Privacy Threshold
This experiment illustrates how to add noise to satisfy a user's privacy threshold using
the lower bound derived in Section 3.5. In order to preserve privacy, data owners need
to make sure that ||X̂ − X||F /||X||F is greater than the privacy threshold τ specified by
users. Figure 3.11(a) shows the relationship between the expected privacy threshold
τ and the magnitude of noise that needs to be added. In general, the magnitude of noise
needs to be increased gradually as the privacy threshold τ increases. However,
the relationship is not linear. For example, the relative noise strength ||E||F /||X||F
[Figure: information loss vs. the strength of the noise ||E||F/||X||F, with Type 1, Type 2, and Type 3 noise.]
Figure 3.10: Utility vs. varying noises of three types
remains unchanged when the expected privacy level is in the range (0.2, 0.4). As we
can recall from Section 3.3, the variance of the added noise is Var(E) = λE /(m − 1),
where the eigenvalue λE should satisfy λXk+1 < λE ≤ λXk and k is determined by

    k = max{ i | τ ≤ (σ²Xi+1 + · · · + σ²Xn )/||X||F }

In other words, the expected privacy
level relies on the determination of k, which is influenced by the magnitude of noise
added. Although the relationship between the reconstruction bias and perturbation can
guide us to add noise into the original data set, the lower bound gives data owners the
worst case security assurance since it is bounded by any matrix B of rank no greater
than k derived by attackers. Figure 3.11(b) shows the relationship between the privacy
threshold specified by data owners and the real relative errors achieved by attackers using
both Strategy 1 and 2. For example, when the privacy threshold is specified as 0.4, the
real relative errors achieved by attackers using Strategy 1 and 2 are around 70% and 82%
respectively.
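The owner-side calibration above can be written roughly as follows (with the dissertation's expression for k used verbatim and λE chosen at the upper end of its admissible interval):

```python
import numpy as np

def calibrate_noise(X, tau):
    """Pick k = max{i | tau <= (lambda_{i+1}+...+lambda_n)/||X||_F}, then a
    noise eigenvalue lambda_E in (lambda_{k+1}, lambda_k]; Var(E) = lambda_E/(m-1)."""
    m, n = X.shape
    lam = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]  # descending
    fro = np.linalg.norm(X, "fro")
    ks = [i for i in range(1, n) if tau <= lam[i:].sum() / fro]
    k = max(ks) if ks else 1
    return k, lam[k - 1] / (m - 1)   # upper end of the admissible interval

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 6)) @ np.diag([3.0, 2.0, 1.0, 0.3, 0.2, 0.1])
k, var_e = calibrate_noise(X, tau=0.01)
print(k, var_e)   # a larger tau forces a smaller k, i.e. stronger noise
```

This mirrors the observation above that the expected privacy level depends on k, which in turn depends on the noise magnitude.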
3.7.7
Evaluation of IQR Attack
In this section, we explain the breach of individual privacy via the information mined
from the perturbed data. To depict such a breach, we use the measures defined in the
[Figure: panel (a) plots the strength of the noise ||E||F/||X||F against the privacy threshold (Expected RE); panel (b) plots the achieved relative error (100%) for Strategy 1 and Strategy 2 against the privacy threshold (Achieved RE).]
Figure 3.11: Achieved Reconstruction accuracy vs. varying privacy threshold τ
Table 3.4: Stock/bonds from Bank data set with Uniform noise [-125,125], disclosure with 95% IQR, information loss for AS is 14.6%

                      no. of disclosed points (100%)               D
Interval p %   direct   IQR ideal   IQR with AS        ideal      AS
35             13.9     21.2        3.5                0.605      0.663
40             16.0     32.5        15.1               0.66       0.698
45             17.9     43.0        29.6               0.712      0.746
50             19.8     52.9        41.8               0.763      0.796
55             22.0     62.9        53.2               0.814      0.844
60             23.9     72.9        63.4               0.864      0.889
65             26.0     83.3        73.5               0.916      0.932
70             28.0     94.3        83.7               0.972      0.977
75             29.9     99.9        94.5               0.999      0.999
80             32.0     100         100                1          1
section 3.6.1, which are based on the estimated distribution of the original data. In our
experiments, the perturbed data is generated using both Uniform and Gaussian
distributions. For the uniform distribution, the random variable is generated from the
range [−α, α] with mean 0. For the Gaussian distribution, the random variable is
generated with zero mean and varying standard deviations. Please note that the
spectral-filtering-based method (SF) only works with the Gaussian distribution while
Agrawal and Srikant's method (AS) generally works with any distribution.
In Figure 3.12, we show a case in which we reconstruct the stock/bonds distribution from
the bank data set with the help of the AS algorithm. The perturbing distribution is the
Uniform distribution [-125, 125]. Figures 3.12(a) and 3.12(b) show the density
distributions of the
[Figure: (a) density (%) of the original data vs. data value; (b) density (%) of the estimated data vs. data value; (c) densities of the original, perturbed, and reconstructed data.]
Figure 3.12: Reconstructed stock/bonds from bank data set using the AS algorithm. The noise is Uniform distribution [-125,125].
original data and of the reconstructed data using the AS method, respectively.
Figure 3.13 shows how the number of fully disclosed points varies with the individual's
privacy interval. Since we do not have a real privacy interval for each individual record
in the original data set, we generate a privacy interval [xi (1 − P ), xi (1 + P )] for
each value xi of the Stock/Bond column in the Bank data set. We experiment with many P
values ranging from 35% to 80%. The confidence threshold for the IQR is taken to be 95%
in all our experiments.
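One reading of this disclosure test can be sketched as follows (hypothetical helper; the attacker's 95% interval is taken as the 2.5%–97.5% quantile range of the reconstructed sample, and a point counts as fully disclosed when that interval lies inside the individual's privacy interval):

```python
import numpy as np

def iqr_disclosure(x, x_recon, P, conf=0.95):
    """Fraction of individuals 'fully disclosed': the attacker's conf-level
    interval of the reconstructed distribution falls inside [x_i(1-P), x_i(1+P)]."""
    lo, hi = np.quantile(x_recon, [(1 - conf) / 2, (1 + conf) / 2])
    wl, wu = x * (1 - P), x * (1 + P)
    return float(((wl <= lo) & (hi <= wu)).mean())

rng = np.random.default_rng(5)
x = rng.uniform(50.0, 150.0, 1000)            # stand-in for the stock/bonds column
x_recon = x + rng.normal(0.0, 5.0, x.size)    # a closely reconstructed sample
for P in (0.35, 0.5, 0.8):
    print(P, iqr_disclosure(x, x_recon, P))   # disclosure grows with P
```

Since widening the privacy interval (larger P) only weakens the coverage condition, the disclosed fraction is non-decreasing in P.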
Recall that an individual value is said to be fully disclosed if it can be estimated with
95% confidence that the IQR interval [x1 , x2 ] covers the original value xi and its
privacy interval [wil , wiu ]. In this figure, the curve labeled Direct shows, for each
P , the percentage of points whose perturbed value lies completely in the individual's
privacy interval. We can see that it is less than 0.3 for all P values. In other words,
[Figure: number of completely disclosed points (100%) vs. interval size P (%), for Direct, IQR Ideal, and IQR with AS.]
Figure 3.13: Disclosure of Bank distribution with Uniform noise (AS algorithm)
Table 3.5: Sinusoidal with Gaussian noise (0,8) using AS and SF methods

                         no. of disclosed points (100%)                       avg. D
Interval p %   direct   IQR ideal   IQR with AS   IQR with SF    ideal      AS       SF
60             46.6     69.3        62.2          6              0.838      0.845    0.699
65             50.1     72.2        65.0          22.8           0.855      0.856    0.735
70             53.3     72.4        67.7          61.5           0.871      0.866    0.875
75             56.4     75.3        70.5          64.1           0.888      0.877    0.882
80             59.2     78.4        73.2          66.7           0.906      0.889    0.889
85             62.1     81.6        76.0          69.2           0.926      0.900    0.897
90             64.6     90.2        78.9          71.8           0.950      0.913    0.905
95             66.9     100         82.1          74.4           1          0.926    0.913
100            69.2     100         85.6          77.0           1          0.94     0.922
perturbation seems to successfully preserve the other 70% of individuals' privacy.
However, the number of fully disclosed points is actually much larger when we apply IQR
inference. For example, when P = 0.6, 63.4% of the individual data is fully disclosed
using IQR. In comparison, only 23.9% of the individuals' perturbed data lies in the
original privacy interval. In other words, 39.5% more of the individuals' data was fully
disclosed using IQR inference. In this figure, we also show the number of fully disclosed
points when we apply IQR inference on the original distribution. This represents the
ideal case where the original distribution can be reconstructed with 100% accuracy. We
can see that the more accurate the reconstructed distribution, the more individuals are
fully disclosed.
[Figure: number of completely disclosed points (100%) vs. interval size P (%), for Direct, IQR Ideal, IQR with AS, and IQR with SF.]
Figure 3.14: Disclosure analysis on Sinusoidal with Gaussian noise (0,8) using AS and SF methods
Table 3.4 shows details on the number of disclosed points (direct, IQR ideal, IQR using
AS) with varied individual privacy interval size (determined by P ). We can see that the
number of fully disclosed points in all three cases increases when P increases. Table 3.4
also shows how the average disclosure D varies with P . Recall that D in general measures
how close the IQR obtained by attackers or snoopers is to the individual's privacy
interval. We can see that D increases when P increases, which indicates that more of the
individuals' private information is disclosed. The information loss I in terms of
distribution incurred by the AS algorithm is 14.6%, which suggests the AS algorithm can
closely reconstruct the original data.
Since the spectral filtering based method only works with the Gaussian distribution in
principle, we use the second data set and introduce a relatively large Gaussian noise in
this experiment. Figure 3.14 and Table 3.5 show our results on the Sinusoidal signal. We
perturbed the data using Gaussian noise with mean 0 and standard deviation 8. The IQRs
obtained from AS, SF, and the original distribution are [1.6, 4.2], [1.2, 4.5], and
[2.05, 3.9] respectively. We can see that IQR inference on the reconstructed
distributions (AS and SF) can disclose more individuals' private information than that
falsely assumed by perturbation methods. We
[Figure: panel (a) plots the number of fully disclosed points (100%), for Direct and IQR with AS, against the noise level (150–400); panel (b) plots the average disclosure D against the noise level.]
Figure 3.15: Stock/Bonds of Bank data set perturbed using Uniform distribution
also show the percentage of fully disclosed points and D under the ideal case (where the
exact distribution is assumed to be reconstructed), the AS case, and the SF case. This
experiment shows that the SF method generally does not do as well as AS when the noise
level is large. However, even IQR inference based on distributions reconstructed by SF
can disclose more individual information than that falsely assumed by data owners. The
information loss in terms of distribution incurred by AS and SF is 32.9% and 47.0%
respectively. Since the SF method is claimed to be able to reconstruct individual data,
the corresponding levels of information loss in terms of individual data are
ae = kX − X̃kF = 200.2 and re = kX − X̃kF /kXkF = 0.375 respectively.
Figure 3.15 shows how disclosure varies with increasing noise for the stock/bonds
variable in the bank data set. We perturb this data set using various uniform
distributions with ranges from 150 to 400. In Figure 3.15(a), we plot the number of fully
disclosed points (using direct and IQR with AS) against the increasing deviation of the
perturbing distribution, while in Figure 3.15(b) we plot the average disclosure D against
the increasing deviation of the perturbing distribution. We can see that the greater the
perturbation, the greater the loss of information. The number of fully disclosed points
using IQR decreases as the information loss increases.
3.8
Summary
Additive Randomization has been a primary tool to hide sensitive private information
during privacy preserving data mining. Previous work based on the Spectral Filtering
technique empirically showed that individual data can be separated from the perturbed
data, and as a result privacy can be seriously compromised. In this chapter we conducted
a theoretical study of how the estimation error varies with the additive noise. In
particular, before conducting our bound analysis, we proposed a new strategy to determine
the principal components in the spectral-filtering-based method. Our strategy is proven
to be more effective than the previous one in terms of reconstruction error. To bound the
reconstruction error, we first derived an upper bound for the Frobenius norm of the
reconstruction error using matrix perturbation theory. This upper bound may be exploited
by attackers to determine how close their estimates are to the original data when using
spectral filtering based techniques, which imposes a serious threat of privacy breaches.
We then proposed a Singular Value Decomposition (SVD) based reconstruction method and
derived a lower bound for the reconstruction error. We then proved the equivalence
between the Spectral Filtering based approach and the proposed SVD approach, and as a
result the achieved lower bound can also be considered a lower bound for the Spectral
Filtering based approach. This lower bound can help data owners determine how
much noise should be added to satisfy one given threshold of tolerated privacy breach.
In this chapter, we also discussed how a possible IQR-based attack can threaten data
providers' privacy. A scenario where each data provider has specified his/her own privacy
interval was considered, and corresponding privacy-quantification methods were defined.
Our experimental results showed the impact on individual privacy of using different
reconstruction methods.
CHAPTER 4: DISCLOSURE ANALYSIS OF THE PROJECTION-BASED
PERTURBATION
In this chapter, our focus is on the projection-based approaches, which can be further
classified as distance-preserving-based and non-distance-preserving-based.
Distance-preserving-projection-based perturbation can mitigate the privacy/accuracy
tradeoff by achieving perfect data mining accuracy. Since the transformation matrix R is
required to be orthonormal (i.e., RRT = RT R = I), geometric properties (vector length,
inner products and distances between pairs of vectors) are strictly preserved. Hence,
data mining results on the rotated data can achieve perfect accuracy. One
known-sample-based PCA attack was recently investigated to show the vulnerabilities of
this distance-preserving-based projection approach when a sample data set is available to
attackers [Liu, Giannella and Kargupta 2006]. As a result, non-distance-preserving-based
projection was suggested to be applied, since it is resilient to the known-sample-based
PCA attack at the sacrifice of some data mining accuracy.
However, one important issue is whether this approach is also subject to other specific
attacks. Intuitively, one might think that Independent Component Analysis (ICA) could be
applied to breach the privacy. It was argued in [Chen and Liu 2005; Liu, Kargupta and
Ryan 2006] that ICA is in general not effective in practice due to two basic difficulties
in applying ICA directly to the projection-based perturbation. First, there are usually
significant correlations among the attributes of X. Second, more than one attribute may
have a Gaussian distribution.
To explore the vulnerability of this approach, we proposed an A-priori-Knowledge
ICA (AK-ICA) reconstruction method [Guo and Wu 2007], which may be exploited by attackers
when a small subset of sample data is available to them. Theoretical analysis and
empirical evaluation show that AK-ICA can effectively recover the original data with high
precision when a part of the sample data is a-priori known by attackers. Since the
proposed technique is quite robust to additive Gaussian noise and to the transformation
matrix, even with a small subset of sample data, it poses a serious concern for all
previous randomization-based privacy preserving data mining methods. It suggests that all
previous projection-based approaches may no longer be secure when a part of the sample
data is a-priori known by attackers.
The rest of this chapter is organized as follows. In Section 4.1 we introduce various
projection-based perturbation models, which can be classified as
distance-preserving-based projection and non-distance-preserving-based projection. To
integrate all the existing projection models, a general-linear-transformation-based
perturbation model is also introduced in this section. Potential attacks on
projection-based perturbations are discussed in Section 4.2 and Section 4.3. In Section
4.2, we discuss the direct ICA attack and its drawbacks. A series of known-sample-based
attacks is introduced in Section 4.3. Our proposed attack, AK-ICA, as one of the
effective attacks, is emphasized in this part. We also provide experimental results in
Section 4.4 to show the performance of AK-ICA and compare it with other attacking
methods. We offer our concluding remarks in Section 4.5.
4.1
Projection-Based Perturbation Models
This section offers an overview of projection-based perturbation. We divide the various
forms of projections into two categories: distance-preserving-based projection and
non-distance-preserving-based projection. We fit all existing projection-based
perturbations into these categories. Furthermore, a general-linear-transformation-based
perturbation model is proposed to incorporate all of the above models.
4.1.1
Distance-Preserving-Based Projection
In [Chen and Liu 2005], the authors defined a Rotation-Based Perturbation Model, i.e.,
Y = RX, where R is a d × d orthonormal matrix satisfying RT R = RRT = I.
Example 2 Following the previous example, we still use matrix X to express the original
data. The only difference here is that each column of X represents one customer's record.
The transformation matrix R in this example is a 3 × 3 random orthonormal matrix. Then we
get the perturbed data Y , whose individual record in each column is quite different from
the one in the original data, so the privacy is expected to be well preserved.
Y = RX

      ⎡  0.333   0.667   0.667 ⎤ ⎡ 10  15   50   45  ...  80 ⎤
    = ⎢ −0.667   0.667  −0.333 ⎥ ⎢ 85  70  120   23  ... 110 ⎥
      ⎣ −0.667  −0.333   0.667 ⎦ ⎣  2  18   35  134  ...  15 ⎦

      ⎡  61.33   63.67  110.00  119.67  ...  63.33 ⎤
    = ⎢  49.33   30.67   55.00  −59.33  ... −31.67 ⎥
      ⎣ −33.67  −21.33  −30.00   51.67  ... −51.67 ⎦
The key features of the rotation transformation are that it preserves vector length,
Euclidean distance and inner products between any pair of points, as shown below:

    |Rx| = |x|
    |R(xi − xj )| = |xi − xj |
    < Rxi , Rxj > = < xi , xj >          (4.1)

where |x| = √(xT x) represents the length of a vector x while < xi , xj > = xTi xj
represents the inner product of two vectors xi and xj .
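These invariants are easy to check numerically (sampling a random orthonormal R via a QR decomposition, one common construction):

```python
import numpy as np

rng = np.random.default_rng(6)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthonormal matrix
assert np.allclose(R.T @ R, np.eye(3))

xi, xj = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(np.linalg.norm(R @ xi), np.linalg.norm(xi))            # |Rx| = |x|
assert np.isclose(np.linalg.norm(R @ (xi - xj)), np.linalg.norm(xi - xj))
assert np.isclose((R @ xi) @ (R @ xj), xi @ xj)                # <Rxi,Rxj> = <xi,xj>
print("all rotation invariants hold")
```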
Intuitively, geometric patterns such as cluster shapes, hyperplanes and hyper-curved
surfaces in the multidimensional space will therefore be preserved. This can be seen from
the following example.
Example 3 Figure 4.1 demonstrates a perturbation on a data set with two dimensions.
From the original data set, two clusters can be clearly identified. After the projection
with an orthonormal matrix

    R = ⎡  0.866  0.5   ⎤
        ⎣ −0.5    0.866 ⎦

those two clusters can still be easily identified due to the preserved properties.
[Figure: two scatter plots — (a) before the rotation (points A, B in the X1–X2 plane) and (b) after the rotation (points A′, B′ in the Y1–Y2 plane) — each showing two clearly separated clusters.]
Figure 4.1: Example of rotation-based perturbation
It was proved in [Chen and Liu 2005] that three popular classifiers (kernel method,
SVM, and hyperplane-based classifiers) are invariant to the rotation-based perturbation
due to the preserved geometric properties.
Authors in [Oliveira and Zaiane 2004] defined a Rotation-Based Data Perturbation Function
that distorts the attribute values of a given data matrix to preserve the privacy of
individuals. In their approach, the attributes are processed pair by pair. For each pair
of selected attributes, the transformation matrix Rp is a 2 × 2 orthonormal matrix of the
form

    Rp = ⎡  cos θ  sin θ ⎤
         ⎣ −sin θ  cos θ ⎦

Their perturbation scheme can be expressed as Y = RX, where R is a d × d matrix with each
row or column having only two non-zero elements, which represent the elements of the
corresponding Rp .
        ⎡  cos θ1      0       sin θ1      0      ... ⎤
        ⎢   ...       ...       ...       ...     ... ⎥
        ⎢    0     − sin θ2      0      cos θ2    ... ⎥
    R = ⎢   ...       ...       ...       ...     ... ⎥
        ⎢ − sin θ1     0       cos θ1      0      ... ⎥
        ⎢   ...       ...       ...       ...     ... ⎥
        ⎣    0      cos θ2       0      sin θ2    ... ⎦
It is easy to see that the perturbation matrix R here is an orthonormal matrix when there
is an even number of attributes. When there is an odd number of attributes, according to
their scheme, the remaining attribute is distorted along with any previously distorted
attribute, as long as a certain condition is satisfied.
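The pairwise scheme can be sketched as follows (hypothetical helper; when every attribute appears in exactly one pair, the resulting R is orthonormal and each row and column has only two non-zero elements):

```python
import numpy as np

def pairwise_rotation(d, pairs_with_angles):
    """Build a d x d matrix that rotates each attribute pair (i, j) by theta."""
    R = np.eye(d)
    for (i, j), theta in pairs_with_angles:
        c, s = np.cos(theta), np.sin(theta)
        P = np.eye(d)
        P[i, i], P[i, j] = c, s
        P[j, i], P[j, j] = -s, c
        R = P @ R
    return R

R = pairwise_rotation(4, [((0, 2), 0.7), ((1, 3), 1.9)])
print(np.allclose(R.T @ R, np.eye(4)))   # True: R is orthonormal
```

Composing rotations over disjoint pairs leaves exactly two non-zero entries per row and column, matching the structure described above.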
4.1.2
Non-Distance-Preserving-Based Projection
In the distance-preserving-based projection model, the transformation matrix R has to be
orthonormal in order to preserve distance and other geometric properties. In
non-distance-preserving-based projection, however, no such constraint is imposed on the
transformation matrix. Without loss of generality, we consider R an arbitrary random
matrix.
Example 4 The original data still come from Table 1.1. Different from Example 2, each
entry of the transformation matrix is randomly generated from a uniform distribution.

Y = RX

      ⎡ 4.751  2.429  2.282 ⎤ ⎡ 10  15   50   45  ...  80 ⎤
    = ⎢ 1.156  4.457  0.093 ⎥ ⎢ 85  70  120   23  ... 110 ⎥
      ⎣ 3.034  3.811  4.107 ⎦ ⎣  2  18   35  134  ...  15 ⎦

      ⎡  43.63   63.25  167.65  204.83  ...  220.68 ⎤
    = ⎢ 158.16  178.32  421.95  375.55  ...  526.70 ⎥
      ⎣ 322.43  307.81  570.05  414.17  ...  536.23 ⎦
Since the transformation matrix does not have to be orthonormal, distances as well as
other geometric properties might no longer be preserved. For the first customer, the
vector length of his/her original data is 85.61, but the vector length of the
corresponding perturbed record changes to 361.77.
Authors in [Liu, Kargupta and Ryan 2006] proposed a Random Projection-Based Perturbation
Model and applied it to privacy preserving distributed data mining. The random matrix
Rd×d is generated such that each entry ri,j of R is independently and identically chosen
from some normal distribution with mean zero and variance σr². Thus, the following
property of the rotation matrix is achieved:

    E[RT R] = dσr² I

If two data sets X1 and X2 are perturbed as Y1 = (1/(√d σr ))RX1 and
Y2 = (1/(√d σr ))RX2 respectively, then the inner product of the original data sets will
be preserved from the statistical point of view:

    E[Y1T Y2 ] = X1T X2
Such model can also be extended to the case where R is a k × m transformation matrix.
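The statistical (rather than exact) nature of this preservation can be seen empirically. The sketch below, with assumed dimensions and data (not the paper's), averages Y1ᵀY2 over many independent draws of R and compares the average with X1ᵀX2.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma_r = 50, 3, 2.0

X1 = rng.standard_normal((d, n))
X2 = rng.standard_normal((d, n))
exact = X1.T @ X2

# Average Y1^T Y2 over many independently drawn projection matrices R.
trials = 2000
acc = np.zeros_like(exact)
for _ in range(trials):
    R = rng.normal(0.0, sigma_r, size=(d, d))
    Y1 = R @ X1 / (np.sqrt(d) * sigma_r)
    Y2 = R @ X2 / (np.sqrt(d) * sigma_r)
    acc += Y1.T @ Y2
approx = acc / trials

# The empirical mean approaches X1^T X2; any single draw does not.
print(np.max(np.abs(approx - exact)))
```

For a single draw of R the deviation is large; only the expectation matches, which is exactly what E[Y1ᵀY2] = X1ᵀX2 asserts.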
To avoid the weakness of the rotation-based projection, authors in [Chen and Liu 2007] gave an enhanced geometric perturbation model:

Y = RX + Ψ + ∆

In this model, two additional components are added: a random translation matrix Ψ and a noise matrix ∆. Ψ is defined as Ψ = t1ᵀ, where t = [t1, t2, · · · , td]ᵀ, 0 ≤ ti < 1, and 1 = [1, 1, · · · , 1]ᵀ. ∆ is defined as ∆ = [δ1, δ2, · · · , δN], where each δi is a d-dimensional i.i.d. Gaussian vector.
Because of these two additional components, the rotation center is shifted, and geometric properties such as pair-wise distance, vector length and inner product might not be preserved. Authors in [Chen and Liu 2007] also investigated possible attacks on this model and analyzed a linear-regression-based attack for the case when attackers know some points in the original data, as well as their correct mapping in the perturbed data.
4.1.3
The General-Linear-Transformation-Based Perturbation
The general-linear-transformation-based perturbation model can be described by
Y = RX + E          (4.2)
where X ∈ Rp×n is the original data set consisting of n data records and p attributes, and Y ∈ Rq×n is the transformed data set consisting of n data records and q attributes. R is a q × p rotation matrix, while E ∈ Rq×n is a q × n noise matrix. In this chapter, we shall assume for convenience that R is a square matrix with dimension d (q = p = d). We also assume the additive noise E is independent of the data X. This assumption holds for almost all existing additive-noise-based perturbation methods. The only exception is that Huang et al. in [Huang, Du and Chen 2005] proposed a modified random perturbation, in which random noises are correlated with the original data, in order to defeat PCA-based reconstruction methods. However, we argue that adding correlated noise to the original data may significantly affect the accuracy of data mining results.
Example 5 The transformation matrix R is chosen randomly as in Example 4. In addition, we introduce random noise E after the projection to further perturb the data.
Y = RX + E

  = [ 4.751  2.429  2.282 ]   [ 10  15  50  45 ...  80 ]   [ 7.334  4.199  9.199  6.208 ...  9.048 ]
    [ 1.156  4.457  0.093 ] x [ 85  70 120  23 ... 110 ] + [ 3.759  7.537  8.447  7.313 ... 15.692 ]
    [ 3.034  3.811  4.107 ]   [  2  18  35 134 ...  15 ]   [ 0.099  7.939  3.678  1.939 ...  6.318 ]

  = [ 265.87 286.57 618.10 581.66 ... 690.55 ]
    [ 394.35 338.54 604.34 174.31 ... 599.84 ]
    [ 362.59 394.15 756.44 776.46 ... 729.85 ]
The goal of this general-linear-transformation-based perturbation is to release Y for data mining while preventing attackers from deriving X. It combines the projection-based and additive-noise-based approaches, and all previous perturbation methods are special cases of this general linear transformation model.
4.2
Direct Attack
Intuitively, one might think that Independent Component Analysis (ICA) could be applied to breach the privacy. When the original data is a collection of independent signals, and at most one of them is Gaussian distributed, ICA can be used directly to break the privacy.
4.2.1
ICA Revisited
ICA is a statistical technique which aims to represent a set of random variables as
linear combinations of statistically independent component variables.
Definition 4.5 (ICA model) [Hyvarinen, Karhunen and Oja 2001]
ICA of a random vector x = (x1, · · · , xm)ᵀ consists of estimating the following generative model for the data:
x = As
or
X = AS
where the latent variables (components) si in the vector s = (s1 , · · · , sn )T are assumed
independent. The matrix A is a constant m × n mixing matrix.
The basic problem of ICA is to estimate both the mixing matrix A and the realizations of the independent components si using only observations of the mixtures xj . The
following three restrictions guarantee identifiability in the ICA model.
1. All the independent components si , with the possible exception of one component,
must be non-Gaussian.
2. The number of observed linear mixtures m must be at least as large as the number
of independent components n.
3. The matrix A must be of full column rank.
The second restriction, m ≥ n, is not completely necessary. Even in the case where m < n, the mixing matrix A is identifiable, whereas the realizations of the independent components are not, because of the noninvertibility of A. In this chapter, we make the conventional assumption that the dimension of the observed data equals the number of independent components, i.e., n = m = d. Note that if m > n, the dimension of the observed vector can always be reduced so that m = n by existing methods such as PCA.
The couple (A, S) is called a representation of X. Since X = AS = (AΛP)(P⁻¹Λ⁻¹S) for any diagonal matrix Λ (with nonzero diagonal) and permutation matrix P, X can never have a completely unique representation.
The reason is that, both S and A being unknown, any scalar multiplier in one of the sources si could always be canceled by dividing the corresponding column ai of A by the same scalar. As a consequence, one usually fixes the magnitudes of the independent components by assuming each si has unit variance; the matrix A is then adapted in the ICA solution methods to take this restriction into account. However, this still leaves the ambiguity of the sign: we could multiply an independent component by −1 without affecting the model. This ambiguity is insignificant in most applications.
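The scaling and permutation ambiguity can be verified directly; the sketch below (with arbitrary synthetic sources) constructs a second, equally valid representation (A2, S2) of the same X.

```python
import numpy as np

rng = np.random.default_rng(2)

A = rng.standard_normal((4, 4))          # mixing matrix
S = rng.laplace(size=(4, 1000))          # non-Gaussian independent sources
X = A @ S

# Any diagonal Lam (nonzero diagonal) and permutation P give another valid pair.
Lam = np.diag([2.0, -0.5, 3.0, 1.5])
P = np.eye(4)[[2, 0, 3, 1]]              # a permutation matrix
A2 = A @ Lam @ P
S2 = np.linalg.inv(P) @ np.linalg.inv(Lam) @ S

# (A2, S2) reproduces X exactly, so the representation (A, S) is not unique.
assert np.allclose(A2 @ S2, X)
```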
4.2.2
Drawbacks of Direct ICA
It was argued in [Chen and Liu 2005; Liu, Kargupta and Ryan 2006] that ICA is in general not effective in breaking the rotation-based perturbation in practice, due to two basic difficulties in applying the ICA attack directly. First, there are usually significant correlations among attributes of X. Second, more than one attribute may have a Gaussian distribution. We would emphasize that these two difficulties generally hold in practice.
Example 6 We show the correlation matrix of a bank data set with five attributes as
below.


Corr(X) = [ 1.000 0.631 0.756 0.501 0.218 ]
          [ 0.631 1.000 0.723 0.357 0.137 ]
          [ 0.756 0.723 1.000 0.237 0.112 ]
          [ 0.501 0.357 0.237 1.000 0.440 ]
          [ 0.218 0.137 0.112 0.440 1.000 ]
From Example 6, we can observe that significant correlations exist among the attributes. To measure the non-Gaussianity, we apply the classic kurtosis measure

Kurt(z) = ( Σ_{i=1}^{n} (zi − z̄)⁴ ) / ( (n − 1) σ⁴ ) − 3

where z̄ is the mean, σ is the standard deviation, and n is the number of data points. The kurtosis is based on the fourth-order statistics, and the kurtosis of a standard normal distribution is zero. Below is the kurtosis of the 5 attributes in the bank data set.
Kurt(X) = ( −0.025  6.141  2.867  −0.025  170.593 )
It is easy to see that attributes 1 and 4 tend to be Gaussian distributed. One explanation why many attributes in practice have Gaussian distributions is that attributes in databases are usually a combination of some other hidden attributes. According to the Central Limit Theorem, the distribution of a sum of independent random variables tends towards a Gaussian distribution.
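The kurtosis measure above translates into a few lines of code. The data below are synthetic stand-ins for the bank attributes, chosen only to contrast a Gaussian with a heavy-tailed distribution.

```python
import numpy as np

def kurt(z):
    """Sample excess kurtosis as defined in the text: zero for a Gaussian."""
    z = np.asarray(z, dtype=float)
    n = z.size
    zbar = z.mean()
    sigma = z.std(ddof=1)                  # sample standard deviation
    return ((z - zbar) ** 4).sum() / ((n - 1) * sigma ** 4) - 3.0

rng = np.random.default_rng(3)
print(kurt(rng.standard_normal(100000)))   # near 0: Gaussian
print(kurt(rng.laplace(size=100000)))      # near 3: heavy-tailed, non-Gaussian
```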
4.3
Sample-Based Attack
Different customers may have different requirements and concerns regarding their individual data, and some customers may have very few privacy concerns. In practice, this gives attackers a chance to collect some individual data and launch attacks based on the knowledge of those data samples. In this section, we introduce several potential attacks under the scenario that a small data sample from the same population as X, denoted X̃, is available to attackers.
4.3.1
Attacks for Distance-Preserving-Based Projection
In the distance-preserving-based projection model, the transformation matrix is orthonormal: RᵀR = RRᵀ = I. It may seem that privacy is well preserved after rotation; however, a small known sample may be exploited by attackers to breach privacy completely.
Known-Sample-Based Regression Attack
Let us consider the case where X ∩ X̃ = X‡ ≠ ∅, which indicates that a subset of the original data is already known by attackers. Since many geometric properties (e.g., vector length, distance and inner product) are preserved, attackers can easily locate X‡'s corresponding part, Y‡, in the perturbed data set by comparing those values. From Y = RX, we know the same linear transformation holds between X‡ and Y‡: Y‡ = RX‡. Once the size of X‡ is at least rank(X), the transformation matrix R can be perfectly recovered through linear regression. We call this a linear-regression-based attack.
For Example 3 in Section 4.1.1, only 2 points are needed to break the privacy of the whole data set. Take points A and B as two customers' data known by attackers. The vector lengths and the distance between them can be calculated: |A| = 1.8055, |B| = 2.1819 and |A − B| = 0.7179. The preserved geometric properties in the perturbed data easily help attackers locate the corresponding transformed points A′ and B′ by comparing those calculated key values. Finally, a linear regression using A, B, A′ and B′ derives the transformation matrix R in this example.
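A minimal sketch of this linear-regression-based attack, under the assumption that the attacker has already located the perturbed images of d known records (possible because lengths and distances are preserved):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 3, 500

X = rng.uniform(0, 100, size=(d, n))               # private data
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthonormal R
Y = Q @ X                                          # distance-preserving perturbation

# The attacker knows d original records and their images in Y.
idx = [0, 1, 2]
X_known, Y_known = X[:, idx], Y[:, idx]

# Solve Y_known = R X_known for R by least squares
# (exact when the known columns have full rank, i.e. |X_known| >= rank(X)).
R_hat = np.linalg.lstsq(X_known.T, Y_known.T, rcond=None)[0].T

# With R recovered, the entire original data set is exposed.
X_hat = np.linalg.solve(R_hat, Y)
assert np.allclose(X_hat, X)
```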
Known-Sample-Based PCA Attack
For the case where X‡ is empty or too small, attackers may have little or no information about the exact data in X. However, the known sample X̃ is drawn from the same population as the original data X. By taking advantage of the distribution learned from the sample, authors in [Liu, Giannella and Kargupta 2006] proposed a known-sample-based PCA attack. The idea is briefly given as follows. Since the known sample and the private data share the same distribution, the eigenspaces (eigenvalues) of their covariance matrices are expected to be close to each other. As we know, the transformation here is a geometric rotation which does not change the shape of distributions (i.e., the eigenvalues derived from the sample data are close to those derived from the transformed data). Hence, the rotation angles between the eigenspace derived from the known samples and that derived from the transformed data can be easily identified. In other words, the rotation matrix R is recovered.
This attack is formally supported by the following theorem in [Liu, Giannella and
Kargupta 2006]. To be consistent, we use notations defined in this chapter. Figure 4.2
gives the complete attacking procedure.
Theorem 4.5 The eigenvalues of ΣX and ΣY are the same (ΛX = ΛY), and the transformation on X and Y is the transformation on the corresponding eigenvectors (RQX = QY D), where R is the orthonormal transformation matrix applied in the perturbation, D is a diagonal matrix whose diagonal entries are either 1 or −1, and ΣX and ΣY are the covariance matrices of X and Y, with eigenvalue decompositions ΣX = QX Λ QXᵀ and ΣY = QY Λ QYᵀ respectively.
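The procedure of Figure 4.2 can be sketched as follows. This is an illustration on synthetic data; the similarity function G is replaced here by a crude quantile comparison, which is an assumption of this sketch rather than the choice made in [Liu, Giannella and Kargupta 2006].

```python
import numpy as np
from itertools import product

def eig_sorted(C):
    """Eigen-decomposition with eigenvalues in descending order."""
    w, Q = np.linalg.eigh(C)
    order = np.argsort(w)[::-1]
    return w[order], Q[:, order]

def dist_gap(U, V):
    """Crude stand-in for G: compare empirical quantiles attribute by attribute."""
    grid = np.linspace(0.02, 0.98, 49)
    return sum(np.abs(np.quantile(U[i], grid) - np.quantile(V[i], grid)).mean()
               for i in range(U.shape[0]))

rng = np.random.default_rng(5)
d, n, k = 3, 5000, 500

# Private data with well-separated covariance eigenvalues and skewed marginals.
X = np.diag([6.0, 2.0, 0.7]) @ rng.exponential(size=(d, n))
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
Y = Q_true @ X                                 # distance-preserving perturbation

X_tilde = X[:, rng.choice(n, size=k, replace=False)]   # attacker's known sample

_, Qx = eig_sorted(np.cov(X_tilde))            # steps 1-2: covariances, eigenvectors
_, Qy = eig_sorted(np.cov(Y))

# Step 3: search the 2^d sign matrices D for the best distribution match.
D = min((np.diag(s) for s in product([1.0, -1.0], repeat=d)),
        key=lambda D: dist_gap(Qy @ D @ Qx.T @ X_tilde, Y))

X_hat = Qx @ D @ Qy.T @ Y                      # step 4: estimate of the private data
rel_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
print(rel_err)
```

The estimate is only as good as the sample covariance, so the relative error shrinks as k grows; signs of weakly expressed eigenvectors may occasionally be matched wrongly.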
4.3.2
Attacks for Non-Distance-Preserving-Based Projection
The important step in the regression-based attack shown above is to locate the sample's corresponding perturbed data in Y. The geometric properties preserved by the distance-preserving-based projection model help attackers do so. However, for the non-distance-preserving-based projection model and the general model with a random
input:   Y, an n × m matrix, a given perturbed data set
         X̃, an n × p matrix, a given subset of the original data
output:  X̂, an estimate of the original data set
BEGIN
1   Compute the covariance matrix ΣX̃ from X̃ and the covariance matrix ΣY from Y.
2   Perform eigenvalue decomposition on ΣX̃ and ΣY to get their eigenvectors:
        ΣX̃ = QX̃ ΛX̃ QX̃ᵀ        ΣY = QY ΛY QYᵀ
3   Choose D = argmax G(QY D QX̃ᵀ X̃, Y), where G is a function testing the
    similarity between the distributions of two data sets.
4   Estimate X as X̂ = QX̃ D QYᵀ Y.
END
Figure 4.2: Known-Sample-Based PCA Attack
transformation matrix R, those properties might not be preserved any more.
In this part we present an effective attack which may be exploited by attackers. Although we cannot apply ICA directly to estimate X from the perturbed data, we will show that there exists a possible attack based on ICA when a subset of the original data is available.
AK-ICA Attack
In [Guo and Wu 2007], we showed that attackers can reconstruct X closely by applying a proposed a-priori-knowledge ICA based attack (AK-ICA) when a (possibly small) sample of data, X̃, is available to attackers. Let X̃ ⊂ X be this sample data set, consisting of k data records and d attributes.
The core idea of AK-ICA is to apply the traditional ICA on the known sample data set,
X̃, and perturbed data set, Y , to get their mixing matrices and independent components
respectively, and reconstruct the original data by exploiting the relationships between
them. Figure 4.3 shows the procedure of such attack.
The first step of this attack is to derive the ICA representations (Ax̃, Sx̃) and (Ay, Sy) from the a-priori known subset X̃ and the perturbed data Y respectively. Since in general we cannot find a unique representation (A, S) for a given X (recall that
input:   Y, a given perturbed data set
         X̃, a given subset of the original data
output:  X̂, a reconstructed data set
BEGIN
1   Apply ICA to X̃ and Y to get
        X̃ = Ax̃ Sx̃        Y = Ay Sy
2   Derive the transformation matrix J by comparing the distributions of Sx̃ and Sy.
3   Reconstruct X approximately as X̂ = Ax̃ J Sy.
END
Figure 4.3: AK-ICA Attack
X = AS = (AΛP)(P⁻¹Λ⁻¹S) for any diagonal matrix Λ and permutation matrix P in Section 4.2.1), S is usually required to have unit variance to avoid the scaling ambiguity in ICA. As a consequence, only the order and signs of the signals S might differ. In the following, proofs are given to show that there exists a transformation matrix J such that X̂ = Ax̃ J Sy is an estimate of the original data X. We also present how to identify J with an example.
Existence of Transformation Matrix J
To derive the transformation matrix J, let us first assume X is given. Applying independent component analysis, we get X = Ax Sx, where Ax is the mixing matrix and Sx are the independent signals.
Corollary 2 The mixing matrices Ax , Ax̃ are expected to be close to each other and the
underlying signals Sx̃ can be approximately regarded as a subset of Sx .
Ax̃ ≈ Ax Λ1 P1
Sx̃ ≈ P1⁻¹ Λ1⁻¹ S̃x          (4.3)
Proof. Consider an element xij in X: it is determined by the i-th row of Ax, denoted ai = (ai1, ai2, · · · , aid), and the j-th signal vector sj = (s1j, s2j, · · · , sdj)ᵀ:

xij = ai1 s1j + ai2 s2j + · · · + aid sdj

Let x̃p be a column vector in X̃ which is randomly sampled from X. Assume x̃p = xj; then the i-th element of this vector, x̃ip, can also be expressed through ai and the corresponding signal vector sj:

x̃ip = ai1 s1j + ai2 s2j + · · · + aid sdj
Thus, for a given column vector in X̃, we can always find a corresponding signal vector in Sx and reconstruct it through the mixing matrix Ax. Since Sx is a set of independent components, its sample subset S̃x ⊂ Sx can also be regarded as a set of independent components of X̃ when the sample size of X̃ is large.
There exist a diagonal matrix Λ1 and a permutation matrix P1 such that

X̃ = Ax̃ Sx̃ ≈ Ax S̃x = (Ax Λ1 P1)(P1⁻¹ Λ1⁻¹ S̃x)

hence

Ax̃ ≈ Ax Λ1 P1        Sx̃ ≈ P1⁻¹ Λ1⁻¹ S̃x
Corollary 3 Sx and Sy are similar to each other, and there exist a diagonal matrix Λ2 and a permutation matrix P2 such that

Sy ≈ P2⁻¹ Λ2⁻¹ Sx
Proof.

Y = RX = R(Ax Sx) = (RAx)Sx

Since permutation may affect the order and signs of the signals Sy, we have

Y = Ay Sy ≈ (RAx Λ2 P2)(P2⁻¹ Λ2⁻¹ Sx)          (4.4)

By comparing the above two equations, we have

Ay ≈ RAx Λ2 P2        Sy ≈ P2⁻¹ Λ2⁻¹ Sx
Theorem 4.6 (Existence of J) There exists a transformation matrix J such that

X̂ = Ax̃ J Sy ≈ X          (4.5)

where Ax̃ is the mixing matrix of X̃ and Sy are the independent components of the perturbed data Y.
Proof. Since

Sx̃ ≈ P1⁻¹ Λ1⁻¹ S̃x        Sy ≈ P2⁻¹ Λ2⁻¹ Sx

and S̃x is a subset of Sx, we can find a transformation matrix J that matches the independent components between Sy and Sx̃. Hence,

J P2⁻¹ Λ2⁻¹ = P1⁻¹ Λ1⁻¹
J = P1⁻¹ Λ1⁻¹ Λ2 P2

From Equations 4.3 and 4.4 we have

X̂ = Ax̃ J Sy
   ≈ (Ax Λ1 P1)(P1⁻¹ Λ1⁻¹ Λ2 P2)(P2⁻¹ Λ2⁻¹ Sx)
   = Ax Sx
   = X
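The chain of equalities above can be checked numerically. The sketch below idealizes the ICA step: instead of running an ICA algorithm, it hands the attacker exact factors scrambled by known Λ1, P1, Λ2, P2, and verifies that J = P1⁻¹Λ1⁻¹Λ2P2 recovers X.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 4, 1000

S_x = rng.laplace(size=(d, n))               # independent sources, X = A_x S_x
A_x = rng.standard_normal((d, d))
X = A_x @ S_x
R = rng.standard_normal((d, d))              # arbitrary transformation
Y = R @ X

# ICA ambiguities, fixed and known here for the purpose of verification.
L1 = np.diag([2.0, -1.0, 0.5, 3.0]);  P1 = np.eye(d)[rng.permutation(d)]
L2 = np.diag([-1.5, 2.0, 1.0, -0.5]); P2 = np.eye(d)[rng.permutation(d)]

# Idealized ICA outputs for the sample and for Y (here the "sample" is all of X).
A_xt = A_x @ L1 @ P1                          # plays the role of Ax̃
S_y = np.linalg.inv(P2) @ np.linalg.inv(L2) @ S_x   # Y = (R A_x L2 P2) S_y

J = np.linalg.inv(P1) @ np.linalg.inv(L1) @ L2 @ P2
X_hat = A_xt @ J @ S_y
assert np.allclose(X_hat, X)                  # Theorem 4.6: X_hat ≈ X
```

With a real ICA algorithm the relations hold only approximately, which is why the theorem is stated with ≈.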
Determining J
The ICA model given in Definition 4.5 implies no ordering of the independent components. The reason is that, both s and A being unknown, we can freely change the order of the terms in the sum in Definition 4.5 and call any of the independent components the first one. Formally, a permutation matrix P and its inverse can be substituted into the model to give another solution in another order. As a consequence, in our case, the i-th component of Sy may correspond to the j-th component of Sx̃. Hence we need to figure out how to find the transformation matrix J.
Since Sx̃ is a subset of Sx, each pair of corresponding components follows similar distributions. Hence our strategy is to analyze the distributions of the two signal data sets, Sx̃ and Sy. As discussed before, the signals derived by ICA are normalized, so the scalar for each attribute is either 1 or −1, which can also be easily determined from the distributions.
Let Sx̃^(i) and Sy^(j) denote the i-th component of Sx̃ and the j-th component of Sy, and let fi and fj′ denote their density distributions respectively. In this chapter, we use the information difference measure I to measure the similarity of two distributions [Agrawal and Agrawal 2001]:

I(fi, fj′) = (1/2) E[ ∫_ΩZ | fi(z) − fj′(z) | dz ]          (4.6)

The above metric equals half the expected value of the L1-norm between the distribution of the i-th component of Sx̃ and that of the j-th component of Sy. It is also equal to 1 − α, where α is the area shared by both distributions. The smaller I(f, f′), the more similar the pair of components. The matrix J is determined so that J[f1′, f2′, · · · , fd′]ᵀ ≈ [f1, f2, · · · , fd]ᵀ.
In the following, we illustrate how it works using an example.
Example 7 The data set X in this example contains 5 attributes and 50,000 records. From X, 1,000 records are randomly extracted, denoted X̃. The original data X is perturbed by Y = RX.
Applying ICA to X̃ and Y, we get their ICA representations (Ax̃, Sx̃) and (Ay, Sy) respectively. For each pair of components of Sx̃ and Sy, we apply the information difference measure in Equation 4.6 to compute their similarity. The derived transformation matrix J is:

J = [ 0  1  0  0  0 ]
    [ 1  0  0  0  0 ]
    [ 0  0 -1  0  0 ]
    [ 0  0  0  0  1 ]
    [ 0  0  0 -1  0 ]

which means the components [Sx̃^(1), Sx̃^(2), Sx̃^(3), Sx̃^(4), Sx̃^(5)] derived from X̃ correspond to [Sy^(2), Sy^(1), Sy^(3), Sy^(5), Sy^(4)] respectively, and the pairs (Sx̃^(3), Sy^(3)) and (Sx̃^(5), Sy^(4)) have a −1 scalar difference. Figure 4.4 shows the density distributions of two matched components (Sx̃^(1), Sy^(2)).

[Figure 4.4: Density distributions of two matched components: (a) Sx̃^(1), component 1; (b) Sy^(2), component 2]
When we have X, its ICA representation (Ax, Sx) can also be derived. From Corollaries 2 and 3, we can get P1⁻¹Λ1⁻¹ and Λ2P2 as follows:

P1⁻¹Λ1⁻¹ = [ 0  1  0  0  0 ]
           [ 1  0  0  0  0 ]
           [ 0  0  1  0  0 ]
           [ 0  0  0  0 -1 ]
           [ 0  0  0 -1  0 ]

Λ2P2 = [ 1  0  0  0  0 ]
       [ 0  1  0  0  0 ]
       [ 0  0 -1  0  0 ]
       [ 0  0  0  1  0 ]
       [ 0  0  0  0 -1 ]

We can easily check that the derived transformation matrix J equals P1⁻¹Λ1⁻¹Λ2P2.
4.3.3
Attacks for General Projection
The general-linear-transformation-based perturbation model is an integration of the additive-noise-based perturbation model and the projection-based perturbation model. Therefore, it absorbs properties of both models and brings more randomness to the protected data.
As previous works [Huang, Du and Chen 2005; Kargupta et al. 2003] indicated, spectral filtering and PCA-based methods work very well against the additive-noise-based perturbation. However, it is practically impossible to breach the privacy of the projection-based perturbation this way, since spectral properties can hardly be preserved in the perturbed data, especially when the original data is projected to a lower dimensional space. Thus, we cannot filter out the noise by extracting eigenvectors (eigenvalues) of the original X from the rotated data Y. It is also hard to derive the linear transformation matrix in the non-distance-preserving-based projection model by simply analyzing the spectral information.
Since the noise-free ICA model can be extended to a noisy-ICA model, AK-ICA can also be extended to attack the general-linear-transformation-based perturbation by solving the noisy-ICA problem.
Definition 4.6 (Noisy ICA model) [Hyvarinen, Karhunen and Oja 2001]
ICA of a random vector x = (x1, · · · , xm)ᵀ consists of estimating the following generative model for the data:

x = As + e

or

X = AS + E

where the latent variables (components) si in the vector s = (s1, · · · , sn)ᵀ are assumed independent. The matrix A is a constant m × n mixing matrix, and e is an m-dimensional random noise vector.
PCA, as well as spectral filtering, is a purely second-order statistical method: only covariances between the observed variables are used in the estimation. This is due to the assumption of Gaussianity of the components. The components are further assumed to be uncorrelated, which also implies independence in the case of Gaussian data. On the contrary, components in ICA are assumed to be statistically independent and non-Gaussian.
The spectral filtering method [Kargupta et al. 2003] based on random matrix theory and the PCA-based reconstruction method [Huang, Du and Chen 2005] can only be used to approximately reconstruct the private data under additive-noise-based perturbation (i.e., Y = X + E). The reason is that both methods utilize the spectral properties of the randomized data to separate the additive noise from the original data; to do so, some properties of E (e.g., its covariance matrix) must be known. Even though it is possible to remove part of the noise in some dimensions, the spectral properties are helpless against rotation. On the contrary, the AK-ICA attack can effectively reconstruct the data from the general linear transformed data (i.e., Y = RX + E, where R can be any rotation matrix) when only a small sample subset is available to attackers. The attackers do not even need any information about E.
Experiments in the next section show that AK-ICA can even outperform the spectral-filtering-based method for the additive-noise-based perturbation when the strength of E is large. In other words, spectral filtering and PCA-based methods are usually not robust to relatively large noise. The AK-ICA method is expected to be more robust to Gaussian noise since the objective functions used in ICA methods are higher-order statistics (e.g., kurtosis).
Both ICA and PCA formulate a general objective function that defines the interestingness of a linear representation, and then maximize that function. Both are related to factor analysis, though under the contradictory assumptions of Gaussianity and non-Gaussianity, respectively. PCA uses only second-order statistics, while ICA can use both second- and fourth-order cumulants.
4.4
Evaluation
In our AK-ICA method, we applied the JADE package³ implemented by Jean-Francois Cardoso to conduct the ICA analysis. JADE is a cumulant-based batch algorithm for source separation [Cardoso 1999].
Since our AK-ICA attack can reconstruct individual data in addition to its distribution, in this study we cast our accuracy analysis in terms of both matrix norms and individual-wise errors. We measure the reconstruction errors using the following measures:
RE = (1/(d·n)) Σ_{i=1}^{d} Σ_{j=1}^{n} | (xij − x̂ij) / xij |

RE-Ri = (1/n) Σ_{j=1}^{n} | (xij − x̂ij) / xij |        i = 1, · · · , d

RE-Cj = (1/d) Σ_{i=1}^{d} | (xij − x̂ij) / xij |        j = 1, · · · , n

F-AE(X, X̂) = ‖X̂ − X‖F

F-RE(X, X̂) = ‖X̂ − X‖F / ‖X‖F

where X and X̂ denote the original data and the estimated data respectively, and ‖ · ‖F denotes the Frobenius norm⁴.
All the above measures show how closely one can estimate the original data X from its perturbed data Y. Here we follow the tradition of using this difference as the measure to quantify how much privacy is preserved. Basically, RE (relative error) represents the average of the relative errors of individual data points. RE-Ri represents the average relative error of the i-th attribute, while RE-Cj represents the average relative error of the j-th record. Since we have 50,000 records in our bank data set, we only list the minimum and maximum values of RE-Cj in our results (see Tables 4.1, 4.2 and 4.4). However, we would point out that RE-Cj is an important measure since it shows the privacy breach of each individual customer. F-AE (F-RE) denotes the absolute (relative) errors

³ http://www.tsi.enst.fr/icacentral/algos.html
⁴ The Frobenius norm of X: ‖X‖F = ( Σ_{i=1}^{d} Σ_{j=1}^{n} xij² )^{1/2}
between X and its estimation X̂ in terms of Frobenius norm, which gives perturbation
evaluation a simplicity that makes it easier to interpret.
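The error measures above translate directly into code (assuming X has no zero entries, since RE divides by xij):

```python
import numpy as np

def recon_errors(X, X_hat):
    """Compute the measures defined above: RE, RE-Ri, RE-Cj, F-AE, F-RE."""
    rel = np.abs((X - X_hat) / X)          # element-wise relative errors
    return {
        "RE": rel.mean(),
        "RE-Ri": rel.mean(axis=1),         # one value per attribute (row)
        "RE-Cj": rel.mean(axis=0),         # one value per record (column)
        "F-AE": np.linalg.norm(X - X_hat),
        "F-RE": np.linalg.norm(X - X_hat) / np.linalg.norm(X),
    }

# Small worked example: every entry is off by 10%, so RE = F-RE = 0.1.
X = np.array([[10.0, 20.0], [40.0, 50.0]])
X_hat = np.array([[11.0, 18.0], [44.0, 45.0]])
m = recon_errors(X, X_hat)
print(m["RE"], m["F-RE"])                  # both 0.1
```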
4.4.1
Effect of Noise and the Transformation Matrix
In this experiment, we evaluate the performance of AK-ICA on the general scenario Y = RX + E. In this scenario, both R and E determine the perturbation of the original data. We examine how the reconstruction accuracy is affected by these two factors by changing the strength of the noise E in the following four cases with various transformation matrices.
• Case 1: R = I
• Case 2: RᵀR = I
• Case 3: R1 is a random matrix with det(R1) = 0.444, ‖R1‖F = 3.167.
• Case 4: R2 is another random matrix with det(R2) = 2.48 × 10⁹, ‖R2‖F = 281.8.
For Cases 3 and 4, R1 and R2 were generated randomly with significantly different determinant and Frobenius norm values. We apply the term signal-to-noise ratio (SNR) to quantify the relative amount of noise added to the actual data:
SNR = 20 log( ‖X‖F / ‖E‖F )
In all four cases, we have 1000 known samples and introduce additive Gaussian noise from 20 db to −5 db in terms of SNR. An SNR of 0 db means that the noise is as strong as the original data.
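For experiments like these, Gaussian noise can be rescaled to hit an exact target SNR; the helper below is our own illustration, not the dissertation's code.

```python
import numpy as np

def noise_at_snr(X, snr_db, rng):
    """Gaussian noise E scaled so that 20*log10(||X||_F / ||E||_F) == snr_db."""
    E = rng.standard_normal(X.shape)
    target_norm = np.linalg.norm(X) / 10 ** (snr_db / 20.0)
    return E * (target_norm / np.linalg.norm(E))

rng = np.random.default_rng(7)
X = rng.uniform(1, 100, size=(5, 1000))
E = noise_at_snr(X, 0.0, rng)              # 0 db: noise as strong as the data
snr = 20 * np.log10(np.linalg.norm(X) / np.linalg.norm(E))
assert np.isclose(snr, 0.0)
```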
In Figure 4.5, we plot the reconstruction error (RE) against SNR for all four cases. Complete results in terms of the various reconstruction error measures are given in Table 4.1. In all four cases, we observe that the proposed AK-ICA based reconstruction method is quite robust when small or medium noise is added. However, with large noise (SNR < 5db), the reconstruction accuracy noticeably degrades. We will explore further how ICA methods based on higher-order statistics are affected by large Gaussian noise in future work.
Case 4 (random R2):
SNR    F-AE   F-RE   RE      RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min  RE-Cj max
 20    3641   0.108  0.1362  0.072  0.102  0.127  0.331  0.049  0.053      8.191
 15    3387   0.100  0.124   0.067  0.101  0.112  0.285  0.055  0.034      6.149
 10    3390   0.100  0.118   0.071  0.093  0.095  0.256  0.075  0.037      1.163
  5    5285   0.156  0.127   0.152  0.086  0.129  0.158  0.113  0.034      2.193
  0    9283   0.274  0.230   0.265  0.122  0.232  0.290  0.242  0.0687     1.839
 -5    14538  0.430  0.464   0.415  0.513  0.200  0.766  0.423  0.164      1.934

Case 3 (random R1):
 20    3493   0.103  0.133   0.068  0.099  0.118  0.327  0.055  0.043      6.103
 15    3371   0.099  0.134   0.065  0.088  0.097  0.348  0.074  0.045      3.228
 10    3775   0.112  0.151   0.075  0.070  0.079  0.413  0.116  0.037      1.013
  5    6232   0.184  0.235   0.152  0.084  0.107  0.621  0.209  0.060      8.784
  0    15808  0.468  0.318   0.433  0.243  0.470  0.243  0.199  0.078      22.85
 -5    19067  0.464  0.358   0.312  0.304  0.394  0.396  0.382  0.231      1.443

Case 2 (RᵀR = I):
 20    3415   0.101  0.125   0.067  0.104  0.120  0.291  0.044  0.052      1.013
 15    3787   0.112  0.143   0.076  0.097  0.124  0.352  0.064  0.049      1.169
 10    3576   0.106  0.128   0.075  0.093  0.098  0.292  0.082  0.043      1.213
  5    5448   0.161  0.188   0.133  0.089  0.115  0.443  0.162  0.053      1.264
  0    14018  0.415  0.281   0.418  0.166  0.406  0.202  0.213  0.066      1.265
 -5    12895  0.382  0.373   0.369  0.217  0.319  0.316  0.144  0.07       2.060

Case 1 (R = I):
 20    3453   0.102  0.129   0.068  0.103  0.118  0.311  0.045  0.045      1.039
 15    3495   0.103  0.131   0.069  0.102  0.112  0.312  0.057  0.048      1.374
 10    3574   0.106  0.127   0.073  0.089  0.104  0.289  0.079  0.046      0.872
  5    5438   0.161  0.189   0.132  0.093  0.114  0.442  0.168  0.053      1.576
  0    13249  0.392  0.273   0.426  0.161  0.382  0.190  0.204  0.062      1.438
 -5    12349  0.365  0.443   0.417  0.161  0.068  1.272  0.297  0.109      2.984

Table 4.1: Reconstruction error vs. SNR for four cases when k = 1000
[Figure 4.5: The effect of noise E on RE: reconstruction error (RE) vs. SNR (db) for the four cases R = I, RᵀR = I, random R1, and random R2]
From Table 4.1 and Figure 4.5, we may also observe that the proposed AK-ICA reconstruction method is very stable across all four cases, which means it is insensitive to the selection of the rotation matrix R. In other words, once attackers have a sample subset X̃, they can always get similar estimates no matter how the database owner changes Y by choosing a different R. This is because Sy is stable and close to Sx, since the ICA technique itself is robust to the rotation matrix R. This is a major advantage over the spectral filtering and PCA-based reconstruction methods, which can only deal with the additive noise case (R = I).
4.4.2
Effect of the Sample Size
In this experiment, we work on the scenario Y = RX without additive noise E. We also fix the rotation matrix R as an orthonormal matrix satisfying RᵀR = I; it can be randomly generated based on the Haar distribution [Stewart 1980]. We change the sample size k of X̃ from 20 to 2000. Note that all chosen k values are small compared with the size of the original data.
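One standard way to draw such a matrix, sketched below, is to QR-decompose a Gaussian matrix and fix the column signs (the dissertation cites [Stewart 1980] for the underlying theory; this particular construction is our illustration):

```python
import numpy as np

def haar_orthonormal(d, rng):
    """Random orthonormal matrix: QR of a Gaussian matrix, with a sign fix on
    the columns so the resulting distribution is uniform (Haar)."""
    Z = rng.standard_normal((d, d))
    Q, Rf = np.linalg.qr(Z)
    return Q * np.sign(np.diag(Rf))       # flip columns where diag(Rf) < 0

rng = np.random.default_rng(8)
R = haar_orthonormal(5, rng)
assert np.allclose(R.T @ R, np.eye(5))    # R^T R = I: distances are preserved
```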
Figure 4.6 shows that the reconstruction error (in terms of F-RE and RE in Figure 4.6(a), and RE-Ri for each attribute in Figure 4.6(b)) decreases when the sample size k is
Sample size  F-AE   F-RE   RE     RE-R1  RE-R2  RE-R3  RE-R4  RE-R5  RE-Cj min  RE-Cj max
20           12791  0.378  0.372  0.529  0.075  0.112  1.019  0.122  0.192      9.024
50           8574   0.254  0.286  0.068  0.222  0.372  0.458  0.311  0.181      4.267
80           7748   0.231  0.238  0.101  0.314  0.233  0.387  0.153  0.093      5.209
100          7304   0.216  0.225  0.104  0.054  0.309  0.435  0.225  0.144      3.233
200          7224   0.214  0.198  0.238  0.259  0.116  0.293  0.084  0.094      4.898
300          6463   0.191  0.187  0.088  0.049  0.294  0.418  0.087  0.092      5.244
400          5444   0.161  0.170  0.147  0.113  0.133  0.298  0.158  0.079      4.245
500          5225   0.155  0.126  0.058  0.097  0.263  0.194  0.014  0.095      0.998
800          3829   0.113  0.128  0.073  0.034  0.045  0.230  0.259  0.087      0.848
1000         3458   0.102  0.128  0.068  0.105  0.125  0.301  0.038  0.059      0.970
2000         2297   0.068  0.094  0.042  0.009  0.041  0.250  0.128  0.076      1.351

Table 4.2: Reconstruction error vs. sample size (k) when Y = RX
[Figure 4.6: Reconstruction error vs. varying known sample size k under Y = RX: (a) F-RE and RE; (b) RE-Ri for attributes 1-5]

[Figure 4.7: Reconstruction error vs. random samples with the fixed size k = 50: (a) F-RE and RE; (b) RE-Ri for attributes 1-5]
increased. A similar trend also holds for RE-Cj for each record, whose minimum and maximum values we show in Table 4.2. This is because the more sample data we have, the better the derived independent components match. When we have 500 known records (which account for 1% of the original data), we can achieve very low reconstruction error (F-RE = 0.155, RE = 0.126). When the sample size is decreased, more errors are introduced. However, even with only 20 known samples (which account for 0.04% of the original data), we can still achieve very close estimates for some attributes (e.g., RE-Ri = 0.075 for attribute 2).
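For concreteness, the error measures can be sketched as follows (a NumPy illustration; the exact definitions of F-RE and RE-Ri are given earlier in the dissertation, and the Frobenius-norm ratios below are assumed forms for illustration only):

```python
import numpy as np

def f_re(X, X_hat):
    """Overall relative error in Frobenius norm (assumed form:
    ||X_hat - X||_F / ||X||_F)."""
    return np.linalg.norm(X_hat - X) / np.linalg.norm(X)

def re_rows(X, X_hat):
    """Per-attribute relative error, one value per row (attributes as
    rows, records as columns, matching Y = RX)."""
    return np.linalg.norm(X_hat - X, axis=1) / np.linalg.norm(X, axis=1)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 1000))
X_hat = X + 0.1 * rng.standard_normal(X.shape)   # a 10%-noise "reconstruction"
overall = f_re(X, X_hat)
per_attr = re_rows(X, X_hat)
```

With a 10%-strength residual, the overall ratio comes out near 0.1, which is the scale of the errors reported in Table 4.2.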
In particular, when k is small, we also evaluate how different sample sets X̃ of the same size k affect the AK-ICA reconstruction method. Here we randomly chose 10 different sample sets with the fixed size k = 50. Figure 4.7 shows the reconstruction errors for the 10 different sample sets. The performance of our AK-ICA reconstruction method is not very stable in this small-sample case. For example, the first run achieves an F-RE of 0.1 while the third run achieves 0.44, as shown in Figure 4.7(a). The instability here is mainly caused by Ax̃, which is derived from X̃; since Y = RX + E is fixed, the derived Sy does not change.
We also observed that for each particular attribute, the reconstruction accuracy in different rounds is not stable either. As shown in Figure 4.7(b), attribute 5 has the largest error among all the attributes in round 5; however, it has the smallest error in round 7. This is because the reconstruction accuracy of one attribute is mainly determined by the accuracy of the estimate of the corresponding column vector in Ax̃. This instability can also be observed in Figure 4.6(b).
4.4.3 Comparing AK-ICA and Known-Sample-Based PCA Attack
In this experiment, we evaluate the reconstruction performance of AK-ICA and the known-sample-based PCA attack of [Liu, Giannella and Kargupta 2006]. Since the known-sample-based PCA attack cannot handle additive noise, we compare the two attacking methods on the scenario Y = RX, with no noise involved. We fix the sample ratio at 1% and apply different transformation matrices. Here R is expressed as R = R1 + cR2, where R1 is a random orthonormal matrix, R2 is a random matrix with elements uniformly distributed in [-0.5, 0.5], and c is a coefficient. Initially, c is set to 0, which guarantees the orthonormality of R. By increasing c, R gradually loses orthonormality and tends toward an arbitrary transformation.
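The construction of R can be sketched as follows (a minimal NumPy example; the QR-based draw of the orthonormal R1 is our own choice of sampling procedure, as the text does not specify one):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_transform(p, c):
    """Build R = R1 + c*R2: R1 is a random orthonormal matrix (drawn
    here via QR of a Gaussian matrix, one common choice), R2 has
    entries uniform in [-0.5, 0.5], and c tunes the departure from
    orthonormality."""
    r1, _ = np.linalg.qr(rng.standard_normal((p, p)))
    r2 = rng.uniform(-0.5, 0.5, size=(p, p))
    return r1 + c * r2

R0 = make_transform(5, 0.0)                  # c = 0: R is orthonormal
err0 = np.linalg.norm(R0.T @ R0 - np.eye(5))
R5 = make_transform(5, 5.0)                  # large c: far from orthonormal
err5 = np.linalg.norm(R5.T @ R5 - np.eye(5))
```

The deviation ||R^T R - I|| grows with c, which is exactly the non-orthonormality swept in Table 4.3.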
From Figures 4.8(a) and 4.8(b) we can observe that our AK-ICA attack is robust to various transformations: the reconstruction errors do not change much when the transformation matrix R is made more non-orthonormal. In contrast, the PCA attack only works when R is orthonormal or close to orthonormal. When the transformation becomes more non-orthonormal (with the increase of c, as shown in Table 4.3), the reconstruction accuracy of the PCA attack degrades significantly. For example, when we set c = 5, the relative reconstruction errors of the PCA attack are more than 200% (F-RE = 2.1414, RE = 2.1843) while those of the AK-ICA attack are less than 20% (F-RE = 0.1444, RE = 0.1793).
[Figure 4.8: Reconstruction error of AK-ICA vs. PCA attacks by varying R. (a) F-RE; (b) RE, both plotted as functions of c.]
4.4.4 Comparing AK-ICA and Spectral-Filtering-Based Attack
To compare AK-ICA with the spectral-filtering-based method [Kargupta et al. 2003], we choose the additive-noise-based perturbation model with no projection (Y = X + E). As we introduced in Chapter 3, the spectral filtering method assumes the covariance matrix of E is given, in order to separate the principal components from the perturbed data. The reconstruction accuracy of the spectral-filtering-based method is mainly determined by how well the principal components can be separated from the perturbed data. Table 4.4 shows how the spectral filtering method performs when we vary the strength of the additive noise E from 20 dB to -5 dB: the reconstruction accuracy decreases significantly when large noise is introduced.
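A simplified sketch of such a filter follows (NumPy; the eigenvalue cutoff based on a Marchenko-Pastur-style noise bound is a common choice and stands in for, but is not necessarily identical to, the exact rule of [Kargupta et al. 2003]):

```python
import numpy as np

def spectral_filter(Y, noise_var):
    """Sketch of spectral-filtering reconstruction for Y = X + E
    (attributes as rows): eigenvalues of cov(Y) at or below a noise
    bound are treated as noise, and Y is projected onto the remaining
    principal directions."""
    p, n = Y.shape
    cutoff = noise_var * (1.0 + np.sqrt(p / n)) ** 2   # MP-style noise bound
    mu = Y.mean(axis=1, keepdims=True)
    vals, vecs = np.linalg.eigh(np.cov(Y))
    keep = vecs[:, vals > cutoff]                      # "signal" eigenvectors
    return mu + keep @ keep.T @ (Y - mu)

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))            # rank-2 signal in 5 dimensions
X = A @ rng.standard_normal((2, 2000))
E = 0.5 * rng.standard_normal(X.shape)     # known noise variance 0.25
X_hat = spectral_filter(X + E, noise_var=0.25)
raw_err = np.linalg.norm(E) / np.linalg.norm(X)
err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
```

Projecting out the noise-only directions reduces the error relative to using the perturbed data directly, as long as the signal eigenvalues stand clearly above the noise bound.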
We plot the comparison of reconstruction accuracy between AK-ICA and spectral filtering with increasing strength of E in Figure 4.9. We assume 1,000 records are available to attackers. We can see the spectral filtering method outperforms AK-ICA when relatively small noise (SNR > 5 dB) is introduced, while AK-ICA outperforms
c      ||cR2||F / ||R1||F   AK-ICA F-RE   AK-ICA RE   PCA F-RE   PCA RE
0      0                    0.0824        0.1013      0.013      0.0126
0.2    0.1299               0.1098        0.1003      0.0451     0.0448
0.3    0.1988               0.0701        0.0618      0.1288     0.1247
0.4    0.3121               0.1336        0.1631      0.1406     0.1305
0.5    0.3011               0.1867        0.2436      0.1825     0.1704
0.7    0.4847               0.1227        0.1188      0.2415     0.2351
1      0.539                0.065         0.0606      0.35       0.334
1.25   0.804                0.1177        0.1399      0.5565     0.5695
1.5    0.8059               0.1533        0.169       0.3336     0.3354
2      1.2755               0.1709        0.1523      0.7598     0.7368
2.5    1.5148               0.0816        0.1244      0.8906     0.8946
3      1.9321               0.1142        0.1373      0.6148     0.592
3.5    2.1238               0.1303        0.1566      1.631      1.6596
4      2.4728               0.1249        0.1314      1.5065     1.5148
4.5    3.049                0.0707        0.0543      1.0045     0.9815
5      3.4194               0.1444        0.1793      2.1414     2.1843

Table 4.3: Reconstruction error of AK-ICA vs. PCA attacks by varying R
when relatively large noise (SNR ≤ 5 dB) is introduced. We emphasize again that the spectral filtering (and PCA-based) method can only reconstruct the data in the additive noise case (i.e., Y = X + E), while our AK-ICA approach works robustly in all cases (i.e., Y = RX + E).
[Figure 4.9: Reconstruction error (RE) vs. SNR for SF and AK-ICA (with fixed size k = 1000) when Y = X + E.]
4.5 Summary
In this chapter, we have examined the effectiveness of general projection in privacy preserving data mining. It was suggested in [Liu, Giannella and Kargupta 2006] that the non-isometric projection approach is effective in preserving privacy since it is resilient to the PCA attack, which was designed for the distance-preserving projection approach. We proposed an AK-ICA attack, which can be exploited by attackers to breach privacy from non-isometrically transformed data. Our theoretical analysis has shown that the proposed attack poses a threat to all projection-based privacy preserving methods when a small sample data set is available to attackers. We argued that this is a genuine concern that needs to be addressed in practice.
SNR (dB)   F-AE     F-RE    RE      RE-R1   RE-R2   RE-R3    RE-R4   RE-R5   RE-Cj min   RE-Cj max
20         711.5    0.021   0.018   0.017   0.016   0.021    0.022   0.016   0.005       0.133
15         813.1    0.024   0.032   0.030   0.029   0.037    0.038   0.029   0.009       0.209
10         2371     0.070   0.081   0.052   0.052   0.066    0.182   0.051   0.016       3.026
5          5285     0.156   0.156   0.093   0.278   0.119    0.195   0.094   0.035       2.997
0          13213    0.390   0.457   0.173   0.463   0.2111   0.599   0.838   0.337       11.44
-5         17719    0.523   0.611   0.296   0.678   0.372    0.793   0.917   0.535       1.470

Table 4.4: Reconstruction error vs. SNR for spectral filtering method when Y = X + E
CHAPTER 5: DISCLOSURE ANALYSIS OF THE MODEL-BASED PRIVACY PRESERVING APPROACH
The issue of confidentiality and privacy in general databases has become increasingly prominent in recent years. Disclosures that can occur as a result of inferences by snoopers fall into two classes: identity disclosure and value disclosure. Identity disclosure relates to the disclosure of the identity of an individual in the database, while value disclosure relates to the disclosure of the value of a certain confidential attribute of that individual. To prevent disclosures, various randomization-based approaches (e.g., [Adam and Wortman 1989; Agrawal and Srikant 2000; Palley and Simonoff 1987; Sarathy and Muralidhar 2002]) have been investigated. A key element in preserving the privacy and confidentiality of sensitive data is the ability to evaluate the extent of all potential disclosure for such data. In other words, we need to be able to answer to what extent confidential information in a perturbed or transformed database can be compromised by attackers or snoopers. This is a major challenge for current randomization-based approaches.
To evaluate the privacy and confidentiality residing in general databases, which contain both categorical and numerical attributes, the authors in [Wu, Wang and Zheng 2005] proposed a general framework for modeling general databases using the general location model. One advantage of the general location model is that it can be used to analyze both identity disclosure and value disclosure, since it integrates both categorical and numerical attributes in one model. The general location model is defined in terms of the marginal distribution of the categorical attributes and the conditional distribution of the numerical attributes given each cell determined by the categorical attributes. The former is described by a multinomial distribution on the cell counts when we summarize the categorical part as a multi-dimensional contingency table. The numerical attributes of tuples in each cell are assumed to follow a multivariate normal distribution with parameters µ, Σ, where µ is a vector of means and Σ is a covariance matrix. It is not surprising that those parameters (e.g., µ, Σ) may be used by attackers or snoopers to derive confidential information. For example, from a distribution such as "the wages of customers from zip = 28223 and race = Asian follow a normal distribution with mean 70k and standard deviation 10k", snoopers can safely derive a 95% coverage interval [50.4k, 89.6k]. This derived coverage interval may violate customers' privacy requirements.
To continue this line of previous work, in this dissertation we focus on value disclosure which can occur as a result of inferences by attackers or snoopers from the multivariate normal distributions. Furthermore, we consider various factors in general databases and conduct disclosure analysis for the following scenarios.
• Basic disclosure scenario - All numerical attributes contained in the database are sensitive attributes. Various correlations exist among those attributes.
• Conditional disclosure scenario - The database contains other non-confidential numerical attributes apart from the confidential ones. Here we assume non-confidential attributes are not perturbed, as they may be retrieved accurately by snoopers from other public sources. One problem arises as snoopers may exploit the relationship between non-confidential and confidential attributes to predict individual values of confidential attributes.
• Linear combination scenario - The database contains many linear combinations among both confidential and non-confidential numerical attributes. The combinations here can be either known or hidden. Many organizational databases typically contain numerous attributes that could lend themselves to potentially thousands of linear combinations. In this case, the level of security provided for linear combinations of confidential attributes could be very low even if the level of security provided for a single confidential attribute is adequate.
Value disclosure represents the situation where snoopers are able to estimate or infer the value of a certain confidential numerical attribute of an entity or a group of entities with a level of accuracy greater than a pre-specified level. In our scenario, all numerical attribute values are modeled by multivariate normal distributions. The multivariate normal distribution itself is not considered confidential information; only the parameters µ, Σ may be considered confidential. The first issue is how to check whether a given set of µ, Σ, which are used for data generation, provides adequate security for the confidential numerical attributes of an entity or a group of entities. The second issue is how to modify µ, Σ when they violate privacy and confidentiality requirements.
5.1 The General Location Model Revisited
Let C = {C1, C2, ..., Cq} denote a set of categorical attributes and Z = {Z1, Z2, ..., Zp} a set of numerical ones in a table with n entries. Suppose Cj takes possible domain values 1, 2, ..., dj; the categorical data C can then be summarized by a contingency table with total number of cells D = \prod_{j=1}^{q} d_j. Let y = {y_d : d = 1, 2, ..., D} denote the number of entries in each cell. Clearly \sum_{d=1}^{D} y_d = n. The general location model [Schafer 1997] is defined in terms of the marginal distribution of C and the conditional distribution of Z given C. The former is described by a multinomial distribution on the cell counts y,

y | π ∼ M(n, π) = \frac{n!}{y_1! \cdots y_D!} \pi_1^{y_1} \cdots \pi_D^{y_D}

where π = {π_d : d = 1, 2, ..., D} is an array of cell probabilities corresponding to y_d. For each cell C_d, d = 1, 2, ..., D, defined by the categorical attributes C, the numerical attributes Z are then modeled as conditionally multivariate normal:

f(z | C_d) = \frac{1}{(2\pi)^{p/2} |\Sigma_d|^{1/2}} e^{-\frac{1}{2}(z - \mu_d)^T \Sigma_d^{-1} (z - \mu_d)}

where the p-dimensional vector µ_d represents the expected value of the random vector z = (z_1, z_2, ..., z_p)^T for cell C_d, and the p × p matrix Σ_d is its variance-covariance matrix. The parameters of the general location model can be written as θ_d = (π_d, µ_d, Σ_d), d = 1, 2, ..., D.
The maximum likelihood estimates of θ are as follows:

\hat{\pi}_d = \frac{y_d}{n}, \qquad \hat{\mu}_d^T = y_d^{-1} \sum_{i=1}^{y_d} z_i^T, \qquad \hat{\Sigma}_d = y_d^{-1} \sum_{i=1}^{y_d} (z_i - \hat{\mu}_d)(z_i - \hat{\mu}_d)^T \qquad (5.1)

where the sums run over the y_d tuples z_i in cell C_d.
Here we would emphasize that it is feasible to model various data using the general location model even though a group of data may follow other distributions (e.g., Zipf, Poisson, Gamma) in practice [Schafer 1997]. Since we define a multivariate normal distribution for data at the finest level, data at higher levels can be taken as a mixture of multivariate normal distributions; hence we can theoretically use a mixture of multivariate normal distributions to model other distributions.
It is straightforward to see that we can easily generate a dataset when the parameters of the general location model are given. Generally, this involves two steps. First, we estimate the number of tuples in each cell C_d and generate y_d tuples; all y_d tuples from this cell have the same categorical attribute values, inherited from the cell location of the contingency table. Second, we apply the multivariate normal distribution with the cell's mean vector and covariance matrix to generate numerical attribute values for the tuples in that cell.
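The two-step generation can be sketched as follows (NumPy; the cell probabilities, means, and covariances below are hypothetical values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameters: D = 2 cells, p = 2 numerical attributes.
pi_cells = np.array([0.3, 0.7])                       # cell probabilities
mu = [np.array([0.0, 1.0]), np.array([5.0, -1.0])]    # per-cell means
sigma = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]

def generate(n):
    """Two-step generation from the general location model:
    (1) draw cell counts y ~ M(n, pi);
    (2) draw the numerical part of each cell from N(mu_d, Sigma_d)."""
    y = rng.multinomial(n, pi_cells)
    rows = []
    for d, count in enumerate(y):
        z = rng.multivariate_normal(mu[d], sigma[d], size=count)
        cell = np.full((count, 1), float(d))          # categorical part: cell index
        rows.append(np.hstack([cell, z]))
    return np.vstack(rows)

data = generate(10000)
```

The synthetic table reproduces both the cell proportions and the per-cell means of the fitted model, which is what a database-application-testing release relies on.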
5.2 Disclosure Controls For Numerical Data
Value disclosure represents the situation where snoopers are able to estimate or infer the value of a certain confidential numerical attribute of an entity or a group of entities with a level of accuracy greater than a pre-specified level. Here an entity or a group of entities can be characterized by the cell they are located in.
In our context, all numerical attribute values are generated from multivariate normal distributions. As discussed before, the multivariate normal distribution itself is not considered confidential information; only the parameters µd, Σd which are used for data modeling may contain confidential information. The first issue is how to check whether a given set of µd, Σd, which are used for data generation, provides adequate security for the confidential numerical attributes of an entity or a group of entities. The second issue is how to modify µd, Σd when they violate privacy and confidentiality requirements.
5.2.1 Basic Disclosure Scenario
From Result 1 in 3.6.2, we know the ellipsoid {z : (z − µ)^T Σ^{-1} (z − µ) ≤ χ²_p(α)}, traced by the paths of z values, contains a fixed percentage, (1 − α)100%, of customers. In our scenario, snoopers may use various techniques to estimate and predict the confidential values of individual customers; however, all the confidential information snoopers can learn is bounded by this ellipsoid.
Assume E is the ellipsoid from the original data z at a given confidence level 1 − α. From the perturbed data ẑ, snoopers can derive the ellipsoid Ê. Equation 5.2 defines the measure of disclosure of z when ẑ is given:

D(z | ẑ) = vol(E ∩ Ê) / vol(E ∪ Ê) \qquad (5.2)

Here compromise is said to occur when D(z | ẑ) is greater than a threshold τ specified by the database owner. The greater D(z | ẑ), the closer the estimates are to the true distribution, and the higher the chance of disclosure. In other words, if the ellipsoid learned by snoopers is close enough to that specified by the database owner, we say partial disclosure occurs.
To compute the volume of a density contour, we have the result shown in Proposition 3. Note that if our interest is in just a few confidential attributes (say s attributes), we can easily project the ellipsoid from the original p-dimensional space to the lower s-dimensional space by replacing z, µ, and Σ in Proposition 3 with z_s, µ_s, and Σ_s respectively.
Proposition 3 (Volume of density contour) The volume of an ellipsoid {z : (z − µ)^T Σ^{-1} (z − µ) ≤ χ²_p(α)} is given by

vol(E) = η (√χ²_p(α))^p |Σ|^{1/2}

or

vol(E) = η (√χ²_p(α))^p \prod_{i=1}^{p} √λ_i

where η is the volume of the unit ball in R^p, and λ_i is the i-th eigenvalue of the matrix Σ.

Proof. From Result 2 in 3.6.2, we know the volume of an ellipsoid {z : (z − µ)^T A^{-1} (z − µ) ≤ 1} is given by vol(E) = η |A|^{1/2}. We replace A with χ²_p(α) Σ, and then we get

vol(E) = η (√χ²_p(α))^p |Σ|^{1/2}

From the spectral decomposition of Σ, as shown in Equation 5.3,

Σ = λ_1 e_1 e_1^T + · · · + λ_p e_p e_p^T = P Λ P^T \qquad (5.3)

we get |Σ| = |P Λ P^T| = |P P^T| |Λ|. As P P^T = I, we have |Σ| = |Λ|. Since |Σ|^{1/2} = |Λ|^{1/2} = \prod_{i=1}^{p} √λ_i, we have

vol(E) = η (√χ²_p(α))^p \prod_{i=1}^{p} √λ_i
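The two expressions in Proposition 3 can be checked numerically (NumPy plus the standard library; the bivariate covariance matrix is the baseline one used in Figure 5.3, and for two dimensions the chi-square quantile is exactly χ²₂(α) = −2 ln α):

```python
import numpy as np
from math import gamma, log, pi, sqrt

def ellipsoid_volume(Sigma, chi2_alpha):
    """Volume of {z : (z-mu)^T Sigma^{-1} (z-mu) <= chi2_alpha},
    computed both ways given in Proposition 3."""
    p = Sigma.shape[0]
    eta = pi ** (p / 2) / gamma(p / 2 + 1)   # volume of the unit ball in R^p
    c_p = chi2_alpha ** (p / 2)              # (sqrt(chi2_p(alpha)))^p
    v_det = eta * c_p * sqrt(np.linalg.det(Sigma))
    v_eig = eta * c_p * np.prod(np.sqrt(np.linalg.eigvalsh(Sigma)))
    return v_det, v_eig

# Bivariate example: chi2_2(0.05) = -2 ln(0.05), roughly 5.99.
Sigma = np.array([[20.0, 12.0], [12.0, 30.0]])
v_det, v_eig = ellipsoid_volume(Sigma, chi2_alpha=-2.0 * log(0.05))
```

Both forms agree because the determinant equals the product of the eigenvalues.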
[Figure 5.1: A constant density contour for a bi-variate normal distribution, centered at (µ1, µ2) in the (z1, z2) plane with semi-axes of length c√λ1 and c√λ2.]
Example 8 Figure 5.1 shows one constant density contour containing 95% of the probability under the ellipse surface for a bi-variate z = (z1, z2)^T, which follows a bi-variate normal distribution N(µ, Σ) with µ = (µ1, µ2)^T and Σ = [[σ11, σ12], [σ21, σ22]]. λ = (λ1, λ2)^T are the eigenvalues of the covariance matrix Σ, and the two axes have lengths c√λ1 and c√λ2 respectively, where c = √χ²_2(0.05) = √5.99 = 2.45. We can see the major axis of the ellipse is associated with the largest eigenvalue (λ1). The size of this ellipse is proportional to 5.99 √(λ1 λ2), as χ²_2(0.05) = 5.99.

[Figure 5.2: Confidence Intervals. (a) Roy's confidence intervals; (b) Bonferroni confidence intervals.]
However, to evaluate the measure of disclosure D(z | ẑ) shown in Equation 5.2, we need to compute the volume of the intersection (or union) of two ellipsoids. This problem is known to be NP-hard, and some approximation techniques are surveyed in [Henrion, Tarbouriech and Arzelier 2001]. One heuristic we apply here is to use a hyper-rectangle to approximate the ellipsoid, since computing the intersection (or union) of two hyper-rectangles in high-dimensional space is straightforward. Figure 5.2(a) shows Roy's rectangle, formed by the projection of the ellipse on z1 and z2, while Figure 5.2(b) shows a Bonferroni rectangle [Johnson and Wichern 1998], formed by simultaneously testing the hypotheses about z1 and z2 with an overall conservative significance level (α/2). We are conducting a comparison between our method and other approximation techniques.
In many applications, the database owner usually specifies a confidential range [z_l, z_u] (z is a confidential numerical attribute) for an entity or a group of entities. In this case we use the projection of the ellipse on each axis to check whether disclosure occurs. Similarly, snoopers can learn a confidence interval [ẑ_l, ẑ_u] for each numerical attribute by projecting the ellipse on each axis. If the confidence interval [ẑ_l, ẑ_u] derived by snoopers is close to the confidential range [z_l, z_u] specified by the database owner, we say value disclosure occurs.
d(z | ẑ) = |[z_l, z_u] ∩ [ẑ_l, ẑ_u]| / |[z_l, z_u] ∪ [ẑ_l, ẑ_u]| \qquad (5.4)
Like the measure defined in 3.6.2, Equation 5.4 defines the measure of disclosure for one confidential attribute. To compute the projection of an ellipsoid on each axis, we have the result shown in Proposition 4.
Proposition 4 (Simultaneous Confidence Intervals) Let the vector Z be distributed as N_p(µ, Σ) with |Σ| > 0. The projection of the ellipsoid {z : (z − µ)^T Σ^{-1} (z − µ) ≤ χ²_p(α)} on the axis z_i = (0, ..., 1, ..., 0)^T (only the i-th element is 1, all other elements are 0) has bound

[µ_i − √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)]

Proof. From Result 3 in 3.6.2, we know the projection of an ellipsoid {z : z^T A^{-1} z ≤ c²} on a given unit vector ℓ has length len = c √(ℓ^T A ℓ). We replace A with Σ (whose i-th diagonal entry is σ_ii), replace ℓ with z_i = (0, ..., 1, ..., 0)^T, and replace c with √χ²_p(α); then we get the length of the projection as len = √(χ²_p(α) σ_ii). Considering the center of the ellipsoid, we have the bound [µ_i − √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)].
To check whether a given distribution of z may incur value disclosure, our strategy is to compare the disclosure measure D(z|ẑ) or d(z|ẑ) with τ, specified by the database owner. If disclosure occurs, we need to modify the parameters µ, Σ. As we know from Proposition 4, the mean vector µ determines the center of the ellipsoid or the center of the projection interval, while the covariance matrix Σ determines the size of the ellipsoid or the length of the projection interval. As a change of µ would significantly affect the data distribution (and hence the accuracy of subsequent analysis or mining), in the remainder of this section we focus only on how to change the covariance matrix Σ to satisfy users' security requirements.
It is easy to see from Proposition 4 that the confidence interval for each attribute (obtained by projecting on each axis) depends only on µ_i and σ_ii, and is independent of the covariance values σ_ij, i ≠ j. Figure 5.3 illustrates how the shape of the ellipse changes when we vary its covariance matrix Σ, using a bi-variate normal distribution example. For example, by varying σ12 while fixing σ11 and σ22, as shown in Figure 5.3(a), the axes of the ellipse rotate and the ratio between the two axes changes; however, the projection of the ellipse on axes z1, z2 does not change, as it depends only on σ11 and σ22 respectively. Figures 5.3(b) and 5.3(c) illustrate that the projection of the ellipse changes only when the corresponding variance (σ11 or σ22) is changed.
From the bound [µ_i − √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)], we can adjust σ_ii to satisfy the given privacy requirement of a confidential attribute Z_i: [z_l, z_u] ⊆ [µ_i − √(χ²_p(α) σ_ii), µ_i + √(χ²_p(α) σ_ii)]. Since we keep the mean values unchanged, we have

χ²_p(α) σ_ii ≥ ((z_u − z_l) / 2)²,  i.e.,  σ_ii ≥ (z_u − z_l)² / (4 χ²_p(α)) \qquad (5.5)
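Equation 5.5 can be applied directly (a small sketch; the confidential range [50.4, 89.6] matches the wage example from the chapter introduction, and χ²₁(0.05) ≈ 3.84 is assumed for a single attribute):

```python
import numpy as np

def min_variance(z_l, z_u, chi2_alpha):
    """Smallest sigma_ii allowed by Equation 5.5, so that the interval
    mu_i +/- sqrt(chi2_alpha * sigma_ii) covers [z_l, z_u] when mu_i
    is the midpoint of the confidential range."""
    return (z_u - z_l) ** 2 / (4.0 * chi2_alpha)

# Wage example: confidential range [50.4, 89.6] (in thousands),
# chi2_1(0.05) taken as 3.84 (about 1.96 squared).
s = min_variance(50.4, 89.6, 3.84)
half = np.sqrt(3.84 * s)   # half-width of the resulting interval
```

By construction the resulting half-width equals (z_u − z_l)/2 = 19.6, and s comes out near 100, i.e. a standard deviation of about 10k: exactly the 70k ± 1.96 · 10k interval of the introductory example.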
Discussion. It is clear that the study of a few confidence intervals is no substitute for the full confidence region. However, such a confidence region can be visualized only in two or three dimensions; thus, for higher dimensions we may have to be content with confidence intervals. In this chapter, we conduct disclosure analysis by comparing the best confidence interval (or region) derived by snoopers with the confidence interval (or region) specified by the database owner at the same confidence level (e.g., 95%). The previous randomization-based approaches check whether the probability of a confidential attribute z ∈ [z_l, z_u] exceeds a pre-defined confidence threshold (e.g., 95%); if so, snoopers can confidently predict the confidential value z within the confidential range, which incurs value disclosure. We can see these two strategies are equivalent.
5.2.2 Conditional Scenario
Consider a database with k numerical, confidential attributes X = (X1 , X2 , · · · , Xk )T
and l non-confidential attributes S = (S1 , S2 , · · · , Sl )T where p = k + l. Security is measured by the degree to which a snooper can determine the values of confidential attributes
in a specific record through the use of relationships between the non-confidential and confidential attributes. One question we ask here is how much information is contained in
non-confidential attributes and how it affects the variability of confidential numerical
attributes.
Proposition 5 (Conditional normal distribution) ([Johnson and Wichern 1998]) Let Z = (X^T, S^T)^T be distributed as N_p(µ, Σ) with

µ = (µ_X, µ_S)^T,  Σ = [[Σ_XX, Σ_XS], [Σ_SX, Σ_SS]]

and |Σ_SS| > 0. Then the conditional distribution of X given S = s is normal with mean µ_X + Σ_XS Σ_SS^{-1} (s − µ_S) and covariance Σ_XX − Σ_XS Σ_SS^{-1} Σ_SX.

Proposition 5 shows the conditional distribution of X given S is also a multivariate normal distribution. Furthermore, the conditional covariance Σ_XX − Σ_XS Σ_SS^{-1} Σ_SX does not depend upon the values of the conditioning variables. Hence we can simply apply the results of Propositions 3 and 4, replacing Σ with the new conditional covariance Σ_XX − Σ_XS Σ_SS^{-1} Σ_SX, to conduct conditional value disclosure analysis.
Let A = Σ_XX − Σ_XS Σ_SS^{-1} Σ_SX. Using the same strategy (5.5) proposed in 5.2.1, we adjust the diagonal entries (variances) of A to satisfy the given privacy requirement. The adjusted matrix is denoted as Ã. The adjusted covariance matrix for the confidential attributes is then naturally derived:

Σ̃_XX = Σ_XS Σ_SS^{-1} Σ_SX + Ã \qquad (5.6)

We will discuss the strategy to adjust the covariance matrix Σ_XS or Σ_SX in Section 5.2.3.
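A sketch of this conditional-covariance adjustment (NumPy; all covariance values and the variance floor of 2.9 are hypothetical, with the floor standing in for the Equation 5.5-style per-attribute requirement):

```python
import numpy as np

# Hypothetical covariance blocks for 2 confidential (X) and 2
# non-confidential (S) attributes.
Sxx = np.array([[4.0, 1.0], [1.0, 3.0]])
Sxs = np.array([[0.8, 0.3], [0.1, 0.6]])
Sss = np.array([[2.0, 0.3], [0.3, 1.5]])

# Proposition 5: covariance of X given S = s.
B = Sxs @ np.linalg.inv(Sss) @ Sxs.T
A = Sxx - B

# Adjust the diagonal of A to meet a privacy requirement (here a
# hypothetical floor of 2.9 on each conditional variance), then
# rebuild the covariance of X as in Equation 5.6.
A_tilde = A.copy()
np.fill_diagonal(A_tilde, np.maximum(np.diag(A), 2.9))
Sxx_tilde = B + A_tilde
```

Only the attributes whose conditional variance falls below the floor get inflated; the others, and all cross-covariances with S, are left untouched.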
In general, given a confidential variable x whose variance is σ_x², the largest eigenvalue λ of Σ_XX^{-1} Σ_XS Σ_SS^{-1} Σ_SX gives the proportion of the variance (fluctuation) of x that is predictable from the non-confidential attributes S. The eigenvalue is a measure of how well the non-confidential attributes can predict the confidential attribute x; e.g., λ = 0.852
[Figure 5.3: Density contour with varied covariance matrix, baseline Σ = [20 12; 12 30]. (a) varying σ12 ∈ {2, 12, 20}; (b) varying σ11 ∈ {10, 20, 30}; (c) varying σ22 ∈ {16, 30, 40}.]
means that 85% of the total variation in x can be explained by the linear relationship between S and x; the other 15% of the total variation in x remains unexplained. Hence, a rough estimate of the smallest standard error can be determined as [(1 − λ) σ_x²]^{1/2}. Based on this estimate of the standard error, a rough 95% confidence interval for x is given as

µ̂_x ± 1.96 [(1 − λ) σ_x²]^{1/2}
5.2.3 Combination Scenario
Many organizational databases typically contain numerous attributes that could lend themselves to potentially thousands of linear combinations. In this case, the threat of combination disclosure can be magnified further. For example, the prediction of the linear combination Total Income = Wages + Interests + Dividends is likely to have a higher level of accuracy than that of each individual attribute.
The approach we apply here is based on Canonical Correlation Analysis (CCA) [Johnson and Wichern 1998]. CCA can be used to measure the maximum proportion of the variance that can be explained in any linear combination of the confidential attributes X using a linear combination of the known non-confidential attributes S. The main task of CCA is to summarize the associations between the X and S sets in terms of a few carefully chosen covariances rather than the k × l covariances in Σ_XS. We denote the respective linear combinations by u = a^T x and v = b^T s. The correlation between u and v is given by

Corr(u, v) = a^T Σ_XS b / [(a^T Σ_XX a)(b^T Σ_SS b)]^{1/2}
Out of the infinite number of linear combinations, we find the set of linear combinations which maximizes the correlation Corr(u, v). The canonical variate pair u_i = e_i^T Σ_XX^{-1/2} x and v_i = f_i^T Σ_SS^{-1/2} s maximizes Corr(u_i, v_i) = √λ_i, where i = 1, ..., l. Here λ_1 ≥ · · · ≥ λ_l are the eigenvalues of Σ_SS^{-1/2} Σ_SX Σ_XX^{-1} Σ_XS Σ_SS^{-1/2}, and e_1, ..., e_l are the associated normalized eigenvectors, as shown in Equation 5.7:

A = Σ_SS^{-1/2} Σ_SX Σ_XX^{-1} Σ_XS Σ_SS^{-1/2} = λ_1 e_1 e_1^T + · · · + λ_l e_l e_l^T \qquad (5.7)
The largest eigenvalue λ_1 is the squared canonical correlation coefficient, which represents the most general measure of inferential value disclosure for any combination. In other words, 1 − λ_1 represents the worst-case security. When some λ_i is greater than a threshold λ* specified by the database owner, some combination disclosure exists for a potential combination of confidential attributes. In this case, we need to change the parameters Σ_SX or Σ_XX so that all new eigenvalues are less than or equal to the threshold λ*.
Our approach is to set those eigenvalues λ_i > λ* to λ* (hence no combination disclosure exists) and keep the other eigenvalues (λ_i ≤ λ*) and all eigenvectors unchanged. We get a new matrix Ã after applying the inverse of the spectral decomposition shown in Equation 5.7. The derived matrix Ã is guaranteed to satisfy users' security requirement for all possible combinations. Furthermore, as we keep all eigenvectors and the other eigenvalues (λ_i ≤ λ*) unchanged, the density contour of the modified distribution will be closest to that of the original one.
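The eigenvalue-clipping step can be sketched as follows (NumPy; the covariance blocks and the threshold λ* = 0.02 are hypothetical values chosen for illustration):

```python
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Hypothetical covariance blocks (X confidential, S non-confidential).
Sxx = np.array([[4.0, 1.0], [1.0, 3.0]])
Sxs = np.array([[0.8, 0.3], [0.1, 0.6]])
Sss = np.array([[2.0, 0.3], [0.3, 1.5]])

# A = Sss^{-1/2} Ssx Sxx^{-1} Sxs Sss^{-1/2} (Equation 5.7); its
# eigenvalues are the squared canonical correlations.
W = inv_sqrt(Sss)
A = W @ Sxs.T @ np.linalg.inv(Sxx) @ Sxs @ W
lam, E = np.linalg.eigh(A)

# Clip eigenvalues above the threshold lambda* and rebuild A~ by the
# inverse spectral decomposition; all eigenvectors are kept unchanged.
lam_star = 0.02
A_tilde = E @ np.diag(np.minimum(lam, lam_star)) @ E.T
```

By construction no eigenvalue of Ã exceeds λ*, so every linear combination of confidential attributes meets the worst-case security level 1 − λ*.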
From Equation 5.7, we know that Ã, which will satisfy users' security requirements, is determined by Σ_XX and Σ_XS. So we can adjust either Σ_XX or Σ_XS to achieve Ã. Note that Σ_SS should be kept unchanged, as we assume the data of non-confidential attributes are not perturbed.
To adjust Σ_XX, we simply set Σ̃_XX as

Σ̃_XX = Σ_XS Σ_SS^{-1/2} Ã^{-1} Σ_SS^{-1/2} Σ_SX \qquad (5.8)

However, there is no direct method to adjust Σ_XS. From Σ_SS^{-1/2} Σ̃_SX Σ_XX^{-1} Σ̃_XS Σ_SS^{-1/2} = Ã, we expand the left side of the equation and get an l × l matrix. Each element of this matrix is a quadratic function f_ij(x_11, ..., x_lk) which equals the corresponding ã_ij. We thus get l × l quadratic equations in l × k variables, and the problem becomes the following optimization problem:

Problem 1 Minimize F(x_11, ..., x_lk) = Σ_{i=1}^{l} Σ_{j=1}^{l} (f_ij(x_11, ..., x_lk) − ã_ij)², subject to x_ij ≥ 0.
5.3 Summary
Various disclosure scenarios may exist in general databases. In order to preserve the privacy of individual users, a general location model was built for database application testing. The numerical data in general databases are modeled with different multivariate normal distributions. In this chapter, we focused on disclosure control for the numerical data in this model. We presented how to satisfy users' privacy requirements by adjusting the parameters of the learned model. The discussion covered three different scenarios: the basic scenario, the conditional scenario, and the combination scenario. In the basic scenario, disclosure is controlled by considering the confidential attributes alone. In the conditional scenario, non-confidential attributes are taken into account, and the conditional distributions of confidential attributes are adjusted to satisfy privacy concerns. In the combination scenario, potential combination disclosure is controlled using canonical correlation analysis.
CHAPTER 6: CONCLUSIONS AND FUTURE WORK
6.1 Summary
Driven by one of the major policy issues of the information era, the right to privacy, Privacy-Preserving Data Mining (PPDM) has become one of the newest trends in privacy and security research. Great interest has come from both academia and industry: a) the recent proliferation of PPDM techniques is evident; b) the interest from academia and industry has grown quickly; c) separate workshops and conferences devoted to this topic have emerged in the last few years.
Privacy issues have posed new challenges for novel uses of data mining technology. Instead of releasing the original data directly for analysis, a complex process shall be applied to protect the sensitive data before sharing them for mining. One of the primary tools for such a data protection process is randomization. We expect that privacy and data utility can be well balanced in this context. In this study, we addressed the problem of balancing privacy and data utility by analyzing different perturbation models.
In the additive-noise-based perturbation model, the spectral-filtering-based technique has recently been investigated as a major means of point-wise data reconstruction [Huang, Du and Chen 2005; Kargupta et al. 2003]. It was empirically shown that under certain conditions this technique may be exploited by attackers to breach the privacy protection offered by randomization-based privacy preserving data mining methods. We presented a theoretical study evaluating privacy breaches when the spectral-filtering-based technique is applied. We gave an explicit upper bound on reconstruction accuracy in terms of the Frobenius norm. Attackers may exploit this upper bound to determine how close their estimates are to the original data using the spectral-filtering-based technique,
which imposes a serious threat of privacy breaches. We also derived an explicit lower bound on reconstruction accuracy in terms of the Frobenius norm. This lower bound can help users determine how much and what kind of noise should be added for a given tolerated privacy-breach threshold.
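The reconstruction step that these bounds concern can be sketched as follows. This is a simplified illustration of the spectral-filtering idea of [Kargupta et al. 2003], not the exact procedure analyzed in the dissertation; the centering and the eigenvalue threshold used here are assumptions of the sketch:

```python
import numpy as np

def spectral_filter(Y, noise_var):
    """Point-wise reconstruction of X from Y = X + V (V i.i.d. noise).

    Eigen-decompose the covariance of the perturbed data and keep only
    the components whose eigenvalues exceed the noise variance; the
    remaining subspace is dominated by noise and is discarded.
    """
    mean = Y.mean(axis=0)
    Yc = Y - mean
    cov = np.cov(Yc, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    keep = vecs[:, vals > noise_var]          # estimated signal subspace
    return Yc @ keep @ keep.T + mean          # project onto that subspace

rng = np.random.default_rng(1)
# low-rank "signal" plus additive Gaussian noise
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))
noise = rng.normal(scale=0.3, size=X.shape)
Xhat = spectral_filter(X + noise, noise_var=0.3 ** 2)

err_filtered = np.linalg.norm(Xhat - X)   # reconstruction error (Frobenius)
err_raw = np.linalg.norm(noise)           # error of the raw perturbed data
```

Because most of the noise energy lies in the discarded subspace, the filtered estimate is typically closer to the original data than the perturbed data itself, which is exactly the privacy risk that the upper bound quantifies.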
In the projective-transformation-based perturbation model, isometric projection was proven to be invariant for many popular classifiers. However, it was suggested in [Liu, Giannella and Kargupta 2006] that the non-isometric projection approach is effective in preserving privacy, since it is resilient to the PCA attack designed for the distance-preserving projection approach. We proposed the AK-ICA attack, which attackers can exploit to breach privacy from non-isometrically transformed data. Our theoretical analysis and empirical evaluations have shown that the proposed attack poses a threat to all projection-based privacy preserving methods when a small sample data set is available to attackers. We argue this is a real concern that needs to be addressed in practice.
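AK-ICA itself combines ICA with attacker knowledge. As a much simplified illustration of why even a small known sample is dangerous (this is not the AK-ICA algorithm), the sketch below additionally assumes the attacker knows which released rows correspond to the known sample, and recovers an arbitrary invertible linear map by least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))      # original (secret) data
R = rng.normal(size=(4, 4))        # arbitrary, possibly non-isometric map
Y = X @ R                          # released, transformed data

# Attacker knows a small sample of original rows and their positions
# (a simplifying assumption of this sketch).
known = slice(0, 10)
# Estimate the transformation from the known rows, then invert it.
R_hat, *_ = np.linalg.lstsq(X[known], Y[known], rcond=None)
X_hat = Y @ np.linalg.inv(R_hat)

recon_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
```

Once the map is pinned down by a handful of rows, the entire data set is recovered, which is why projection-based perturbation needs careful scrutiny when sample data may leak.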
Considering all the potential threats to microdata, a model-based approach was designed. Instead of controlling privacy based on the actual data, the model-based approach fits the original data into a carefully designed model and controls privacy in the parameter space of that model. We focused on the numerical part of our model and provided disclosure control schemes for three scenarios: the basic scenario, the conditional scenario, and the combination scenario.
6.2
Contributions
We now summarize the main contributions achieved in this research. As part of a novel framework for privacy preserving data mining, our contributions are:
1. Bound analysis of the spectral filtering technique. In particular, we first derived
one upper bound for the Frobenius norm of reconstruction error using the matrix
perturbation theory. This upper bound may be exploited by attackers to determine
how close their estimates are to the original data using spectral filtering based
techniques, which imposes a serious threat of privacy breaches. We also derived
a lower bound for the reconstruction error, which can help data owners determine
how much noise should be added to satisfy a given threshold of tolerated privacy breach. In addition, we gave an improved data reconstruction strategy for noise filtering. In the context of additive-noise-based perturbation, our proposed strategy compares the benefit of including one component against the loss due to the additional projected noise. We showed that such a strategy is expected to give an approximately optimal reconstruction from the perturbed data.
2. An effective attacking method to break general projective-transformation-based
perturbation. By combining a known small subset of the original data, which is
reasonable in practice, our algorithm, AK-ICA, can effectively estimate the whole
original data set with high accuracy. Notable properties of this attack include its robustness to arbitrary projective transformations. All previous perturbation methods in this context are vulnerable to our attack. Therefore, current projection-based privacy preserving data mining techniques may need careful scrutiny to prevent privacy breaches when a subset of sample data is available.
3. A measure for the disclosure of individual privacy. We proposed a way to measure how close the IQR obtained by attackers or snoopers is to an individual’s privacy interval for a particular sensitive variable. We also extended this measure to the multivariate case based on the confidential region.
4. Disclosure control methods for various scenarios in model-based privacy preserving data mining. General databases typically contain numerous attributes with different privacy concerns. To satisfy the different privacy requirements of data providers, we analyzed potential privacy disclosures in several scenarios and found ways to adjust the parameters of the model learned from the underlying data.
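The IQR-based measure of item 3 can be illustrated with a toy sketch. The overlap-ratio formula below is an illustrative stand-in, not the exact definition used in the dissertation, and all names are ours:

```python
import numpy as np

def interval_overlap(iqr, privacy_interval):
    """Fraction of the privacy interval covered by the attacker's IQR.

    A value near 1 means the interquartile range of the attacker's
    estimates tightly brackets the range the individual wanted kept
    private; a value near 0 means the estimates reveal little.
    """
    lo = max(iqr[0], privacy_interval[0])
    hi = min(iqr[1], privacy_interval[1])
    width = privacy_interval[1] - privacy_interval[0]
    return max(0.0, hi - lo) / width

# Attacker's point estimates of one individual's sensitive value
estimates = np.array([48.0, 50.5, 51.0, 52.5, 49.5, 53.0, 50.0])
q1, q3 = np.percentile(estimates, [25, 75])
ratio = interval_overlap((q1, q3), (49.0, 52.0))   # user's privacy interval
```

The multivariate extension mentioned above would replace the interval overlap with the overlap of a confidential region.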
6.3
Future Research
Several directions can be pursued as a continuation of this research. In this section we discuss extensions we plan to pursue and technical challenges we would like to address.
1. To explore potential attacks on additive-noise-based randomization. The spectral filtering technique was proven to be an effective tool for estimating original sensitive values from the perturbed data. However, the noise is hard to filter out when it is correlated with the original data or when the signal-to-noise ratio is low. Improving existing data reconstruction algorithms for the additive-noise-based model, and exploring other possible attacks by combining various techniques (e.g., statistical approaches, signal processing), will be an attractive direction for future research.
2. To explore how the properties of a known sample affect estimation accuracy for randomization. In the randomization approach, the original data is perturbed in various ways and only the perturbed data is provided for analysis. If a small set of sensitive data is available in practice, such a known sample can be combined with the perturbed data to threaten privacy. The known sample could be some columns (insensitive attributes), some rows (individual records), or, more generally, some cell values. Exploring how the known sample affects privacy will require more research effort, combining knowledge from several domains, including randomization and approximation in linear algebra, multivariate statistics, etc.
3. To investigate end-user-oriented privacy preserving data mining. Each end user may have different privacy concerns when sharing data. For example, Alice may only allow her salary to be perturbed within an acceptable range, or her zip code to be transformed to one in another state. Previous randomization models perturb the original data set as a whole, without incorporating individual privacy concerns. Due to the complex context and the dependence between the original values and the noise (or transformation), it is challenging to build a good data mining model to extract interesting patterns from the observed perturbed data.
4. To investigate the effect of randomization on the utility of mining tasks. An essential goal of privacy preserving data mining is to balance the preserved privacy and the utility of the data. Adding more noise may make the data more secure; however, it may also sacrifice its worth to the miner. In our future work, we will emphasize data utility by investigating different mining tasks based on the distribution learned from the perturbed data.
5. To apply the randomized response (RR) technique to numerical data in privacy preserving data mining. Randomized response is considered an efficient tool for protecting privacy. Early RR models were designed for categorical data, which can be naturally partitioned into mutually exclusive and exhaustive classes. We also noticed that several other models have been proposed as extensions to numerical data, with corresponding statistical analysis [Poole 1974; Duffy and Waterton 1984; Poole and Clayton 1982]. Since numerical data is our focus in this study, in future research we would like to address mining issues in this context. In particular, we will investigate the accuracy of mining tasks performed on the scrambled responses.
6. To enhance privacy in existing systems by applying multiple security and privacy-preserving techniques, e.g., randomization, cryptography, secure multiparty computation, and access control.
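For future item 5, the classical starting point is Warner's RR model for a binary attribute [Warner 1965]; the following sketch (function and variable names are ours) shows the standard unbiased estimator that the numerical extensions generalize:

```python
import numpy as np

def warner_estimate(responses, p):
    """Unbiased estimate of the true proportion under Warner's RR model.

    Each respondent answers the sensitive question truthfully with
    probability p and answers its negation with probability 1 - p.
    If lam is the observed fraction of "yes" responses, then
    pi_hat = (lam - (1 - p)) / (2p - 1), provided p != 0.5.
    """
    lam = np.mean(responses)
    return (lam - (1.0 - p)) / (2.0 * p - 1.0)

rng = np.random.default_rng(3)
true_pi, p, n = 0.3, 0.7, 100_000
truth = rng.random(n) < true_pi          # hidden sensitive bit
honest = rng.random(n) < p               # answer truthfully w.p. p
responses = np.where(honest, truth, ~truth)
pi_hat = warner_estimate(responses, p)   # close to true_pi for large n
```

The individual responses reveal little about any one respondent, yet the aggregate proportion remains recoverable; the open question for us is how well full mining tasks, not just proportions, survive such scrambling of numerical data.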
REFERENCES
Adam, N. and Wortmann, J. 1989 Security-control methods for statistical databases. ACM
Computing Surveys, 21, Nr. 4, 515–556
Aggarwal, C. and Yu, P. 2004 A condensation approach to privacy preserving data
mining. In Proceedings of International Conference on Extending Database Technology.
Springer Berlin / Heidelberg, 183–199
Agrawal, D. and Aggarwal, C. 2001 On the design and quantification of privacy preserving
data mining algorithms. In Proceedings of the 20th Symposium on Principles of Database
Systems.
Agrawal, R. and Srikant, R. 2000 Privacy-preserving data mining. In Proceedings of the
ACM SIGMOD International Conference on Management of Data. Dallas, Texas, 439–
450
Ashley, P. et al. 2002 E-P3P privacy policies and privacy authorization. In Proceedings
of the 2002 ACM workshop on Privacy in the Electronic Society. New York, NY, USA:
ACM Press, ISBN 1–58113–633–1, 103–109
Backes, M., Pfitzmann, B. and Schunter, M. 2003 A toolkit for managing enterprise privacy policies. In 8th European Symposium on Research in Computer Security (ESORICS
2003)., 162–180
Benenson, Z., Freiling, F. and Kesdogan, D. 2005 Secure Multi-Party Computation with
Security Modules. In Proceedings of SICHERHEIT 2005., 41–52
Cardoso, J. 1999 High-order contrasts for independent component analysis. Neural Computation, 11, Nr. 1, 157–192
Chang, L. and Moskowitz, I. S. 2000 An Integrated Framework for Database Privacy
Protection. In Proceedings of the fourteenth Annual IFIP WG 11.3 Working Conference
on Database Security., 161–172
Chaudhuri, A. and Mukerjee, R. 1988 Randomized Response: Theory and Techniques.
Marcel Dekker, Inc
Chen, K. and Liu, L. 2005 Privacy preserving data classification with rotation perturbation. In Proceedings of the 5th IEEE International Conference on Data Mining. Houston, TX
Chen, K. and Liu, L. 2007 Towards Attack-Resilient Geometric Data Perturbation. In Proceedings of the 7th Society for Industrial and Applied Mathematics (SIAM) International Conference on Data Mining. Minneapolis, Minnesota
Clifton, C. et al. 2003 Tools for Privacy Preserving Distributed Data Mining. ACM
SIGKDD Explorations Newsletter, 4, Nr. 2, 28–34
Commission, E. 1998a Directive 95/46/EC on the protection of individuals with regard
to the processing of personal data and on the free movement of such data. URL: http://ec.europa.eu/justice_home/fsj/privacy/law/index_en.htm
Commission, U. F. T. 1998b Children’s Online Privacy Protection Act. URL: http://www.ftc.gov/ogc/coppa1.htm
Congress, U. 1996 Health Insurance Portability and Accountability Act. URL: http://www.cms.hhs.gov/HIPAAGenInfo/
Congress, U. 1999 Gramm-Leach-Bliley Act. URL: http://www.ftc.gov/privacy/privacyinitiatives/glbact.html
Conover, W. 1998 Practical Nonparametric Statistics. Wiley
Cox, L. 1980 Suppresion Methodology and Statistical Disclosure Control. Journal of the
American Statistical Association, 75, Nr. 370, 377–385
Dalenius, T. and Reiss, S. P. 1982 Data-swapping: A technique for disclosure control.
Journal of Statistical Planning and Inference, 6, 73–85
Dempster, A. P., Laird, N. M. and Rubin, D. B. 1977 Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, Nr. 1,
1–38
Denning, D. E., Schlörer, J. and Wehrle, E. 1982 Memoryless Inference Controls for
Statistical Databases., 38–45
Domingo-Ferrer, J. and Mateo-Sanz, J. M. 2002 Practical Data-Oriented Microaggregation for Statistical Disclosure Control. IEEE Transactions on Knowledge and Data
Engineering, 14, Nr. 1, 189–201, ISSN 1041–4347
Domingo-Ferrer, J. and Torra, V. 2003 On the connections between statistical disclosure
control for microdata and some artificial intelligence tools. Information Sciences – Informatics and Computer Science: An International Journal, 151, 153–170, ISSN 0020–0255
Du, W. and Zhan, Z. 2003 Using randomized response techniques for privacy-preserving
data mining. In Proceedings of the 9th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press New York, NY, USA, 505–510
Du, W., Han, Y. S. and Chen, S. 2004 Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification. In Proceedings of the 4th SIAM International
Conference on Data Mining. ACM Press New York, NY, USA, 222–233
Duffy, J. C. and Waterton, J. J. 1984 Randomized response models for estimating the
distribution function of a quantitative character. International Statistical Review, 52,
Nr. 2, 165–171
Duncan, G. T. and Mukherjee, S. 2000 Optimal Disclosure Limitation Strategy in Statistical Databases: Deterring Tracker Attacks through Additive Noise. Journal of the
American Statistical Association, 95, 720–729
Evans, T., Zayatz, L. and Slanta, J. 1998 Using noise for disclosure limitation of establishment tabular data. Journal of Official Statistics, 14, Nr. 4, 537–551
Evfimievski, A. et al. 2002 Privacy preserving mining of association rules. In Proceedings
of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. Edmonton, Canada, 217–228
Fienberg, S. E. and McIntyre, J. 2003 Data swapping: Variations on a theme by Dalenius and Reiss. National Institute of Statistical Sciences, Research Triangle Park, NC –
Technical report
Fienberg, S. E., Makov, U. E. and Steele, R. J. 1998 Disclosure Limitation Using Perturbation and Related Methods for Categorical Data. Journal of Official Statistics, 14,
Nr. 4, 485–502
Fischer-Hübner, S. 2001 IT-security and privacy: design and use of privacy-enhancing
security mechanisms. Lecture Notes in Computer Science 1958, ISBN 3–540–42142–4
Fisz, M. 1963 Probability theory and mathematical statistics. John Wiley and Sons, Inc.
McLachlan, G. J. and Krishnan, T. 1998 The EM Algorithm and Extensions. The Statistician, 47, Nr. 3, 554–555
Gennaro, R., Rabin, M. O. and Rabin, T. 1998 Simplified VSS and fast-track multiparty
computations with applications to threshold cryptography. In PODC ’98: Proceedings
of the seventeenth annual ACM symposium on Principles of distributed computing. New
York, NY, USA: ACM Press, ISBN 0–89791–977–7, 101–111
Gilburd, B., Schuster, A. and Wolff, R. 2004 A New Privacy Model and Association-Rule Mining Algorithm for Large-Scale Distributed Environments. In Proceedings of the
10th ACM SIGKDD international conference on Knowledge discovery and data mining.
ACM Press New York, NY, USA
Goldreich, O., Micali, S. and Wigderson, A. 1987 How to Play Any Mental Game. In
Proceedings of the 19th Annual ACM Symposium on Theory of Computing., 218–229
Gouweleeuw, J. et al. 1998 Post Randomisation for Statistical Disclosure Control: Theory
and Implementation. Journal of Official Statistics, 14, Nr. 4, 463–478
Grotschel, M., Lovasz, L. and Schrijver, A. 1988 Geometric Algorithms and Combinatorial Optimization. Springer, New York
Guo, L., Guo, S. and Wu, X. 2007 Privacy Preserving Market Basket Data Analysis. In
Proceedings of the 11th European Conference on Principles and Practice of Knowledge
Discovery in Databases. Warsaw, Poland, 103–114
Guo, S. and Wu, X. 2006 On the Use of Spectral Filtering for Privacy Preserving
Data Mining. In Proceedings of the 21st ACM Symposium on Applied Computing. Dijon, France
Guo, S. and Wu, X. 2007 Deriving Private Information from Arbitrarily Projected Data.
In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data
Mining. Nanjing, China
Guo, S., Wu, X. and Li, Y. 2006a Deriving Private Information from Perturbed Data Using IQR based Approach. In 2nd International Workshop on Privacy Data Management.
Atlanta, USA
Guo, S., Wu, X. and Li, Y. 2006b On the lower bound of reconstruction error for spectral
filtering based privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06).
Berlin,Germany
Henrion, D., Tarbouriech, S. and Arzelier, D. 2001 LMI approximations for the radius of
the intersection of ellipsoids: a survey. Journal of Optimization Theory and Applications
108, Nr. 1
Huang, Z., Du, W. and Chen, B. 2005 Deriving private information from randomized
data. In Proceedings of the ACM SIGMOD Conference on Management of Data. Baltimore, MD
Hyvarinen, A., Karhunen, J. and Oja, E. 2001 Independent Component Analysis. John
Wiley & Sons
Johnson, R. and Wichern, D. 1998 Applied Multivariate Statistical Analysis. Prentice
Hall
Kam, J. B. and Ullman, J. D. 1977 A model of statistical databases and their security. ACM
Transactions on Database Systems, 2, Nr. 1, 1–10, ISSN 0362–5915
Kargupta, H. et al. 2003 On the Privacy Preserving Properties of Random Data Perturbation Techniques. In Proceedings of the 3rd International Conference on Data Mining.,
99–106
Karjoth, G., Schunter, M. and Waidner, M. 2002 The platform for enterprise privacy practices - privacy-enabled management of customer data. In Proceedings of the 2nd Workshop
on Privacy Enhancing Technologies (PET 2002). Springer-Verlag New York, Inc., 69–84
Karjoth, G. and Schunter, M. 2002 A Privacy Policy Model for Enterprises. In Proceedings
of the 15th IEEE workshop on Computer Security Foundations. Washington, DC, USA:
IEEE Computer Society, ISBN 0–7695–1689–0, 271
Kozlov, M. K., Tarasov, S. P. and Khachian, L. G. 1979 Polynomial solvability of convex
quadratic programming. Soviet Mathematics Doklady, 20, 1108–1111
LeFevre, K., DeWitt, D. J. and Ramakrishnan, R. 2006 Mondrian Multidimensional K-Anonymity. In Proceedings of the 22nd International Conference on Data Engineering
(ICDE’06). Washington, DC, USA: IEEE Computer Society, ISBN 0–7695–2570–9, 25
Li, N., Li, T. and Venkatasubramanian, S. 2007 t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of the 23rd International Conference on Data
Engineering (ICDE’07). Washington, DC, USA: IEEE Computer Society, ISBN 1–4244–
0803–2, 106–115
Lindell, Y. and Pinkas, B. 2002 Privacy preserving data mining. Journal of Cryptology,
15, Nr. 3, 177–206
Liu, K., Giannella, C. and Kargupta, H. 2006 An attacker’s view of distance preserving
maps for privacy preserving data mining. In Proceedings of the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06).
Berlin,Germany
Liu, K., Kargupta, H. and Ryan, J. 2006 Random projection based multiplicative data
perturbation for privacy preserving distributed data mining. IEEE Transaction on Knowledge and Data Engineering, 18, Nr. 1, 92–106
Machanavajjhala, A. et al. 2006 l-diversity: Privacy beyond k-anonymity. In Proceedings
of the 22nd International Conference on Data Engineering. Atlanta, GA, 24
Malvestuto, F. M., Moscarini, M. and Rafanelli, M. 1991 Suppressing marginal cells to
protect sensitive information in a two-dimensional statistical table (extended abstract).
In PODS ’91: Proceedings of the tenth ACM SIGACT-SIGMOD-SIGART symposium on
Principles of database systems. New York, NY, USA: ACM Press, ISBN 0–89791–430–9,
252–258
Meng, D., Sivakumar, K. and Kargupta, H. 2004 Privacy-Sensitive Bayesian Network
Parameter Learning. In Proceedings of the 4th IEEE International Conference on Data
Mining. IEEE Computer Society, Washington, DC, USA, 487–490
Mirsky, L. 1960 Symmetric gauge functions and unitarily invariant norms. Quarterly Journal of
Mathematics, 11, Nr. 1, 50–59
Mukherjee, S. and Duncan, G. T. 1997 Disclosure Limitation through Additive Noise
Data Masking: Analysis of Skewed Sensitive Data. In HICSS ’97: Proceedings of the
30th Hawaii International Conference on System Sciences. Washington, DC, USA: IEEE
Computer Society, ISBN 0–8186–7743–0, 581
Oliveira, S. and Zaiane, O. 2004 Achieving privacy preservation when sharing data for
clustering. In Proceedings of the Workshop on Secure Data Management in a Connected
World. Toronto,Canada, 67–82
Özsoyoglu, G. and Chung, J. 1986 Information Loss in the Lattice Model of Summary
Tables due to Cell Suppression. In Proceedings of the Second International Conference
on Data Engineering. Washington, DC, USA: IEEE Computer Society, ISBN 0–8186–
0655–X, 75 – 83
Palley, M. A. and Simonoff, J. S. 1987 The use of regression methodology for the compromise of confidential information in statistical databases. ACM Transactions on Database
Systems, 12, Nr. 4, 593–608
Papageorgiou, H. et al. 2001 A Statistical Metadata Model for Simultaneous Manipulation
of both Data and Metadata. J. Intell. Inf. Syst. 17, Nr. 2-3, 169–192, ISSN 0925–9902
Parzen, E. 1962 On the estimation of a probability density function and mode. Annals
of Mathematical Statistics, 33, 1065–1076
Pinkas, B. 2002 Cryptographic techniques for privacy preserving data mining. SIGKDD
Explorations Newsletter, 4, Nr. 2, 12–19
Poole, W. K. 1974 Estimation of the Distribution Function of a Continuous Type Random
Variable Through Randomized Response. Journal of the American Statistical Association, 69, 1002–1005
Poole, W. and Clayton, A. C. 1982 Generalizations of a Contamination Model for Continuous Type Random Variables. Communications in Statistics – Theory and Methods, 11,
1733–1742
Raghunathan, T., Reiter, J. and Rubin, D. 2003 Multiple Imputation for Statistical
Disclosure Limitation. 19, Nr. 1, 1–16
Ramesh, G., Maniatty, W. and Zaki, M. 2003 Feasible itemset distributions in data
mining: theory and application. In Proceedings of the 22nd ACM SIGMOD-SIGACTSIGART Symposium on Principles of Database Systems., 284–295
Reiter, J. P. 2002 Satisfying Disclosure Restrictions With Synthetic Data Sets. 18, Nr. 4,
531–543
Reiter, J. P. 2003 Inference for partially synthetic, public use microdata sets. 29, Nr. 2,
181–188
Rizvi, S. and Haritsa, J. 2002 Maintaining data privacy in association rule mining. In
Proceedings of the 28th International Conference on Very Large Data Bases.
Rubin, D. B. 1987 Multiple Imputation for Nonresponse in Surveys. Volume 1, Wiley
Rubin, D. B. 1993 Discussion Statistical Disclosure Limitation. 9, Nr. 2, 461–468
Samarati, P. 2001 Protecting Respondents’ Identities in Microdata Release. IEEE Transactions on Knowledge and Data Engineering, 13, Nr. 6, 1010–1027, ISSN 1041–4347
Samarati, P. and Sweeney, L. 1998 Protecting Privacy when Disclosing Information:
k-Anonymity and its Enforcement through Generalization and Suppression. Computer
Science Laboratory, SRI International – Technical report. URL: http://www.csl.sri.com/papers/sritr-98-04/
Sande, G. 1983 Automated cell suppression to preserve confidentiality of business statistics. In Proceedings of the Second International Workshop on Statistical Database Management. Berkeley, CA, US: Lawrence Berkeley Laboratory, ISBN 1–87654–234–X, 346–
354
Sanil, A. P. et al. 2004 Privacy preserving regression modelling via distributed computation. In Proceedings of the 10th ACM SIGKDD international conference on Knowledge
discovery and data mining. ACM Press New York, NY, USA, 677–682
Sarathy, R. and Muralidhar, K. 2002 The security of confidential numerical data in
databases. Information Systems Research, 13, Nr. 4, 389–404
Schafer, J. 1997 Analysis of Incomplete Multivariate Data. Chapman Hall
Shannon, C. E. 1948 A Mathematical Theory of Communication. Bell System Technical
Journal, 27, 379–423
Shannon, C. E. 1949 Communication theory of secrecy systems. Bell System Technical
Journal, 28, Nr. 4, 656–715
Stewart, G. W. 1980 The efficient generation of random orthogonal matrices with an
application to condition estimation., 403–409
Stewart, G. and Sun, J. 1990 Matrix Perturbation Theory. Academic Press
Sweeney, L. 2002 k-anonymity: a model for protecting privacy. 10, Nr. 5, 557–570, ISSN
0218–4885
Tan, V. Y. F. and Ng, S.-K. 2007 Generic Probability Density Function Reconstruction for Randomization in Privacy-Preserving Data Mining. In Machine Learning and
Data Mining in Pattern Recognition, 5th International Conference(MLDM 2007). Volume 4571, Springer, 76–90
Vaidya, J. and Clifton, C. 2003 Privacy preserving k-means clustering over vertically
partitioned data. In Proceedings of the 9th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining., 206–215
Vaidya, J. and Clifton, C. 2002 Privacy Preserving Association Rule Mining in Vertically
Partitioned Data. In Proceedings of the 8th ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM Press New York, NY, USA, 639–644
W3C 2002 Platform for Privacy Preferences (P3P). URL: http://www.w3.org/TR/P3P/
Wang, K., Yu, P. and Chakraborty, S. 2004 Bottom-up generalization: a data mining
solution to privacy protection. In Proceedings of the 4th IEEE International Conference
on Data Mining. Brighton, UK
Wang, Y., Wu, X. and Zheng, Y. 2004 Privacy Preserving Data Generation for Database
Application Performance Testing. In Proceedings of 1st International Conference on Trust
and Privacy in Digital Business (TrustBus04)., 142–151
Warner, S. L. 1965 Randomized Response: A Survey Technique for Eliminating evasive
answer bias. The American Statistical Association, 60, Nr. 309, 63–69
Weyl, H. 1911 Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller
Differentialgleichungen. Mathematische Annalen, 71, 441–479
Wright, R. and Yang, Z. 2004 Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In Proceedings of the 10th ACM SIGKDD
international conference on Knowledge discovery and data mining. ACM Press New
York, NY, USA, 713 – 718
Wu, C. W. 2003 Privacy Preserving Data Mining: A Signal Processing Perspective And A
Simple Data Perturbation Protocol. In IEEE ICDM Workshop on Privacy Preserving
Data Mining., 10–17
Wu, X. et al. 2005a Privacy Aware Data Generation for Testing Database Applications. In
Proceedings of the 9th International Database Engineering and Application Symposium.,
317–326
Wu, X., Wang, Y. and Zheng, Y. 2003 Privacy preserving database application testing.
In Proceedings of the ACM Workshop on Privacy in Electronic Society., 118–128
Wu, X., Wang, Y. and Zheng, Y. 2005 Statistical database modeling for privacy preserving database generation. In Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems.
Wu, X. et al. 2005b Privacy-Aware Market Basket Data Set Generation: A Feasible
Approach for Inverse Frequent Set Mining. In Proceedings of the 5th SIAM International
Conference on Data Mining., 103–114
Yao, A. 1982 Protocols for secure computations. In Proceedings of the 23rd
Annual Symposium on Foundations of Computer Science., 160–164