International Journal of Computer Application (2250-1797)
Volume 5– No. 3, April 2015
Impact of Known Input-Output Attack in CAMDP Technique for Privacy Preserving Data Mining

Bhupendra Kumar Pandya, Umesh Kumar Singh, Keerti Dixit
Institute of Computer Science, Vikram University, Ujjain
Abstract:
Privacy preservation has become a major issue in many data mining applications. When a data
set is released to other parties for data mining, some privacy-preserving technique is often
required to reduce the possibility of identifying sensitive information about individuals. Many
data mining applications deal with privacy sensitive data. Financial transactions, health-care
records, and network communication traffic are some examples. Data mining in such privacy-sensitive domains is facing growing concerns. Therefore, we need to develop data mining techniques that are sensitive to the privacy issue. This research paper considers the CAMDP (Combination of Additive and Multiplicative Data Perturbation) technique for privacy preserving data mining. This technique explores the possibility of constructing a new representation of the data. It can be proved that the CAMDP technique for privacy preserving data mining can be applied to several categories of popular data mining models with better utility preservation and privacy preservation. This research paper presents extensive theoretical analysis and experimental results on the accuracy and privacy of the CAMDP technique. We examine how well the attacker can recover the original data from the transformed data and prior information.
Keywords: CAMDP, I/O Attack
1. Introduction
Privacy and security, particularly maintaining the confidentiality of data, have become a challenging issue with advances in information and communication technology. The ability to communicate and share data has many benefits, and progress in scientific research depends on the availability and sharing of information and ideas. But protecting the privacy of human participants is given top priority by researchers.
Therefore, we need to develop data mining techniques that are sensitive to the privacy issue. This has fostered the development of a class of data mining algorithms [1,2] that try to extract data patterns without directly accessing the original data and guarantee that the mining process does not get sufficient information to reconstruct the original data.
In this paper, we analyze a new multidimensional data perturbation technique: CAMDP (Combination of Additive and Multiplicative Data Perturbation) for Privacy Preserving Data Mining.

2. CAMDP Technique:
The CAMDP technique is a Combination of Additive and Multiplicative Data Perturbation. This method combines the strengths of the translation-based and distance-preserving methods.
2.1. Translation Based Perturbation
In the TBP method, the observations of confidential attributes are perturbed using additive noise. The noise term applied to each confidential attribute is a constant whose value can be either positive or negative.
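As a minimal illustration (with hypothetical data and noise values, not taken from the paper), the TBP step amounts to one constant per attribute:

```python
import numpy as np

# Sketch of TBP: one constant noise term per confidential attribute
# (rows are attributes, columns are records, as in Section 2.4).
# The data and noise values here are hypothetical.
D = np.array([[55.0, 62.0, 71.0],   # attribute 1
              [18.0, 21.0, 19.0]])  # attribute 2
t = np.array([[7.0], [-3.0]])       # constant noise, one per attribute
D_tbp = D + t                       # additive perturbation
```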
2.2. Distance Based Perturbation
To define the distance preserving transformation [3-7], let us start with the definition of a metric space. In mathematics, a metric space is a set S with a global distance function (the metric d) that, for every two points x, y in S, gives the distance between them as a nonnegative real number d(x, y). Usually, we denote a metric space by a 2-tuple (S, d). A metric space must also satisfy
1. d(x, y) = 0 iff x = y (identity),
2. d(x, y) = d(y, x) (symmetry),
3. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
2.3. Orthogonal Matrix Generation
Many matrix decompositions, such as QR decomposition, SVD, spectral decomposition and polar decomposition, involve orthogonal matrices. To generate a uniformly distributed random orthogonal matrix, we usually fill a matrix with independent Gaussian random entries and then use QR decomposition.
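A minimal NumPy sketch of this construction (the sign correction using the diagonal of R is the standard refinement from [4] that makes the distribution exactly uniform):

```python
import numpy as np

def random_orthogonal_matrix(n, rng=None):
    """Random n x n orthogonal matrix: fill with independent Gaussian
    entries, then take the Q factor of a QR decomposition [4]."""
    rng = np.random.default_rng(rng)
    gaussian = rng.standard_normal((n, n))
    q, r = np.linalg.qr(gaussian)
    # Multiply each column of Q by the sign of the matching diagonal
    # entry of R so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

# Sanity check: an orthogonal matrix satisfies Q'Q = I.
Q = random_orthogonal_matrix(4, rng=42)
assert np.allclose(Q.T @ Q, np.eye(4))
```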
2.4. Data Perturbation Model
Translation and orthogonal transformation-based data perturbation can be implemented as follows. Suppose the data owner has a private database Dn×n, with each column of D being a record and each row an attribute. The data owner generates an n × n noise matrix OR and computes
D'n×n = Dn×n * ORn×n
where ORn×n is generated by translation and orthogonal transformation. The perturbed data D'n×n is then released for future usage. Next we describe the privacy application scenarios where orthogonal transformation can be used to hide the data while allowing important patterns to be discovered without error.
This technique has the nice property that it preserves vector inner products and distances in Euclidean space. Therefore, any data mining algorithm that relies on inner products or Euclidean distances as its similarity criterion is invariant to this transformation. Put in other words, many data mining algorithms can be applied to the transformed data and produce exactly the same results as if applied to the original data, e.g., KNN classification, perceptron learning, support vector machines, distance-based clustering and outlier detection.
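As an illustration of this invariance, the following sketch perturbs hypothetical data with a random orthogonal noise matrix built as in Section 2.3 and checks that a record-to-record distance is unchanged. Note that, to preserve distances between records stored as columns, the noise matrix multiplies on the left, matching the Y = MTX convention of Section 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 5                           # n attributes, m records (columns)
D = rng.uniform(0, 100, size=(n, m))  # hypothetical original data

# Random orthogonal noise matrix, built as in Section 2.3.
OR, _ = np.linalg.qr(rng.standard_normal((n, n)))
D_pert = OR @ D                       # perturbed data released to miners

# Pairwise Euclidean distances between records are unchanged, so KNN,
# distance-based clustering, etc. give identical results on D_pert.
d_orig = np.linalg.norm(D[:, 0] - D[:, 1])
d_pert = np.linalg.norm(D_pert[:, 0] - D_pert[:, 1])
assert np.isclose(d_orig, d_pert)
```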
3. CAMDP Algorithm
Algorithm: Privacy Preserving using CAMDP Technique
Input: Original Data D.
Intermediate Result: Noise Matrix.
Output: Perturbed data stream D'.
Steps:
1. Given input data Dn×n.
2. Generate an Orthogonal Matrix On×n from the Original Data Dn×n.
3. Create Translation Matrix Tn×n.
4. Create Matrix OTn×n by adding the Translation Matrix Tn×n and Orthogonal Matrix On×n.
5. Generate an Orthogonal Matrix (noise matrix) ORn×n from the Matrix OTn×n.
6. Create Perturbed Dataset D'n×n by multiplying Original Data Dn×n and Noise Matrix ORn×n.
7. Release Perturbed Data for Data Miner.
8. Stop.
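A hedged Python sketch of these steps, under stated assumptions: the paper leaves some details open, so we realize Gram-Schmidt via QR, fill the translation matrix with a single constant t (Section 2.1 makes the additive noise term a constant), and multiply the noise matrix on the left so that record-to-record distances are preserved:

```python
import numpy as np

def camdp_perturb(D, t=5.0):
    """Sketch of the CAMDP algorithm of Section 3 (assumptions noted
    in the text above; D is n x m with records as columns, m >= n)."""
    n = D.shape[0]
    # Step 2: orthogonal matrix O from the original data via QR
    # (the Q factor is what the Gram-Schmidt process would produce).
    O, _ = np.linalg.qr(D)
    # Step 3: translation matrix T (one constant additive term).
    T = np.full((n, n), t)
    # Step 4: OT = T + O.
    OT = T + O
    # Step 5: orthogonal noise matrix OR from OT.
    OR, _ = np.linalg.qr(OT)
    # Steps 6-7: perturbed dataset released to the data miner.
    return OR @ D
```

With this reading, step 6 reduces to the perturbation model of Section 2.4, so the distance-preservation check shown there applies unchanged.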
4. Privacy Breach
Orthogonal transformation-based data perturbation has the nice property that many data mining algorithms can be applied to the perturbed data and produce exactly the same results as if applied to the original data. However, the issue of how well the original data is hidden has, to our knowledge, not been carefully studied. We take a step in this direction by assuming the role of an attacker armed with three types of prior information regarding the original data. We examine how well the attacker can recover the original data from the perturbed data and prior information.
Before stepping into the details of the attack algorithms, we first give the definition of privacy breach. We assume that an attacker will have Y and that Y was produced from X by an orthogonal transformation. The attacker will also have prior knowledge. The attacker will produce x̂ ∈ Rn and 1 ≤ î ≤ m, where x̂ is the attacker's estimate of xî, the îth data tuple (column) in X.

Definition 4.1 (ε-Privacy Breach) For any ε > 0, we say that an ε-privacy breach occurs if ||x̂ − xî|| ≤ ||xî||ε.

Informally stated, an ε-privacy breach occurs if the attacker's estimate is wrong with relative error no more than ε. We further define the probability of privacy breach as follows:

Definition 4.2 (Probability of ε-Privacy Breach) We define ρ(xî, ε) as the probability that an ε-privacy breach occurs given that the attacker chose î, i.e., ρ(xî, ε) = Prob{||x̂ − xî|| ≤ ||xî||ε}.
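Definition 4.1 translates directly into code; a minimal sketch (the ρ of Definition 4.2 can then be estimated by averaging this indicator over repeated runs of the attacker's stochastic procedure):

```python
import numpy as np

def eps_privacy_breach(x_hat, x_true, eps):
    """Definition 4.1: a breach occurs when the attacker's estimate
    x_hat has relative error at most eps w.r.t. the true record."""
    return np.linalg.norm(x_hat - x_true) <= eps * np.linalg.norm(x_true)
```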
4.3 Prior Knowledge
Let the n × m matrix X denote a private dataset, with each column of X being a record and each row an attribute. We assume that the attacker knows that the transformation function T is an orthogonal transformation and knows the perturbed data Y = MTX. In most realistic scenarios, the attacker has some additional prior knowledge which can potentially be used effectively for breaching privacy. We consider three types of prior knowledge.

Known input-output: The attacker knows some collection of linearly independent private data records. In other words, the attacker has a set of linearly independent input-output pairs. In this scenario, we can use an attack algorithm based on linear algebra and statistics theory.

Known sample: The attacker knows that the original dataset arose as independent samples of some n-dimensional random vector V with unknown p.d.f., and the attacker has another collection of independent samples from V. For technical reasons, we make a mild additional assumption: the covariance matrix of V has distinct eigenvalues. In this scenario, we can use a principal component analysis (PCA)-based attack algorithm.

Independent signals: Each data attribute can be thought of as a time-varying signal. All the signals, at any given time, are statistically independent, and all the signals are non-Gaussian with the exception of one. In this scenario, we can use an independent component analysis (ICA)-based attack algorithm.
5. Known Input-Output Attack
Consider the perturbation model
Y = MTX ⇔ (Yk Ym−k) = MT (Xk Xm−k).
Let Xk denote the first k columns of X and Xm−k the remainder (likewise for Y). We assume that the columns of Xk are all linearly independent and that Xk is known to the attacker (Y is, of course, also known). The attacker will produce x̂ and 1 ≤ î ≤ m−k such that x̂ is a good estimate of xî, the îth column in Xm−k (the (k + î)th column in X). If k = n, then the attacker can recover any column in Xm−k perfectly as Xm−k = (Yk Xk^−1)′ Ym−k. Thus, we assume k < n. Based on known information, the attacker can narrow down the space of possibilities for MT to M(Xk, Yk) = {M ∈ On : MXk = Yk}.
Because the attacker has no additional information, any of these matrices is equally likely to have been MT. The attacker chooses M̂ uniformly from M(Xk, Yk), chooses index 1 ≤ î ≤ m−k based on ρ(xî, ε) (the probability that an ε-privacy breach occurs given that î was chosen), then produces x̂ = M̂′yî = M̂′MT xî. Later we will show how the attacker can compute ρ(xî, ε) for all 1 ≤ î ≤ m−k from ε and Y (known information). Note that M(Xk, Yk), in most cases, is uncountable. As such, more precise definitions are needed for "choosing M̂ uniformly from M(Xk, Yk)" and "the probability that ||M̂′MTx − x|| ≤ ||x||ε".
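The k = n claim above is easy to check numerically; a small sketch with synthetic data (the transpose inverts Yk Xk^−1 because MT is orthogonal):

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 4, 10
k = n                                    # attacker knows n full records

X = rng.standard_normal((n, m))          # private data (columns = records)
M_T, _ = np.linalg.qr(rng.standard_normal((n, n)))  # unknown to attacker
Y = M_T @ X                              # released perturbed data

Xk, Yk = X[:, :k], Y[:, :k]              # known input-output pairs
# M_T = Yk Xk^-1, and an orthogonal matrix inverts by transposition:
X_rest = (Yk @ np.linalg.inv(Xk)).T @ Y[:, k:]
assert np.allclose(X_rest, X[:, k:])     # perfect recovery when k = n
```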
The goal of the attacker is to use the perturbed data tuples and known original data tuples to produce good estimates of unknown original data tuples, along with links to their perturbed counterparts. To achieve this, we can use an attack technique called the known input-output attack, which proceeds in three steps.
1. The attacker links as many as possible of the known original data tuples (columns in X) to their corresponding perturbed counterparts (columns in Y).
2. For each unlinked perturbed data tuple, the attacker computes the breach probability of the associated unknown original data tuple. This is the probability that the following stochastic procedure will result in an accurate enough estimate of the associated unknown original data tuple to be considered a privacy breach (the probability calculation is done by applying a closed-form expression we derive later). (a) A Euclidean distance-preserving transformation is uniformly chosen from the space of such transformations that satisfy the original-perturbed (input-output) constraints from step 1. (b) The inverse of the chosen transformation is used to estimate original data tuples from their perturbed counterparts.
3. The attacker chooses the perturbed data tuples which are most vulnerable to breach based on their probabilities from step 2, e.g., chooses the one with the maximum probability or chooses all whose probability exceeds a threshold, and generates estimates of their associated unknown original data tuples.

6. Known Input-Output Attack Algorithm
As stated earlier, the adversary chooses M̂ uniformly from M(Xk, Yk) and 1 ≤ î ≤ m−k to maximize ρ(xî, ε).

Algorithm: Known Input-Output Attack Technique
Inputs: Xk, a set of linearly independent columns from X known to the attacker; Y = MTX, known to the attacker, where MT ∈ On is unknown; and ε ≥ 0, known to the attacker.
Outputs: 1 ≤ î ≤ m−k which maximizes ρ(xî, ε), and x̂ ∈ Rn, the corresponding estimate of xî.
1: Compute Vk, an n × k orthogonal matrix with Col(Vk) = Col(Yk), from Yk using the Gram-Schmidt process.
2: For each 1 ≤ j ≤ m − k do
3: Compute d(yj, Yk) = ||VkV′kyj − yj|| and ||yj||ε.
4: Compute ρ(xj, ε) using Equation 4.2.
5: End For.
6: Set î ← argmax1≤j≤m−k ρ(xj, ε).
7: Choose M̂ uniformly from M(Xk, Yk).
8: Set x̂ ← M̂′yî.
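A hedged Python sketch of this algorithm follows, with three explicit substitutions: QR stands in for the Gram-Schmidt process; a vulnerability score built from d(yj, Yk) and ||yj||ε stands in for the closed-form ρ of Equation 4.2, which the paper does not reproduce; and an SVD-based (orthogonal Procrustes) choice of M̂ returns one arbitrary member of M(Xk, Yk) rather than a uniform draw:

```python
import numpy as np

def known_io_attack(Xk, Y, eps):
    """Sketch of the known input-output attack (Section 6) for k < n.
    Xk: n x k known original records; Y: n x m perturbed data whose
    first k columns correspond to Xk. Returns (i_hat, x_hat)."""
    n, k = Xk.shape
    Yk, Y_rest = Y[:, :k], Y[:, k:]

    # Step 1: Vk, an orthonormal basis of Col(Yk) (Gram-Schmidt via QR).
    Vk, _ = np.linalg.qr(Yk)

    # Steps 2-5: d(yj, Yk) = ||Vk Vk' yj - yj|| is yj's distance from
    # Col(Yk); tuples close to that subspace relative to ||yj|| eps
    # are the most vulnerable (a proxy for rho(xj, eps)).
    d = np.linalg.norm(Vk @ (Vk.T @ Y_rest) - Y_rest, axis=0)
    scores = eps * np.linalg.norm(Y_rest, axis=0) - d

    # Step 6: most vulnerable index (argmax of the score proxy).
    i_hat = int(np.argmax(scores))

    # Steps 7-8: pick some orthogonal M with M Xk = Yk (Procrustes);
    # the paper instead draws M uniformly from M(Xk, Yk).
    U, _, Vt = np.linalg.svd(Yk @ Xk.T)
    M_hat = U @ Vt
    x_hat = M_hat.T @ Y_rest[:, i_hat]
    return i_hat, x_hat
```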
7. Experimental Result:
We have taken the student records of Vikram University as the original data and applied the input/output attack to this data. From this data we generated an orthogonal matrix with the help of the Gram-Schmidt process. After this we calculated the inverse of the orthogonal matrix and applied it to the perturbed data. We plotted graphs of the original data, the perturbed data and the recovered data. Figure 1 shows the original data and the perturbed data, and Figure 2 shows the original data and the recovered data.

[Figure 1: Original and Perturbed Data after CAMDP technique. Figure 2: Original and Recovered Data after I/O Attack.]
8. Discussion:
The graphs above show that, after applying the input/output attack to data perturbed by the CAMDP technique, the attacker cannot recover the original data. Hence this technique preserves the required privacy.
9. Conclusion:
In this research paper we examined the CAMDP (Combination of Additive and Multiplicative Data Perturbation) technique for privacy preserving data mining. This technique is a linear combination of Translation and Distance Preserving Perturbation.
Perturbation techniques are often evaluated with two basic metrics: the level of privacy guarantee and the level of model-specific data utility preserved, which is often measured by the loss of accuracy for data clustering. The experimental results have shown that this technique provides a proper degree of privacy. By using this technique, data owners can share their data with data miners to find accurate clusters without any concern about violating data privacy. Using the data perturbation algorithm, we generate different perturbed data sets, and in the second step we apply clustering and classification algorithms on the perturbed data sets. We carried out a set of experiments to generate clustering and classification models of the original and perturbed data sets, and the clustering and classification results have been evaluated on accuracy parameters. The proposed algorithm can perturb sensitive attributes with numerical values. Hence this technique offers higher privacy protection than orthogonal transformation-based distance preserving perturbation, and higher accuracy than projection-based data perturbation for privacy preserving data mining.
10. References
[1] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," in Proceedings of the ACM SIGMOD Conference on Management of Data, pages 439-450, Dallas, Texas, May 2000. ACM Press.
[2] M. Kantarcioglu and C. Clifton, "Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," in SIGMOD Workshop on DMKD, Madison, WI, June 2002.
[3] H. S. M. Coxeter, Regular Polytopes, 2nd ed., 1963, ch. XII, pp. 213-217.
[4] G. W. Stewart, "The efficient generation of random orthogonal matrices with an application to condition estimation," SIAM Journal on Numerical Analysis, vol. 17, no. 3, pp. 403-409, 1980.
[5] B. Pandya, U. K. Singh and K. Dixit, "An Analysis of Euclidean Distance Preserving Perturbation for Privacy Preserving Data Mining," International Journal for Research in Applied Science and Engineering Technology, Vol. 2, Issue X, 2014.
[6] B. Pandya, U. K. Singh and K. Dixit, "Performance of Euclidean Distance Preserving Perturbation for K-Means Clustering," International Journal of Advanced Scientific and Technical Research, Vol. 5, Issue 4, pp. 282-289, 2014.
[7] B. Pandya, U. K. Singh and K. Dixit, "Performance of Euclidean Distance Preserving Perturbation for K-Nearest Neighbour Classification," International Journal of Computer Application, Vol. 105, No. 2, pp. 34-36, 2014.