International Journal of Computer Application (2250-1797), Volume 5, No. 3, April 2015

Impact of Known Input-Output Attack in the CAMDP Technique for Privacy Preserving Data Mining

Bhupendra Kumar Pandya, Umesh Kumar Singh, Keerti Dixit
Institute of Computer Science, Vikram University, Ujjain
[email protected], [email protected], [email protected]

Abstract: Privacy preservation has become a major issue in many data mining applications. When a data set is released to other parties for data mining, some privacy-preserving technique is often required to reduce the possibility of identifying sensitive information about individuals. Many data mining applications deal with privacy-sensitive data; financial transactions, health-care records, and network communication traffic are some examples. Data mining in such privacy-sensitive domains faces growing concerns, and we therefore need to develop data mining techniques that are sensitive to the privacy issue. This research paper considers the CAMDP (Combination of Additive and Multiplicative Data Perturbation) technique for privacy preserving data mining. The technique explores the possibility of constructing a new representation of the data, and it can be applied to several categories of popular data mining models with good utility preservation and privacy preservation. This paper presents theoretical analysis and experimental results on the accuracy and privacy of the CAMDP technique; in particular, we examine how well an attacker can recover the original data from the transformed data and prior information.

Keywords: CAMDP, I/O Attack, Perturbation

1. Introduction

Privacy and security, particularly maintaining the confidentiality of data, have become a challenging issue with advances in information and communication technology. The ability to communicate and share data has many benefits, and progress in scientific research depends on the availability and sharing of information and ideas. At the same time, protecting the privacy of human participants is given top priority by researchers. We therefore need to develop data mining techniques that are sensitive to the privacy issue. This has fostered the development of a class of data mining algorithms [1, 2] that try to extract data patterns without directly accessing the original data and that guarantee the mining process does not get sufficient information to reconstruct the original data.

In this paper, we analyze a new multidimensional data perturbation technique: CAMDP (Combination of Additive and Multiplicative Data Perturbation) for privacy preserving data mining.

2. CAMDP Technique

The CAMDP technique is a combination of additive and multiplicative data perturbation for privacy preserving data mining. It combines the strengths of the translation-based and distance-preserving methods.

2.1. Translation Based Perturbation

In the TBP method, the observations of confidential attributes are perturbed using additive noise. The noise term applied to each confidential attribute is a constant whose value can be either positive or negative.

2.2. Distance Based Perturbation

To define a distance-preserving transformation [3-7], let us start with the definition of a metric space. In mathematics, a metric space is a set S with a global distance function (the metric d) that, for every two points x, y in S, gives the distance between them as a nonnegative real number d(x, y). Usually, we denote a metric space by the 2-tuple (S, d). A metric space must also satisfy
1. d(x, y) = 0 iff x = y (identity),
2. d(x, y) = d(y, x) (symmetry),
3. d(x, y) + d(y, z) ≥ d(x, z) (triangle inequality).
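As a concrete illustration of Sections 2.1 and 2.2, here is a minimal NumPy sketch; the data matrix, the noise values, and all variable names are hypothetical rather than taken from the paper. It applies a constant per-attribute additive noise term (the TBP step), and then checks numerically that an orthogonal transformation leaves the Euclidean metric between records unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 2.1 Translation Based Perturbation (TBP) ---
# Rows are attributes, columns are records. Each confidential
# attribute i receives a constant noise term t[i], positive or negative.
D = rng.standard_normal((3, 5))      # hypothetical 3-attribute, 5-record data set
t = np.array([5.0, -3.0, 0.5])       # one constant noise value per attribute
D_tbp = D + t[:, None]               # broadcast t[i] over every record

# --- 2.2 Distance Based Perturbation ---
# An orthogonal matrix M is an isometry of Euclidean space, so the metric
# d(x, y) = ||x - y|| between any two records is unchanged by M.
M = np.linalg.qr(rng.standard_normal((3, 3)))[0]   # random orthogonal matrix
x, y = D[:, 0], D[:, 1]
print(np.isclose(np.linalg.norm(x - y),
                 np.linalg.norm(M @ x - M @ y)))   # True: distance preserved
```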
2.3. Generation of an Orthogonal Matrix

Many matrix decompositions involve orthogonal matrices, such as QR decomposition, SVD, spectral decomposition and polar decomposition. To generate a uniformly distributed random orthogonal matrix, we usually fill a matrix with independent Gaussian random entries and then use QR decomposition.

2.4. Data Perturbation Model

Translation and orthogonal transformation-based data perturbation can be implemented as follows. Suppose the data owner has a private database D_{n×n}, with each column of D being a record and each row an attribute. The data owner generates an n×n noise matrix OR_{n×n} by translation and orthogonal transformation, and computes

D'_{n×n} = D_{n×n} · OR_{n×n}.

The perturbed data D'_{n×n} is then released for future usage. Next we describe application scenarios where an orthogonal transformation can be used to hide the data while allowing important patterns to be discovered without error. The technique has the nice property that it preserves vector inner products and distances in Euclidean space. Therefore, any data mining algorithm that relies on the inner product or the Euclidean distance as a similarity criterion is invariant to this transformation. Put another way, many data mining algorithms can be applied to the transformed data and produce exactly the same results as if applied to the original data, e.g., the KNN classifier, perceptron learning, support vector machines, distance-based clustering and outlier detection.

3. CAMDP Algorithm

Algorithm: Privacy Preserving using CAMDP Technique
Input: Original data D.
Intermediate result: Noise matrix.
Output: Perturbed data stream D'.
Steps:
1. Given input data D_{n×n}.
2. Generate an orthogonal matrix O_{n×n} from the original data D_{n×n}.
3. Create a translation matrix T_{n×n}.
4. Create the matrix OT_{n×n} by adding the translation matrix T_{n×n} and the orthogonal matrix O_{n×n}.
5. Generate an orthogonal matrix (noise matrix) OR_{n×n} from the matrix OT_{n×n}.
6. Create the perturbed dataset D'_{n×n} by multiplying the original data D_{n×n} and the noise matrix OR_{n×n}.
7. Release the perturbed data to the data miner.
8. Stop.
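The following is a minimal NumPy sketch of these steps, not the authors' implementation. The paper does not pin down the exact form of the translation matrix T or the procedure for generating an orthogonal matrix "from" a given matrix, so the sketch assumes a constant-per-attribute translation matrix and reads "generate an orthogonal matrix from A" as taking the Q factor of the QR decomposition of A (the same device Section 2.3 uses for random matrices; cf. reference [4]).

```python
import numpy as np

def orthogonalize(A):
    """Derive an orthogonal matrix from A: the Q factor of its QR
    decomposition (assumed reading of 'generate an orthogonal matrix
    from' a given matrix)."""
    Q, _ = np.linalg.qr(A)
    return Q

def camdp_perturb(D, t):
    """Sketch of steps 1-6 of the CAMDP algorithm above; D must be
    square (D_{n x n}) as in the paper's model."""
    n = D.shape[0]
    O = orthogonalize(D)                          # step 2
    T = np.asarray(t)[:, None] * np.ones((n, n))  # step 3 (assumed form of T)
    OT = O + T                                    # step 4
    OR = orthogonalize(OT)                        # step 5: noise matrix
    return D @ OR                                 # step 6: D' = D * OR

# Hypothetical 4x4 example; the noise constants are made up.
rng = np.random.default_rng(1)
D = rng.standard_normal((4, 4))
D_prime = camdp_perturb(D, t=[2.0, -1.0, 0.5, 3.0])  # released to the miner
```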
4. Privacy Breach

Orthogonal transformation-based data perturbation has the nice property that many data mining algorithms can be applied to the perturbed data and produce exactly the same results as if applied to the original data. However, the issue of how well the original data is hidden has, to our knowledge, not been carefully studied. We take a step in this direction by assuming the role of an attacker armed with three types of prior information regarding the original data, and we examine how well the attacker can recover the original data from the perturbed data and the prior information.

Before stepping into the details of the attack algorithms, we first give the definition of a privacy breach. We assume that the attacker has the perturbed data Y, knows that Y was produced from X by an orthogonal transformation, and has some prior knowledge. The attacker will produce x̂ ∈ R^n and 1 ≤ î ≤ m, where x̂ is the attacker's estimate of x_î, the î-th data tuple (column) in X.

Definition 4.1 (ε-Privacy Breach). For any ε > 0, we say that an ε-privacy breach occurs if ||x̂ − x_î|| ≤ ||x_î|| ε.

Informally stated, an ε-privacy breach occurs if the attacker's estimate is wrong with relative error no more than ε. We further define the probability of a privacy breach as follows.

Definition 4.2 (Probability of ε-Privacy Breach). We define ρ(x_î, ε) as the probability that an ε-privacy breach occurs given that the attacker chose î, i.e., ρ(x_î, ε) = Prob{ ||x̂ − x_î|| ≤ ||x_î|| ε }.

4.3. Prior Knowledge

Let the n×m matrix X denote the private dataset, with each column of X being a record and each row an attribute. We assume that the attacker knows that the transformation is orthogonal and knows the perturbed data Y = M_T X. In most realistic scenarios, the attacker also has some additional prior knowledge which can potentially be used for breaching privacy. We consider three types of prior knowledge.

Known input-output: The attacker knows some collection of linearly independent private data records; in other words, the attacker has a set of linearly independent input-output pairs. In this scenario, an attack algorithm based on linear algebra and statistics can be used.

Known sample: The attacker knows that the original dataset arose as independent samples of some n-dimensional random vector V with unknown p.d.f., and the attacker has another collection of independent samples from V. For technical reasons, we make a mild additional assumption: the covariance matrix of V has distinct eigenvalues. In this scenario, a principal component analysis (PCA)-based attack algorithm can be used.

Independent signals: Each data attribute can be thought of as a time-varying signal. All the signals, at any given time, are statistically independent, and all the signals are non-Gaussian with the exception of one. In this scenario, an independent component analysis (ICA)-based attack algorithm can be used.

5. Known Input-Output Attack

Consider the perturbation model

Y = M_T X  ⇔  (Y_k Y_{m−k}) = M_T (X_k X_{m−k}),

where X_k denotes the first k columns of X and X_{m−k} the remainder (likewise for Y). We assume that the columns of X_k are linearly independent and that X_k is known to the attacker (Y is, of course, also known). The attacker will produce x̂ and 1 ≤ î ≤ m−k such that x̂ is a good estimate of x_î, the î-th column of X_{m−k} (the (k+î)-th column of X). If k = n, then the attacker can recover any column of X_{m−k} perfectly, since M_T = Y_k X_k^{−1} is orthogonal and therefore X_{m−k} = (Y_k X_k^{−1})′ Y_{m−k}; a numerical check of this identity appears at the end of this section. Thus, we assume k < n.

Based on the known information, the attacker can narrow down the space of possibilities for M_T to M(X_k, Y_k) = {M ∈ O_n : M X_k = Y_k}. Because the attacker has no additional information, any of these matrices is equally likely to have been M_T. The attacker chooses M̂ uniformly from M(X_k, Y_k), chooses the index 1 ≤ î ≤ m−k based on ρ(x_î, ε) (the probability that an ε-privacy breach occurs given that î was chosen), and produces x̂ = M̂′ y_î = M̂′ M_T x_î. Later we show how the attacker can compute ρ(x_î, ε) for all 1 ≤ î ≤ m−k from ε and Y (the known information). Note that M(X_k, Y_k) is, in most cases, uncountable; as such, more precise definitions are needed for "choosing M̂ uniformly from M(X_k, Y_k)" and for "the probability that ||M̂′ M_T x − x|| ≤ ||x|| ε".

The goal of the attacker is to use the perturbed data tuples and the known original data tuples to produce good estimates of the unknown original data tuples, along with links to their perturbed counterparts. To achieve this, the attacker can use an attack technique called the known input attack, which proceeds in three steps.

1. The attacker links as many of the known original data tuples (columns of X) as possible to their corresponding perturbed counterparts (columns of Y).
2. For each unlinked perturbed data tuple, the attacker computes the breach probability of the associated unknown original data tuple. This is the probability that the following stochastic procedure will result in an estimate of the associated unknown original data tuple accurate enough to be considered a privacy breach (the probability calculation is done by applying a closed-form expression we derive later): (a) a Euclidean distance-preserving transformation is chosen uniformly from the space of such transformations that satisfy the original-perturbed (input-output) constraints from step 1; (b) the inverse of the chosen transformation is used to estimate original data tuples from their perturbed counterparts.
3. The attacker chooses the perturbed data tuples which are most vulnerable to a breach based on their probabilities from step 2 (e.g., the one with the maximum probability, or all whose probability exceeds a threshold) and generates estimates of their associated unknown original data tuples.
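The following is a quick numerical check of the perfect-recovery case k = n; the data, sizes, and variable names are hypothetical, and only the identity X_{m−k} = (Y_k X_k^{−1})′ Y_{m−k} comes from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 10
X = rng.standard_normal((n, m))                      # private data, columns = records
M_T = np.linalg.qr(rng.standard_normal((n, n)))[0]   # unknown orthogonal transform
Y = M_T @ X                                          # perturbed data released

k = n                                                # attacker knows n independent records
Xk, Yk = X[:, :k], Y[:, :k]
# M_T = Yk @ inv(Xk); since M_T is orthogonal, its inverse is its transpose.
X_recovered = (Yk @ np.linalg.inv(Xk)).T @ Y[:, k:]
print(np.allclose(X_recovered, X[:, k:]))            # True: perfect recovery at k = n
```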
6. Known Input-Output Attack Algorithm

As stated earlier, the adversary chooses M̂ uniformly from M(X_k, Y_k) and 1 ≤ î ≤ m−k to maximize ρ(x_î, ε).

Algorithm: Known Input-Output Attack Technique
Inputs: X_k, a set of linearly independent columns of X known to the attacker; Y = M_T X, known to the attacker, where M_T ∈ O_n is unknown; and ε ≥ 0, known to the attacker.
Outputs: 1 ≤ î ≤ m−k which maximizes ρ(x_î, ε), and x̂ ∈ R^n, the corresponding estimate of x_î.
1: Compute V_k, an n×k orthogonal matrix with Col(V_k) = Col(Y_k), from Y_k using the Gram-Schmidt process.
2: For each 1 ≤ j ≤ m−k do
3:   Compute d(y_j, Y_k) = ||V_k V_k′ y_j − y_j|| and ||y_j|| ε.
4:   Compute ρ(x_j, ε) using Equation 4.2.
5: End For
6: Set î ← arg max_{1≤j≤m−k} ρ(x_j, ε).
7: Choose M̂ uniformly from M(X_k, Y_k).
8: Set x̂ ← M̂′ y_î.
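The following is a minimal NumPy sketch of this algorithm under stated assumptions; it is illustrative, not the authors' implementation. The Gram-Schmidt step is realized by a QR decomposition; the uniform draw of M̂ from M(X_k, Y_k) uses the standard construction (M̂ is determined on Col(X_k) by the constraint and Haar-random on the orthogonal complement); and, since the closed-form ρ of Equation 4.2 is not reproduced in this excerpt, the sketch ranks candidate indices by d(y_j, Y_k) as a stand-in (a tuple closer to Col(Y_k) is pinned down more tightly by the constraints, so its recovery is more accurate).

```python
import numpy as np

def qr_pos(A):
    """Full QR with the convention diag(R) >= 0, so the leading columns
    of Q are the unique Gram-Schmidt basis of Col(A)."""
    Q, R = np.linalg.qr(A, mode='complete')
    s = np.sign(np.diag(R)[:A.shape[1]])
    s[s == 0] = 1.0
    Q[:, :A.shape[1]] *= s
    return Q

def sample_M_hat(Xk, Yk, rng):
    """Draw M uniformly from M(Xk, Yk) = {M in O_n : M Xk = Yk}: M maps
    Col(Xk) onto Col(Yk) as forced by the constraint, and acts as a
    Haar-random orthogonal matrix on the orthogonal complement."""
    n, k = Xk.shape
    Qx, Qy = qr_pos(Xk), qr_pos(Yk)
    S = np.linalg.qr(rng.standard_normal((n - k, n - k)))[0]
    return Qy[:, :k] @ Qx[:, :k].T + Qy[:, k:] @ S @ Qx[:, k:].T

def known_io_attack(Xk, Y, rng):
    """Sketch of steps 1-8: project each unlinked y_j onto Col(Y_k) to
    get d(y_j, Y_k), pick the most exposed index (stand-in for the
    argmax over rho), then estimate x_hat = M_hat' y_i."""
    n, k = Xk.shape
    Yk, Yrest = Y[:, :k], Y[:, k:]
    Vk = qr_pos(Yk)[:, :k]                                   # step 1
    d = np.linalg.norm(Vk @ (Vk.T @ Yrest) - Yrest, axis=0)  # step 3
    i_hat = int(np.argmin(d))                # steps 4-6 (heuristic stand-in)
    x_hat = sample_M_hat(Xk, Yk, rng).T @ Yrest[:, i_hat]    # steps 7-8
    return i_hat, x_hat

# Hypothetical demo: n = 5 attributes, m = 12 records, k = 3 known columns.
rng = np.random.default_rng(2)
n, m, k = 5, 12, 3
X = rng.standard_normal((n, m))
M_T = np.linalg.qr(rng.standard_normal((n, n)))[0]
Y = M_T @ X
i_hat, x_hat = known_io_attack(X[:, :k], Y, rng)
rel_err = np.linalg.norm(x_hat - X[:, k + i_hat]) / np.linalg.norm(X[:, k + i_hat])
print(i_hat, rel_err)   # an eps-privacy breach occurs iff rel_err <= eps
```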
7. Experimental Results

We took the student records of Vikram University and applied the input/output attack to this original data. From the data we generated an orthogonal matrix with the help of the Gram-Schmidt process; we then calculated the inverse of the orthogonal matrix and applied it to the perturbed data. We plotted the original data, the perturbed data, and the recovered data: Figure 1 shows the original and perturbed data, and Figure 2 shows the original and recovered data.

[Figure 1: Original and Perturbed Data after the CAMDP technique]
[Figure 2: Original and Recovered Data after the I/O Attack]

8. Discussion

The above graphs show that after applying the input/output attack to data perturbed by the CAMDP technique, the attacker cannot recover the original data. Hence the technique preserves the required privacy.

9. Conclusion

In this research paper we examined the CAMDP (Combination of Additive and Multiplicative Data Perturbation) technique for privacy preserving data mining. The technique is a linear combination of translation-based and distance-preserving perturbation. Perturbation techniques are often evaluated with two basic metrics: the level of privacy guaranteed and the level of model-specific data utility preserved, the latter often measured by the loss of accuracy for data clustering. The experimental results show that the technique provides a proper degree of privacy; by using it, data owners can share their data with data miners to find accurate clusters without any concern about violating data privacy. Using the data perturbation algorithm, we generated different perturbed data sets, and in a second step we applied clustering and classification algorithms to them. We carried out a set of experiments to build clustering and classification models of the original and perturbed data sets, and the results were evaluated on accuracy parameters. The proposed algorithm can perturb sensitive attributes with numerical values. Hence the technique offers higher privacy protection than orthogonal transformation-based distance-preserving perturbation, and higher accuracy than projection-based data perturbation for privacy preserving data mining.

10. References

[1] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," in Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 439-450, Dallas, Texas, May 2000. ACM Press.
[2] M. Kantarcioglu and C. Clifton, "Privacy Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data," in SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Madison, WI, June 2002.
[3] H. S. M. Coxeter, Regular Polytopes, 2nd ed., 1963, ch. XII, pp. 213-217.
[4] G. W. Stewart, "The efficient generation of random orthogonal matrices with an application to condition estimation," SIAM Journal on Numerical Analysis, vol. 17, no. 3, pp. 403-409, 1980.
[5] B. Pandya, U. K. Singh and K. Dixit, "An Analysis of Euclidean Distance Preserving Perturbation for Privacy Preserving Data Mining," International Journal for Research in Applied Science and Engineering Technology, vol. 2, issue X, 2014.
[6] B. Pandya, U. K. Singh and K. Dixit, "Performance of Euclidean Distance Preserving Perturbation for K-Means Clustering," International Journal of Advanced Scientific and Technical Research, vol. 5, issue 4, pp. 282-289, 2014.
[7] B. Pandya, U. K. Singh and K. Dixit, "Performance of Euclidean Distance Preserving Perturbation for K-Nearest Neighbour Classification," International Journal of Computer Application, vol. 105, no. 2, pp. 34-36, 2014.