Multiplicative Data Perturbations

Outline
  Introduction
  Multiplicative data perturbations
    Rotation perturbation
    Geometric data perturbation
    Random projection
  Understanding
    Distance preservation
    Perturbation-invariant models
  Attacks
    Privacy evaluation model
    Background knowledge and attack analysis
    Attack-resilient optimization
  Comparison

Summary on additive perturbations
  Problems
    Weak to various attacks
    Need to publish the noise distribution
      The column distribution is known
    Need to develop/revise data mining algorithms in order to utilize perturbed data
      So far, we have only seen that decision tree and naïve Bayes classifiers can utilize additive perturbation
  Benefits
    Can be applied to both the Web model and the collaborative data pooling model
    Low cost

More thoughts about perturbation
  1. Preserve privacy
    Hide the original data
      Not easy to estimate the original values from the perturbed data
    Protect from data reconstruction techniques
      The attacker has prior knowledge on the published data
  2. Preserve data utility for tasks
    Single-dimensional info
      Column data distribution, etc.
    Multi-dimensional info
      Covariance matrix, distance, etc.

For most PP approaches…
  Privacy guarantee ?
  Data utility / model accuracy
  [Figure: privacy guarantee vs. data utility / model accuracy]
  • Difficult to balance the two factors
  • Subject to attacks
  • May need new DM algorithms: randomization, cryptographic approaches

Multiplicative perturbations
  Geometric data perturbation (GDP)
    Rotation data perturbation + translation data perturbation + noise addition
  Random projection perturbation (RPP)
  Sketch approach

Definition of Geometric Data Perturbation
  G(X) = R*X + T + D
    R: random rotation
    T: random translation
    D: random noise, e.g., Gaussian noise
  Characteristics: R and T preserve distances; D slightly perturbs distances
  Example (each column is one record, ID 001 and ID 002):

    G(X)       001     002
    age       1158     943
    rent      3143    2424
    tax      -2919   -2297

    = R
       .83   -.40    .40
       .2     .86    .46
       .53    .30   -.79

    * X        001     002
    age         30      25
    rent      1350    1000
    tax       4230    3320

    + T
        12      12
        18      18
        30      30

    + D
       -.4     .29
      -1.7    -1.1
       .13     1.2

  * Each component has its use in enhancing resilience to attacks!

Benefits of Geometric Data Perturbation
  Privacy guarantee is decoupled from data utility / model accuracy
    Makes optimization and balancing easier!
    - Almost fully preserves model accuracy - we optimize privacy only
  Applicable to many DM algorithms
    - Distance-based clustering
    - Classification: linear, KNN, kernel, SVM, …
  Resilient to attacks
    - The result of attack research

Limitations
  Multiplicative perturbations are mostly used in outsourcing
    Cloud computing
    Can be applied to multiparty collaborative computing in some cases
  Web model does not fit: perturbation parameters cannot be published

Definition of Random Projection Perturbation
  F(X) = P*X
    X is an m*n matrix for a dataset with m columns (attributes) and n rows (records); each record is one column vector of X
    P is a k*m random matrix, k <= m
  Johnson-Lindenstrauss Lemma
    There is a random projection F() such that, for a small number e < 1,
    (1-e)||x-y|| <= ||F(x)-F(y)|| <= (1+e)||x-y||
    i.e., distance is approximately preserved.
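
The two definitions above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the QR-based way of drawing R, the uniform translation T, and the 1/sqrt(k) scaling of the Gaussian matrix P are all assumptions made for illustration. Records are stored as columns of X, as in the example above.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a random d x d orthogonal matrix via QR (assumed construction).
    It may include a reflection; distances are preserved either way."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # fix column signs

def gdp(X, sigma, rng):
    """G(X) = R X + T + D for a d x n matrix X (one record per column)."""
    d, n = X.shape
    R = random_orthogonal(d, rng)
    T = rng.uniform(0.0, 1.0, size=(d, 1))    # same translation for every record
    D = rng.normal(0.0, sigma, size=(d, n))   # Gaussian noise; sigma=0 means no noise
    return R @ X + T + D

def rpp(X, k, rng):
    """F(X) = P X with a k x d Gaussian matrix, scaled by 1/sqrt(k) (assumption)."""
    d, _ = X.shape
    P = rng.standard_normal((k, d)) / np.sqrt(k)
    return P @ X

def pairwise_distances(X):
    """Euclidean distances between all pairs of columns of X."""
    diff = X[:, :, None] - X[:, None, :]
    return np.sqrt((diff ** 2).sum(axis=0))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))             # 20 attributes, 50 records
d0 = pairwise_distances(X)

# Largest change in any pairwise distance under each perturbation.
err_rt = np.abs(pairwise_distances(gdp(X, 0.0, rng)) - d0).max()
err_noise = np.abs(pairwise_distances(gdp(X, 0.1, rng)) - d0).max()
err_rpp = np.abs(pairwise_distances(rpp(X, 10, rng)) - d0).max()
print(err_rt, err_noise, err_rpp)
```

With sigma = 0 the distance error should be at floating-point level, since R and T preserve distances exactly; noise and projection only preserve them approximately.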
Comparison between GDP and RPP
  Privacy preservation
    Subject to similar kinds of attacks
    RPP is more resilient to distance-based attacks
  Utility preservation (model accuracy)
    GDP preserves distances well
    RPP approximately preserves distances
      Model accuracy is not guaranteed

Illustration of multiplicative data perturbation
  Preserving distances while perturbing each individual dimension

A model “invariant” to GDP …
  If distance plays an important role
  Class/cluster members and decision boundaries are correlated in terms of distance, not the concrete locations
  [Figure: 2D example with Class 1, Class 2, and the classification boundary. Panels: original; after rotation and translation; after distance perturbation (noise addition), where the classification boundary is slightly changed.]

Applicable DM algorithms
  Modeling methods that depend on Euclidean geometric properties
  Models “invariant” to GDP
    All Euclidean-distance-based clustering algorithms
    Classification algorithms
      K nearest neighbors
      Kernel methods
      Linear classifiers
      Support vector machines
    Most regression models
    And potentially more …

When to Use Multiplicative Data Perturbation
  [Figure labels: Data Owner; Service Provider/data user; G(X) = RX + T + D; G(X); Apply F to G(Xnew); F(G(X), ); Mined models/patterns]
  Good for the outsourcing model.
  Major issue!! Curious service providers/data users try to break G(X)

Major issue: attacks!
  Many existing privacy-preserving methods are found to be not so effective when attacks are considered
  Ex: various data reconstruction algorithms against the random noise addition approach [Huang05][Guo06]

Prior knowledge
  Service provider Y has “PRIOR KNOWLEDGE” about X’s domain, and nothing stops Y from using it to infer information in the sanitized data

Knowledge used to attack GDP
  Three levels of knowledge
    Know nothing: naïve estimation
    Know column distributions: Independent Component Analysis
    Know specific input-output records (original points and their images in perturbed data): distance inference

Methodology of attack analysis
  An attack is an estimate of the original data
    Original O(x1, x2, …, xn) vs. estimate P(x’1, x’2, …, x’n)
    How similar are these two series?
    One effective method is to evaluate the MSE of the estimation: VAR(P-O) or STD(P-O)

Two multi-column privacy metrics
  qi: privacy guarantee for column i
    qi = std(Pi - Oi), where Oi is the normalized column values and Pi is the estimated column values
  Min privacy guarantee: the weakest link of all columns
    min { qi, i = 1..d }
  Avg privacy guarantee: overall privacy guarantee
    $\frac{1}{d} \sum_{i=1}^{d} q_i$

Alternative metric
  Based on Agrawal and Aggarwal’s information-theoretic measure: loss of privacy
    $PI = 1 - 2^{-I(X; \hat{X})}$, where $\hat{X}$ is the estimate of X
    $I(X; \hat{X}) = H(X) - H(X \mid \hat{X}) = H(X) - H(\text{estimation error})$
  Exact estimation: $H(X \mid \hat{X}) = 0$, $PI = 1 - 2^{-H(X)}$
  Random estimation: $I(X; \hat{X}) = 0$, $PI = 0$
  Already normalized for different columns

Attack 1: naïve estimation
  Estimate original points purely based on the perturbed data
  If using “random rotation” only
    Intensity of perturbation matters
    Points around the origin
  [Figure: X-Y plot with Class 1, Class 2, and the classification boundary]

Counter naïve estimation
  Maximize intensity
    Based on formal analysis of “rotation intensity”
    Method to maximize intensity: Fast_Opt algorithm in GDP
  “Random translation” T
    Hide origin
    Increase difficulty of attacking!
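
The column-wise privacy metrics and the role of rotation intensity under naïve estimation can be sketched as follows. This is only an illustration under stated assumptions: z-score normalization is assumed (the slides do not say which normalization is used), the helper names are hypothetical, and the 2D rotation angles are arbitrary. Each row of the matrix holds one dataset column (attribute).

```python
import numpy as np

def normalize(X):
    """Z-score each attribute (row) of a d x n matrix (assumed normalization)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def privacy_guarantees(O, P):
    """Return (q, min q, average q) with q_i = std(P_i - O_i) per attribute."""
    q = np.std(P - O, axis=1)
    return q, q.min(), q.mean()

def rotation_2d(theta):
    """2D rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(1)
O = normalize(rng.standard_normal((2, 1000)))

# Naive estimation: the attacker takes the perturbed data itself as the estimate.
results = {}
for theta in (0.0, 0.05, np.pi / 2):
    P = rotation_2d(theta) @ O
    _, q_min, q_avg = privacy_guarantees(O, P)
    results[theta] = (q_min, q_avg)
    print(f"theta={theta:.2f}  min q={q_min:.3f}  avg q={q_avg:.3f}")
```

A rotation close to the identity leaves the perturbed values close to the originals, so the minimum guarantee is small; this is why the intensity of the rotation matters against naïve estimation.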
    With random translation, the attacker needs to estimate R first in order to find T

Attack 2: ICA-based attacks
  Independent Component Analysis (ICA)
    Try to separate R and X from Y = R*X

Characteristics of ICA
  1. Ordering of dimensions is not preserved.
  2. Intensity (value range) is not preserved.
  Conditions for an effective ICA attack
  1. Knowing the column distributions
  2. Knowing the value ranges

Counter ICA attack
  Weaknesses of the ICA attack
    Needs a certain amount of knowledge
    Cannot effectively handle dependent columns
  In reality…
    Most datasets have correlated columns
    We can find an optimal rotation perturbation that maximizes the difficulty of ICA attacks

Attack 3: distance-inference attack
  With only rotation/translation perturbation, when the attacker knows a set of original points and their mapping…
  [Figure: original and perturbed data, with known points and their images marked]

How is the attack done …
  Knowing points and their images …
    Find exact images of the known points
      Enumerate pairs by matched distances
      … Less effective for large data
      … We assume pairs are successfully identified
  Estimation
    1. Cancel random translation T from pairs (x, x’)
    2. Calculate R with pairs: Y = RX, so $R = Y X^{-1}$
    3. Calculate T with R and known pairs

Counter distance-inference: noise addition
  Noise brings enough variance into the estimation of R and T
  Now the attacker has to use regression to estimate R
    Then use the approximate R to estimate T, which increases uncertainty
  Regression
    1. Cancel random translation T from pairs (x, x’)
    2. Estimate R with pairs: Y = RX, so $R = (Y X^T)(X X^T)^{-1}$
    3. Use the estimated R and known pairs to estimate T

Discussion
  Can the noise be easily filtered?
    Need to know the noise distribution and the distribution of RX + T
    Neither distribution is published, however.
    Attack analysis will be different from that for noise addition data perturbation
    Will PCA-based noise filtering [Huang05] be effective?
  What is the best estimate that the attacker can get?
    If we treat the attack problem as a learning problem:
      Minimum variance of error for the learner
      Upper bound of “loss of privacy”

Attackers with more knowledge?
  What if attackers know a large number of original records?
    Able to accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data
    Methods such as PCA, AK_ICA, etc. can be used
  What do we do?
    If you have released so much original information…
    Stop releasing any more data

A randomized perturbation optimization algorithm
  Start with a random rotation
  Goal: passing tests on simulated attacks
  Not simply random: a hill-climbing method
  1. Iteratively determine R
    - Test on naïve estimation (Fast_Opt)
    - Test on ICA (2nd level)
    to find a better rotation R
  2. Append a random translation component
  3. Append an appropriate noise component

Comparison of methods
  Privacy preservation
    In general, RPP should be better than GDP
    Evaluate the effect of attacks for GDP
      ICA and distance perturbation need experimental evaluation
  Utility preservation
    GDP
      R and T exactly preserve distances
      The effect of D needs experimental evaluation
    RPP
      # of perturbed dimensions vs. utility
  Datasets
    12 datasets from the UCI Data Repository

Privacy guarantee: GDP
  In terms of naïve estimation and ICA-based attacks
  Use only the random rotation and translation (R*X + T) components
  [Figure series: optimized for naïve estimation only; optimized perturbation for both attacks; worst perturbation (no optimization)]

Privacy guarantee: GDP
  In terms of distance-inference attacks
  Use all three components (R*X + T + D)
  Noise D: Gaussian $N(0, \sigma^2)$
  Assume pairs of (original, image) are identified by attackers
  [Figure annotations: with no noise addition, privacy guarantee = 0; considerably high PG around small perturbation $\sigma = 0.1$]

Data utility: GDP with noise addition
  Noise addition vs. model accuracy
  Noise: $N(0, 0.1^2)$

Data utility: RPP
  Reduced # of dims vs. model accuracy
  [Figure panels: KNN classifiers, SVMs, perceptrons]
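
The distance-inference experiment described above (known pairs, regression estimate of R and T, with and without noise D) can be simulated in a small sketch. This is a hypothetical setup, not the authors' experiment: the synthetic data, the 20 known pairs, the QR-based rotation, and cancelling T by centering the known pairs are all assumptions. The privacy guarantee is the minimum column-wise std of the estimation error.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Random d x d orthogonal matrix via QR (assumed construction)."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def regression_attack(X_known, Y_known, Y_all):
    """Estimate R and T from known (original, image) pairs, then invert G."""
    # 1. Cancel T by centering both sides (one possible way to cancel it).
    xc = X_known.mean(axis=1, keepdims=True)
    yc = Y_known.mean(axis=1, keepdims=True)
    Xc, Yc = X_known - xc, Y_known - yc
    # 2. Least-squares estimate R = (Y X^T)(X X^T)^{-1}.
    R_hat = (Yc @ Xc.T) @ np.linalg.inv(Xc @ Xc.T)
    # 3. Estimate T from the estimated R and the known pairs.
    T_hat = yc - R_hat @ xc
    # Apply the inverse of the estimated mapping to all perturbed records.
    return np.linalg.solve(R_hat, Y_all - T_hat)

def min_privacy_guarantee(X, X_hat):
    """min_i std(P_i - O_i) over attributes (rows)."""
    return np.std(X_hat - X, axis=1).min()

rng = np.random.default_rng(2)
d, n, m = 5, 500, 20                          # attributes, records, known pairs
X = rng.standard_normal((d, n))
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
R = random_orthogonal(d, rng)
T = rng.uniform(0.0, 1.0, size=(d, 1))

pg = {}
for sigma in (0.0, 0.1):
    Y = R @ X + T + rng.normal(0.0, sigma, size=(d, n))
    X_hat = regression_attack(X[:, :m], Y[:, :m], Y)
    pg[sigma] = min_privacy_guarantee(X, X_hat)
    print(f"sigma={sigma}: min privacy guarantee = {pg[sigma]:.4f}")
```

Without noise the attacker recovers R and T exactly, so the guarantee collapses to zero up to rounding; with noise, the residual error in the estimate keeps it above zero.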