Download dummy

Secure Data Outsourcing Outline     Motivation Background Research issues Summary Motivation  Cost of maintaining/mining large data  4-5 times of the cost of data acquisition  DBAs are paid well   More and more data service providers  Low cost – cloud computing  Maintain one database for one user  multiple users  Examples:     Alentus.com Datapipe.com Discountasp.net …  Concerns about data security and privacy  Untrusted service provider Un-trusted service provider  Lazy: incentives to perform less  Curious: incentives to acquire information  Malicious:  Denial of service  Incorrect results  Possibly compromised Challenges  Data confidentiality  Data need to be encrypted (?)  Utility of protected data?  Query utility  Mining utility  Access pattern privacy  Integrity  Data integrity  Query integrity    Correct Complete Fresh Why is it hard for query services?  Arbitrary expressivity  SQL statements  Often, restricted for certain type of query for simplicity (e.g. range query, knn query)  Cost  Communication  Computation (server side vs client side) Why it is hard for mining services?  Many data mining models  Different utilities to preserve  No one-size-for-all solutions Data confidentiality  Bucketization method (crypto-index)  Order preserving encryption  Perturbations Bucketization method  Hacigumus (SIGMOD02)  Main steps  Partition sensitive attributes  Order preserving: supports comparison  Random: query rewriting becomes hard  Build index on the partitions  Rewrite queries to target partitions  ‘john doe’  105  Select * from T’ where name=105  Execute queries and return results  Prune/post-process results on client  Trade off between confidentiality and overhead  Larger partition  increased privacy  increased overheads Order preserving encryption  Agrawal2004, Boldyreva2009  The set of data is securely transformed so that the order is preserved but the distribution and domain are changed  Benefits: indexing/searching on OPE encrypted data  Weakness: once the original distribution is known, OPE is broken  Not attribute-wise order preserving  Order preserving encryption (OPE, Agrawal et al 2004) is not resilient to distribution-based attacks OPE Original Xi distribution is known Bucket based Estimation Transformed Xi’ distribution Data perturbation  Definition 1. randomly change the original data 2. the attacker cannot effectively recover the original data 3. the desired properties are preserved  Techniques  Single dimension: noise addition  Multidimensional  Geometric perturbation  Random projection  RASP random space perturbation Noise addition  Y = X+ R  X: original data column, R: random noise (distribution published), Y: published data  Applications in data mining  Reconstructing column distribution  Rakesh Agrawal SIGMOD 2000  Applied to privacy-preserving decision tree, naïve bayes classifier  Attacks  Spectral filtering (Kargupta ICDM 2004)  PCA reconstruction (Huang  Multiplicative perturbations  Geometric data perturbation for outsourced data mining  Random Projection  RASP perturbation for query services (range query, kNN query). Perturbation-based framework Mining service Geometric data perturbation  Y=RX+T+D  R: secret rotation matrix (preserve Euclidean distances)  T: secret random translation matrix, D: secret random noise matrix  Distances are approximately preserved (D)  Resilient to most attacks to rotation perturbation  Applications  Outsourced privacy preserving data mining, applicable for many classification and clustering algorithms  Attacks  Population based attacks (when covariance matrix is revealed) Random Projection  Y=AX+D  A: random projection, e.g., entries from N(0,1)  Distances are approximately preserved  Applications  Many classification and clustering algorithms  Worse accuracy than geometric perturbation  Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)  Attacks  Possibly more resilient than other two perturbation methods  But utility (distance) is not well preserved RASP perturbation k-dimensional numeric data, n records, represented as a k x n matrix, x: a record (1) Extend x to k+2 dimensions - (K+1) th dimension is always 1 – homogeneous dimension (K+2) th dimension v is a real random number drawn from (2) Encryption - A is a (k+2)x(k+2) invertible real value matrix, with at least two non-zero values for each row and the last column of A has all non-zero values - A is shared by all records  Properties  Not an OPE  Preserves convexity of the dataset  Convex dataset in Rk  another convex dataset in Rk+2.  Good for range query  Each range query in Rk  hyperplane based query  range query in Rk+2 . RASP properties  Convexity preserving   Queried range (hypercube) is convex RASP transforms the range to another convex (polyhedron) half space: wTx<=a wTx=a The intersection of convex sets is also convex. illustration of convexity preserving Original space Encrypted space Secure query transformation  A naïve solution  Based on the convexity preserving property Problems: (1) A-1 can be probed (2) is . . If a is known, the whole dimension i is breached. Secure query transformation  Enhanced solution  Xk+2 is always positive  (Xi-a)  0  (Xi-a)Xk+2  0  Correspondingly, in the encrypted space yTy  0, Problems addressed: (1) A-1 cannot be derived from  (2) (Xi-a)Xk+2  0 contains the random component Xk+2 that protects the condition (Xi-a)  0 Efficient two-stage query processing  illustrated Stage2: Filter out the junk records Stage1: Querying this bounding box Original space Transformed space A multidimensional tree index is been built on the encrypted data (in the transformed space) in the server. Stage 1: The client calculates the large bounding box; The server uses the index to find the results. Stage 2: filter the initial results with the conditions yTiy  0 for 1…2k Note: the two-stage strategy works, if the output of stage 1 is significantly smaller than the original database and can be fit into the memory. Otherwise, use linear scan with stage 2 filtering. RASP-based data mining  Preserving range query  linear classifier  Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013) Access pattern privacy  On database queries  Problem is the same as PIR  Attackers may use the access pattern to breach data confidentiality  Each of previous approaches should handle this problem! PIR is impractical  Solutions based on private Information retrieval (PIR)  PIR is still impractical For Bucktization approach  Based on the architecture of Hacigumus (SIGMOD02)  Hore VLDB04  For range query  Privacy concern: reveal the distribution of value in each bucket  “Diffusion”: split buckets and combine parts of different buckets  Trade off: now the server needs to return more noisy results  larger size For OPE  Use queries to find out the distributions, then break the encryption For RASP  Secure query transformation  Attacks to transformed queries Oblivious RAM  Access pattern: read/write data items  Setting:  Client has a small secure memory  Server has large insecure storage, semihonest  Data items are encrypted  Client cannot hide the accessed locations  An active area Existing Approaches Dummy Block Real Block Real Block Dummy Block Real Block Dummy Block Dummy Block Real Block  Inside a level  Some real blocks  Useful data  Some dummy blocks  Random data  Randomly permuted  Only the client knows the permutation Existing Approaches  Reading  Read a block from each level  One real block.  Remaining are dummy blocks dummy real dummy dummy dummy dummy Server Client Existing Approaches Server (before) Client Server (after)  Writing  Shuffle consecutively filled levels.  Write into next unfilled level.  Clear the source levels shuffle blocks Continuous Shuffling … To write:  The Problem with Existing Approaches  Integrity guarantee  Merkle hash tree H(H(x1)+H(x2)) , + is string concatenation Can be stored with tree like structure : index, xml  Hash chains Query correctness with merkle by Devanbu et. al. Using merkle tree Example: 5<=q<=10 LUB(q) = 4 GLB(q) = 11  Operations:  Selections, projections, equijoins, set ops  Issues  Works only on data with verification objects  Query expressiveness  Expensive  Related work  Pang et. al (ICDE04, SIGMOD05), using ElGamal function  Sion VLDB05: challenge token  F.Li SIGMOD06: freshness Secure keyword search  Simple information retrieval  For a keyword, find the documents containing the keyword  What if the documents are encrypted word by word  and if the keyword is also encrypted Secure keyword search  Song 2000 •Seed is random, different for each Wi •Key idea: Li and Ri are selfverifiable •Advantage of XOR How to set K?  Setting of ki  Ki = Fk’(Wi), k’ is secret  User publishes W and k = Fk’(W)  Server checks CiW  whether <Li, Fk(Li)> == CiW It reveals nothing if Ci is not the ciphertext for W. And Li is random for different Wi – server cannot find any information from Li. Hidden search  In previous schemes, W is revealed  Weakness: each search will have to release k for W  Easy to collect information  Solution: encrypt Wi with an private key, then xor with <Li, Fk(Li)> Recent developments  Reza 2006  “Searchable symmetric encryption: improved definitions and efficient constructions”  Completely solved this problem, with a solution indistinguishability under chosen ciphertext attack (IND-CCA) Trusted hardware Possible benefits Discussion  Data confidentiality/access pattern  Restrict cryptographic definition (keyword search) or  Relaxed definition (perturbation, bucketization, OPE, etc.)  It is very difficult to formulate and prove the security of non-traditional approaches  Do we need to reformulate the security model? and how?

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download dummy