University of Rome “La Sapienza”, Computer Science Department
Mauro Mezzini

ANSWERING SUM-QUERIES: A SECURE AND EFFICIENT APPROACH

Introduction

Statistical database: users are allowed to ask for statistical information, such as sum, count, average, max and min queries, on a numerical attribute.

Retail

PRODUCT      SALES(€)
storage      90,000
router       30,000
server       30,000
mainframe    25,000

select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';

r = 120,000

Definition: the target $K$ of a query $q$. For the query above, the target is $K$ = {storage, router}.

The efficiency issue

To speed up the answering of a sum-query, the query system is endowed with a set of pre-computed sum-queries, called the set of materialised views.

q1: select sum(SALES) from Retail;
    $r_1$ = 175,000
q2: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
    $r_2$ = 120,000
q:  select sum(SALES) from Retail where PRODUCT = 'server' or PRODUCT = 'mainframe';
    $r = r_1 - r_2$ = 55,000

Protection issue

To protect the confidentiality of the numerical attribute, the query system is endowed with the list of all sensitive categories. On the Retail table above, examples of queries on a single category are:

q1: select sum(SALES) from Retail where PRODUCT = 'storage';
q2: select sum(SALES) from Retail where PRODUCT = 'router';

Refusing such queries is not enough. Consider:

q1: select sum(SALES) from Retail where PRODUCT = 'router' or PRODUCT = 'server';
    $r_1$ = 120,000
q2: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'server';
    $r_2$ = 60,000
q3: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
    $r_3$ = 120,000

\[
x_1 + x_2 = r_1, \qquad x_2 + x_3 = r_2, \qquad x_1 + x_3 = r_3
\]

The values of all the confidential data can be inferred from the answers to the non-confidential queries {q1, q2, q3}: the system above has the unique solution $x_1$ = 90,000, $x_2$ = 30,000, $x_3$ = 30,000.

The inference model

Efficiency: given a set of sum-queries $V = \{q_1, \dots, q_n\}$ and a sum-query $q$, determine whether the result of $q$ can be inferred from $V$.
Protection: given a set of sum-queries $V = \{q_1, \dots, q_n\}$, determine, for every inferable sum-query $q$, whether the result of $q$ is sensitive information.

Let $V = \{q_1, q_2, \dots, q_n\}$, and let $K_i$ and $r_i$ be the target and the result of $q_i$, respectively. Let $P = \{C_1, C_2, \dots, C_m\}$ be the coarsest partition of $\bigcup_{i=1,\dots,n} K_i$ such that each $K_i$ can be obtained as the union of one or more elements of $P$.

The inference model is based on the following system of linear constraints:

\[
\sum_{j=1}^{m} a_{i,j}\, x_j = r_i \quad (i = 1, \dots, n), \qquad x \in F^m \tag{1}
\]

where $a_{i,j} = 1$ if $C_j \subseteq K_i$ and $a_{i,j} = 0$ otherwise, and $F$ is the domain of the numerical attribute.

The inference model. An example

q1: select sum(SALES) from Retail where PRODUCT = 'router' or PRODUCT = 'server';
    $K_1$ = {router, server}, $r_1$ = 120,000
q2: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'server';
    $K_2$ = {storage, server}, $r_2$ = 60,000
q3: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
    $K_3$ = {storage, router}, $r_3$ = 120,000

$C_1$ = {router}, $C_2$ = {server}, $C_3$ = {storage}

\[
x_1 + x_2 = r_1, \qquad x_2 + x_3 = r_2, \qquad x_1 + x_3 = r_3
\]

$F$ is the set of non-negative reals.

Definition: given a subset $S$ of $\{1, 2, \dots, m\}$, the sum-expression $\sum_{j \in S} x_j$ is an $F$-invariant if it takes on the same value for every solution $x$ of (1). An $F$-invariant sum is the result of the sum-query with target $\bigcup_{j \in S} C_j$.

Definitions: given the partition $P = \{C_1, \dots, C_m\}$ and a query $q$ with target $K$, consider the two sets

$S = \{\, j : C_j \subseteq K \,\}$, the support of $q$;
$\bar{S} = \{\, j : C_j \cap K \neq \emptyset \text{ and } C_j - K \neq \emptyset \,\}$, the cosupport of $q$.

The sum $\sum_{j \in S \cup \bar{S}} x_j$ is called the sum-expression associated to $q$.
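The definitions of the partition $P$, the coefficients $a_{i,j}$, the support and the cosupport can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: targets are represented as Python sets of category names, and the function names (`coarsest_partition`, `constraint_matrix`, `support_and_cosupport`) are hypothetical, not part of the original material. It relies on the observation that two categories can share a block of $P$ exactly when they belong to the same targets.

```python
from collections import defaultdict

def coarsest_partition(targets):
    """Coarsest partition of the union of the targets such that every
    target is a union of blocks: two categories share a block exactly
    when they belong to the same targets."""
    membership = defaultdict(set)      # category -> indices of the targets containing it
    for i, target in enumerate(targets):
        for category in target:
            membership[category].add(i)
    blocks = defaultdict(set)          # membership signature -> block of categories
    for category, indices in membership.items():
        blocks[frozenset(indices)].add(category)
    # Sorting only makes the numbering C_1, ..., C_m deterministic.
    return sorted((frozenset(b) for b in blocks.values()), key=sorted)

def constraint_matrix(targets, partition):
    """a[i][j] = 1 if block C_j is contained in target K_i, else 0."""
    return [[1 if block <= target else 0 for block in partition]
            for target in targets]

def support_and_cosupport(target, partition):
    """Support and cosupport of a query target, with blocks numbered from 1."""
    support = {j for j, block in enumerate(partition, start=1)
               if block <= target}
    cosupport = {j for j, block in enumerate(partition, start=1)
                 if block & target and block - target}
    return support, cosupport

# The three targets K_1, K_2, K_3 of the example.
targets = [{"router", "server"}, {"storage", "server"}, {"storage", "router"}]
P = coarsest_partition(targets)
A = constraint_matrix(targets, P)
print(A)                                        # [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(support_and_cosupport({"storage"}, P))    # ({3}, set())
```

With the alphabetical ordering used here, the blocks come out as {router}, {server}, {storage}, which happens to match $C_1$, $C_2$, $C_3$ of the example.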
An example

q1: select sum(SALES) from Retail where PRODUCT = 'router' or PRODUCT = 'server';
    $K_1$ = {router, server}, $r_1$ = 120,000
q2: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'server';
    $K_2$ = {storage, server}, $r_2$ = 60,000
q3: select sum(SALES) from Retail where PRODUCT = 'storage' or PRODUCT = 'router';
    $K_3$ = {storage, router}, $r_3$ = 120,000

$C_1$ = {router}, $C_2$ = {server}, $C_3$ = {storage}

\[
x_1 + x_2 = r_1, \qquad x_2 + x_3 = r_2, \qquad x_1 + x_3 = r_3
\]

q: select sum(SALES) from Retail where PRODUCT = 'storage';
   $K$ = {storage}

The support of $q$ is $\{3\}$, the cosupport is empty, and the sum-expression associated to $q$ is trivially $x_3$.

Problem definitions

1) Given a sum-expression $\sum_{j \in S} x_j$, decide whether it is an $F$-invariant.
2) Given a sum-expression $\sum_{j \in S} x_j$ that is not an $F$-invariant, find a nonempty subset $S'$ of $S$ such that $\sum_{j \in S'} x_j$ is an $F$-invariant.

Problem (2)

Let $S$ be a subset of $\{1, \dots, m\}$ and let $s$ be the characteristic vector of $S$, that is, for $i = 1, \dots, m$:

\[
s(i) = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases}
\]

We can rewrite system (1) as $A x = r$, $x \in F^m$. An $m$-dimensional vector $f$ is a linear combination of the rows of $A$ if

\[
f = \sum_{i=1}^{n} \alpha_i a_i, \qquad \alpha_i \in \mathbb{R},
\]

where $a_i$ is the $i$-th row of $A$.

Definition: a subset $S$ of $\{1, 2, \dots, m\}$ is said to be algebraic if its characteristic vector can be expressed as a linear combination of the rows of the matrix $A$.

If $F$ is $\mathbb{R}$ or $\mathbb{Z}$, then $\sum_{j \in S} x_j$ is $F$-invariant if and only if $S$ is algebraic.

The NAS Problem

Problem definition: given a sum-expression $\sum_{j \in S} x_j$ that is not $\mathbb{R}$-invariant, find a non-empty algebraic subset of $S$ (NAS Problem).

NAS Problem: find a non-empty subset $F$ of $S$ such that the characteristic vector of $F$ is expressible as a linear combination of the rows of $A$.

The subset sum problem (SSP): given a set $S = \{1, \dots, p\}$ and a mapping $a : S \to \mathbb{Z}$ such that $a(i) > 0$ for $i = 1, \dots, p-1$ and $a(p) < 0$, find a non-empty subset $F$ of $S$ such that $\sum_{i \in F} a(i) = 0$.

Let $c$ be a $q$-dimensional vector, with $q \geq p$, such that

\[
c(1) = a(1), \quad c(2) = a(2), \quad \dots, \quad c(p) = a(p), \qquad c(j) \in \mathbb{R} \ \text{ for } p < j \leq q .
\]

Let $M = (I, c)$ be the $q \times (q+1)$ matrix obtained from $c$.

Example: let $S = \{1, 2, 3, 4\}$ and $a(1) = 1$, $a(2) = 2$, $a(3) = 5$, $a(4) = -7$. The subset $F = \{2, 3, 4\}$ of $S$ is a solution of the SSP, since $a(2) + a(3) + a(4) = 2 + 5 - 7 = 0$.

If we choose $q = 5$, the vector $c$ is $(1, 2, 5, -7, \alpha)$ and the matrix $M$ is

\[
M = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 2 \\
0 & 0 & 1 & 0 & 0 & 5 \\
0 & 0 & 0 & 1 & 0 & -7 \\
0 & 0 & 0 & 0 & 1 & \alpha
\end{pmatrix}
\]

The vector $\bar{c} = (-c, 1)$ is a solution of the equation $M y = 0$:

\[
\begin{aligned}
y_1 + 1\, y_6 &= 0 \\
y_2 + 2\, y_6 &= 0 \\
y_3 + 5\, y_6 &= 0 \\
y_4 - 7\, y_6 &= 0 \\
y_5 + \alpha\, y_6 &= 0
\end{aligned}
\]

Theorem: given the matrix $M$ and the set $S = \{1, \dots, p\}$, the SSP has a solution if and only if there exists a nonempty algebraic subset of $S$.

Proof: the $(q+1)$-dimensional vector $\bar{c} = (-c, 1)$ spans the null space of $M$, since $M \bar{c} = 0$ and the null space of $M$ has dimension one. If $F \subseteq S$ is an algebraic set, then its characteristic vector $f$ is expressible as a linear combination of the rows of $M$. Since $f$ and $\bar{c}$ are orthogonal,

\[
\sum_{i=1}^{q+1} f(i)\, \bar{c}(i) = 0, \qquad \text{that is} \qquad 0 = \sum_{i \in F} \bar{c}(i) = - \sum_{i \in F} a(i) .
\]

Conversely, if $\sum_{i \in F} a(i) = 0$, then $f$ is orthogonal to $\bar{c}$ and therefore lies in the row space of $M$, which is the orthogonal complement of the null space. qed.
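The correspondence stated by the theorem can be checked on the small example with a brute-force Python sketch. This is an illustration only, not an efficient procedure (it enumerates all subsets). The function names are hypothetical, and the value 1/3 chosen for the free entry $\alpha$ is an arbitrary assumption. Because $M = (I, c)$, the only possible coefficients for writing $f$ as a combination of rows are the first $q$ entries of $f$ itself, which makes the row-space test exact.

```python
from fractions import Fraction
from itertools import combinations

def build_M(c):
    """M = (I, c): the q x q identity with c appended as last column."""
    q = len(c)
    return [[1 if i == j else 0 for j in range(q)] + [c[i]] for i in range(q)]

def in_row_space(M, f):
    """Exact test: since M = (I, c), a combination of rows with
    coefficients lam has first q entries equal to lam, so lam = f[:q]."""
    q = len(M)
    combo = [sum(f[i] * M[i][j] for i in range(q)) for j in range(q + 1)]
    return combo == list(f)

def nonempty_algebraic_subsets(c, p):
    """All non-empty F within {1, ..., p} whose characteristic vector
    lies in the row space of M (exponential enumeration)."""
    M = build_M(c)
    q = len(c)
    found = []
    for size in range(1, p + 1):
        for F in combinations(range(1, p + 1), size):
            f = [1 if (j + 1) in F else 0 for j in range(q)] + [0]
            if in_row_space(M, f):
                found.append(F)
    return found

# a(1..4) from the example; alpha = 1/3 is an arbitrary assumed value.
a = [1, 2, 5, -7]
c = a + [Fraction(1, 3)]
print(nonempty_algebraic_subsets(c, p=4))   # [(2, 3, 4)]
```

The only algebraic subset found is $\{2, 3, 4\}$, which is exactly the SSP solution of the example.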
Example: let $S = \{1, 2, 3\}$ and $a(1) = 2$, $a(2) = 2$, $a(3) = -4$. Then

$c_0 = (2, 2, -4)$
$c_1 = (-1, -1, 1)$
$c_2 = (-1, -1, 1)$
$c_3 = (2, 2, -2, 1, 1)$

and let $c = (c_0, c_1, c_2, c_3)$. Then $M$ is the following $14 \times 15$ matrix (the last column is $c$; rows 1 to 3 correspond to $c_0$, rows 4 to 6 to $c_1$, rows 7 to 9 to $c_2$ and rows 10 to 14 to $c_3$):

10000000000000 |  2
01000000000000 |  2
00100000000000 | -4
00010000000000 | -1
00001000000000 | -1
00000100000000 |  1
00000010000000 | -1
00000001000000 | -1
00000000100000 |  1
00000000010000 |  2
00000000001000 |  2
00000000000100 | -2
00000000000010 |  1
00000000000001 |  1

In Steps (1) to (6) below, each modified row is the sum of rows of the previous matrix, chosen so that its last entry becomes 0; for example, in Step (1) row 1 is replaced by row 1 + row 4 + row 5.

Step (1)

10011000000000 |  0
01000000000000 |  2
00100000000000 | -4
00010000000000 | -1
00001000000000 | -1
00000100000000 |  1
00000010000000 | -1
00000001000000 | -1
00000000100000 |  1
00000000010000 |  2
00000000001000 |  2
00000000000100 | -2
00000000000010 |  1
00000000000001 |  1

Step (3)

10011000000000 |  0
01000000000000 |  2
00100000000000 | -4
00010100000000 |  0
00001100000000 |  0
00000100000000 |  1
00000010000000 | -1
00000001000000 | -1
00000000100000 |  1
00000000010000 |  2
00000000001000 |  2
00000000000100 | -2
00000000000010 |  1
00000000000001 |  1

Step (4)

10011000000000 |  0
01000011000000 |  0
00100000000000 | -4
00010100000000 |  0
00001100000000 |  0
00000100000000 |  1
00000010100000 |  0
00000001100000 |  0
00000000100000 |  1
00000000010000 |  2
00000000001000 |  2
00000000000100 | -2
00000000000010 |  1
00000000000001 |  1

Step (5)

10011000000000 |  0
01000011000000 |  0
00100000011000 |  0
00010100000000 |  0
00001100000000 |  0
00000100000000 |  1
00000010100000 |  0
00000001100000 |  0
00000000100000 |  1
00000000010000 |  2
00000000001000 |  2
00000000000100 | -2
00000000000010 |  1
00000000000001 |  1

Step (6)

10011000000000 |  0
01000011000000 |  0
00100000011000 |  0
00010100000000 |  0
00001100000000 |  0
00000100000000 |  1
00000010100000 |  0
00000001100000 |  0
00000000100000 |  1
00000000010100 |  0
00000000001100 |  0
00000000000100 | -2
00000000000010 |  1
00000000000001 |  1

Final step

10011000000000 |  0
01000011000000 |  0
00100000011000 |  0
00010100000000 |  0
00001100000000 |  0
00000100000000 |  1
00000010100000 |  0
00000001100000 |  0
00000000100000 |  1
00000000010010 |  0
00000000001010 |  0
00000000000111 |  0
00000000000010 |  1
00000000000001 |  1

For $a(i) > 1$, $i = 1, \dots, p-1$, the vector $c_i$ is

\[
c_i = \Big( -\Big\lfloor \tfrac{a(i)}{2} \Big\rfloor, -\Big\lfloor \tfrac{a(i)}{2} \Big\rfloor, \Big\lfloor \tfrac{a(i)}{2} \Big\rfloor,\;
-\Big\lfloor \tfrac{a(i)}{2^2} \Big\rfloor, -\Big\lfloor \tfrac{a(i)}{2^2} \Big\rfloor, \Big\lfloor \tfrac{a(i)}{2^2} \Big\rfloor,\; \dots,\;
-\Big\lfloor \tfrac{a(i)}{2^{k_i}} \Big\rfloor, -\Big\lfloor \tfrac{a(i)}{2^{k_i}} \Big\rfloor, \Big\lfloor \tfrac{a(i)}{2^{k_i}} \Big\rfloor \Big),
\qquad k_i = \lfloor \log_2 a(i) \rfloor ,
\]

where the halvings and the logarithm are rounded down, as in the following examples:

$a(i) = 7$: $c_i = (-3, -3, 3, -1, -1, 1)$, $k_i = \lfloor \log_2 7 \rfloor = 2$
$a(i) = 8$: $c_i = (-4, -4, 4, -2, -2, 2, -1, -1, 1)$, $k_i = \lfloor \log_2 8 \rfloor = 3$

Let $B = \max\{\, |a(i)| : i = 1, \dots, p \,\}$. The SSP has input dimension $O(p \times \log_2 B)$, and $k_i \leq \log_2 B$. The dimension of the matrix $M$ is $q \times (q+1)$, where

\[
q \leq (p + 1) \times 3 \log_2 B = O(p \times \log_2 B) .
\]

Solving problem (1)

Given $A x = r$, $x \in F^m$: is $\sum_{j \in S} x_j$ an $F$-invariant?

We consider the case in which $A$ is the vertex-edge incidence matrix of a graph, $F$ is the set of reals and $S$ is a singleton.

[Figure: a graph whose vertices are labelled $r_1, \dots, r_6$ and whose edges are labelled $x_1, \dots, x_8$]

Consider the homogeneous system associated to system (1):

\[
A y = 0, \qquad y \in \mathbb{R}^m \tag{2}
\]

We call a solution $y$ of system (2) a circulation.

[Figure: a circulation on the graph, with edge values $0$, $+\alpha$ and $-\alpha$]

Definition: given a circulation $y$, its support is the set $C = \{\, e : y_e \neq 0 \,\}$.

[Figure: the support of a circulation, with edge values $0$, $+\alpha$ and $-\alpha$]

Theorem 1: the unknown $x_e$ is an $\mathbb{R}$-invariant if and only if $e \notin C$ for every circulation $y$ with support $C$.

Proof: let $x^*$ be a particular solution of (1). Then every solution of (1) has the form $x = x^* + y$ for some circulation $y$. So if $y_e = 0$ for every circulation $y$, then $x_e = x^*_e$ for every solution $x$ of (1). Conversely, if $x_e$ is invariant, then $x_e - x^*_e = 0 = y_e$ for every solution $x$ of (1); therefore $y_e = 0$ for every circulation $y$.

Definition: a nonzero circulation $y$ with support $C$ is minimal if there is no nonzero circulation with support $C'$ such that $C' \subset C$.
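Theorem 1 suggests a direct, if not the most efficient, way to test invariance: compute a basis of the circulations (the null space of $A$) and check which coordinates vanish in every basis vector. The Python sketch below does this with exact rational arithmetic. It is a generic Gaussian elimination, not the graph-based method developed in these slides, and the function names and the two small test graphs are assumptions made for illustration. Edges are numbered from 0.

```python
from fractions import Fraction

def null_space(A):
    """Basis of {y : A y = 0}, via reduced row echelon form over the rationals."""
    rows = [[Fraction(v) for v in row] for row in A]
    n_cols = len(rows[0])
    pivots = []                      # pivot column of each reduced row, in order
    r = 0
    for col in range(n_cols):
        if r == len(rows):
            break
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        pv = rows[r][col]
        rows[r] = [v / pv for v in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append(col)
        r += 1
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        y = [Fraction(0)] * n_cols
        y[free] = Fraction(1)
        for row_idx, pc in enumerate(pivots):
            y[pc] = -rows[row_idx][free]
        basis.append(y)
    return basis

def invariant_unknowns(A):
    """Indices e such that y_e = 0 for every circulation y (Theorem 1)."""
    basis = null_space(A)
    return [e for e in range(len(A[0])) if all(y[e] == 0 for y in basis)]

# Assumed example 1: triangle 0-1-2 (edges 0, 1, 2) plus pendant edge 3 = (2, 3).
triangle_with_pendant = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
# Assumed example 2: 4-cycle 0-1-2-3 (edges 0..3) plus pendant edge 4 = (0, 4).
square_with_pendant = [
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
print(invariant_unknowns(triangle_with_pendant))   # [0, 1, 2, 3]
print(invariant_unknowns(square_with_pendant))     # [4]
```

In the second graph the even cycle carries the circulation $(-1, 1, -1, 1, 0)$, so only the pendant edge is invariant.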
[Figure: example circulations on the graph, with edge values $\pm\alpha$, $-2\alpha$ and $+3\alpha$]

The supports of minimal circulations are called circuits; they are the even cycles and the L-oddsets of the graph.

[Figure: examples of circuits, with edge values $\pm\alpha$ and $\pm 2\alpha$]

Every circulation $y$ can be written as

\[
y = \sum_{i=1}^{p} \alpha_i\, y_i, \qquad \alpha_i \in \mathbb{R},
\]

where $B = \{y_1, \dots, y_p\}$ is a basis of the null space $N$ of $A$ and the support of each $y_i$ is a circuit of $G$.

[Figure: a circulation decomposed into circuits, with coefficients $\alpha$ and $\beta$]

Theorem 2: the unknown $x_e$ is an $\mathbb{R}$-invariant if and only if $e \notin C$ for every circuit $y_i$ with support $C$.

Proof: if $e$ belongs to no circuit, then for every circulation $y$ we have $y_e = \sum_{i=1}^{p} \alpha_i\, y_{i,e} = 0$, and the claim follows from Theorem 1.

An odd edge is an edge of $G$ belonging to every odd cycle of $G$. A bridge is an edge of $G$ whose removal disconnects $G$.

Theorem 3: the unknown $x_e$ is an $\mathbb{R}$-invariant if and only if $e$ is an odd edge or $e$ is a bridge that disconnects a bipartite component of $G$.

Proof:
1) If $e$ belongs to all odd cycles of $G$, then $G$ cannot contain an L-oddset.
2) If $e$ is a bridge, then it cannot belong to an even cycle.

The case when $e$ is an odd edge. Suppose, for contradiction, that $D$ is an even cycle containing $e$, and let $C$ be an odd cycle of $G$, which contains $e$ because $e$ is an odd edge. Then $D \,\Delta\, C$ is a set of edge-disjoint cycles not containing $e$, and

\[
|D \,\Delta\, C| = |D| + |C| - 2\,|D \cap C| .
\]

Hence $|D \,\Delta\, C|$ is odd, and $D \,\Delta\, C$ must contain at least one odd cycle (contradiction).

The case when $e$ is a bridge disconnecting a bipartite component.
[Figure: a bridge $e$ joining a non-bipartite component and a bipartite component]

Let $H$ be the graph defined by

$E(H) = \{\, e : e \text{ is a bridge of } G \,\}$
$V(H) = \{\, v : v \text{ is a connected component of } G - E(H) \,\}$

[Figure: the graph $G$ and the corresponding graph $H$]

[Figures: Steps 1 to 8 of the procedure, illustrated on the graph]

A DFS traversal of a graph gives a partition of the edges of $G$ into tree edges and back edges. Each back edge $e$ generates a cycle $C(e)$. The cycle $C(e)$ is called a fundamental cycle with respect to the tree $T$.

Proposition: every cycle of $G$ can be obtained as the symmetric difference of one or more fundamental cycles.

If $e$ is an odd edge, then
1) it must belong to every fundamental odd cycle of $G$;
2) no fundamental even cycle of $G$ contains $e$.

A back edge $e$ belongs to every fundamental odd cycle of $G$ if and only if $C(e)$ is the only fundamental odd cycle. For every tree edge $e$ we count the number of odd and even fundamental cycles containing $e$.
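The counting step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of the slides: the graph is assumed connected and simple, vertices are numbered from 0, edges are given as a list of pairs, and the function names are hypothetical. Each back edge is handled by walking up the DFS tree, which takes time proportional to the cycle length; a linear-time version would accumulate the counts along the tree instead. An edge is reported when it lies on every odd fundamental cycle and on no even one, and the graph has at least one odd fundamental cycle.

```python
def fundamental_cycle_counts(n, edges):
    """For each edge, count the odd and even fundamental cycles (with
    respect to a DFS tree rooted at vertex 0) that contain it."""
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    depth = [-1] * n
    parent = [None] * n                  # (parent vertex, tree edge index)
    is_tree = [False] * len(edges)
    depth[0] = 0
    stack = [(0, iter(adj[0]))]          # iterative DFS
    while stack:
        u, neighbours = stack[-1]
        for v, idx in neighbours:
            if depth[v] == -1:           # unvisited: (u, v) becomes a tree edge
                depth[v] = depth[u] + 1
                parent[v] = (u, idx)
                is_tree[idx] = True
                stack.append((v, iter(adj[v])))
                break
        else:
            stack.pop()
    odd = [0] * len(edges)
    even = [0] * len(edges)
    n_odd = 0
    for idx, (u, v) in enumerate(edges):
        if is_tree[idx]:
            continue
        if depth[u] < depth[v]:
            u, v = v, u                  # make u the descendant, v the ancestor
        length = depth[u] - depth[v] + 1 # number of edges of C(e)
        counts = odd if length % 2 else even
        n_odd += length % 2
        counts[idx] += 1                 # the back edge itself
        while u != v:                    # tree edges on the path from u up to v
            u, tree_edge = parent[u]
            counts[tree_edge] += 1
    return odd, even, n_odd

def odd_edges(n, edges):
    """Edges lying on every odd fundamental cycle and on no even one."""
    odd, even, n_odd = fundamental_cycle_counts(n, edges)
    if n_odd == 0:                       # no odd cycle at all
        return []
    return [i for i in range(len(edges)) if odd[i] == n_odd and even[i] == 0]

# Assumed example: two triangles 0-1-2 and 1-2-3 sharing edge 1 = (1, 2).
two_triangles = [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3)]
print(odd_edges(4, two_triangles))       # [1]
```

In the example, the shared edge is the only one that lies on both odd cycles and avoids the even cycle 0-1-3-2, so it is the only edge reported.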