University of Rome
“La Sapienza”
Computer Science Department
Mauro Mezzini
ANSWERING SUM-QUERIES:
A SECURE AND EFFICIENT
APPROACH
Introduction
Statistical database: users are allowed to ask only for statistical
information, through sum, count, average, max and min queries on a
numerical attribute.
Table Retail:

PRODUCT      SALES (€)
storage      90000
router       30000
server       30000
mainframe    25000

select sum(SALES)
from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'router';

r = 120.000
Introduction
Definition: the target K of a query q is the set of categories selected
by the where clause of q.

select sum(SALES)
from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'router';

K = {storage, router}
The efficiency issue
To speed up the answer of a sum-query, the query system is
endowed with a set of pre-computed sum-queries called the set of
materialised views.
q1:
select sum(SALES)
from Retail;
r1 = 175.000

q2:
select sum(SALES)
from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'router';
r2 = 120.000

q:
select sum(SALES)
from Retail
where PRODUCT = 'server'
   or PRODUCT = 'mainframe';
r = r1 - r2 = 55.000
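The derivation r = r1 - r2 can be sketched in Python. This is a minimal illustration, not the query system of the talk: the `SALES` dictionary mirrors the Retail table, while `views` and `answer_from_views` are hypothetical names, and only a single difference of two stored views is attempted.

```python
# Sales per product category, copied from the Retail table.
SALES = {"storage": 90000, "router": 30000, "server": 30000, "mainframe": 25000}

def sum_query(target):
    """Answer a sum-query by scanning the base table (the slow path)."""
    return sum(SALES[p] for p in target)

# Materialised views: pre-computed results keyed by their targets.
# `views` and `answer_from_views` are hypothetical names for this sketch.
views = {
    frozenset(SALES): sum_query(SALES),                                  # q1
    frozenset({"storage", "router"}): sum_query({"storage", "router"}),  # q2
}

def answer_from_views(target):
    """Return the result if target is a view or a difference K1 - K2 of two views."""
    target = frozenset(target)
    if target in views:
        return views[target]
    for k1, r1 in views.items():
        for k2, r2 in views.items():
            if k2 < k1 and k1 - k2 == target:
                return r1 - r2
    return None  # not derivable by a single difference

r = answer_from_views({"server", "mainframe"})
```

Here `r` is obtained from the two stored results only, without reading the SALES of server or mainframe.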
Protection issue
To protect the confidentiality of the numerical attribute, the query
system is endowed with the list of all sensitive categories.
PRODUCT      SALES (€)
storage      90000
router       30000
server       30000
mainframe    25000

q1:
select sum(SALES) from Retail
where PRODUCT = 'storage';

q2:
select sum(SALES) from Retail
where PRODUCT = 'router';
Protection issue
q1:
select sum(SALES) from Retail
where PRODUCT = 'router'
   or PRODUCT = 'server';
r1 = 60.000

q2:
select sum(SALES) from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'server';
r2 = 120.000

q3:
select sum(SALES) from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'router';
r3 = 120.000

With x1, x2, x3 the SALES of router, server and storage:
x1 + x2 = r1
x2 + x3 = r2
x1 + x3 = r3

The value of all confidential information can be inferred from the
answers to the non-confidential queries {q1, q2, q3}.
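The inference can be checked by solving the three equations. The sketch below is an illustration under assumptions: `solve` is a hypothetical helper (plain Gauss-Jordan elimination over exact fractions), and the right-hand sides are computed from the Retail table, with x1 = router, x2 = server, x3 = storage.

```python
from fractions import Fraction

def solve(A, r):
    """Solve a square, non-singular system A x = r by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(A, r)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if M[i][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for i in range(n):
            if i != col and M[i][col] != 0:
                factor = M[i][col]
                M[i] = [a - factor * b for a, b in zip(M[i], M[col])]
    return [row[-1] for row in M]

sales = {"storage": 90000, "router": 30000, "server": 30000}
# Unknowns: x1 = router, x2 = server, x3 = storage.
A = [[1, 1, 0],   # q1: router + server
     [0, 1, 1],   # q2: server + storage
     [1, 0, 1]]   # q3: router + storage
r = [sales["router"] + sales["server"],
     sales["server"] + sales["storage"],
     sales["router"] + sales["storage"]]
x = solve(A, r)   # recovers the three individual SALES values
```

Because the coefficient matrix is non-singular, the three answers determine every single-category value.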
The inference model
Efficiency:
Given a set of sum-queries V = {q1,…,qn}, determine whether the
result of q can be inferred from V.
Protection:
Given a set of sum-queries V = {q1,…,qn}, determine for
every inferable sum-query q whether the result of q is sensitive
information.
The inference model
Let V = {q1, q2, …, qn}.
Let Ki and ri be the target and the result of qi, respectively.
Let P = {C1, C2, …, Cm} be the coarsest partition of $\bigcup_{i=1,\dots,n} K_i$ such that each
Ki can be obtained as the union of one or more elements of P.
The inference model is based on the following system of linear constraints:
$$\sum_{j=1,\dots,m} a_{i,j}\, x_j = r_i \quad (i = 1,\dots,n), \qquad x \in F^m \tag{1}$$
where $a_{i,j} = 1$ if $C_j \subseteq K_i$ and $a_{i,j} = 0$ otherwise,
and F is the domain of the numerical attribute.
The inference model. An example
q1:
select sum(SALES) from Retail
where PRODUCT = 'router'
   or PRODUCT = 'server';
K1 = {router, server}
r1 = 60.000

q2:
select sum(SALES) from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'server';
K2 = {storage, server}
r2 = 120.000

q3:
select sum(SALES) from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'router';
K3 = {storage, router}
r3 = 120.000

C1 = {router}
C2 = {server}
C3 = {storage}

x1 + x2 = r1
x2 + x3 = r2
x1 + x3 = r3
F is the set of non-negative reals
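The construction of the partition P and of the coefficients of system (1) can be sketched as follows. This is one possible reading of the definition, not the author's implementation: `coarsest_partition` and `constraint_matrix` are hypothetical names, and two categories are placed in the same class exactly when they occur in the same targets.

```python
from collections import defaultdict

def coarsest_partition(targets):
    """Group categories that occur in exactly the same targets."""
    classes = defaultdict(set)
    for category in set().union(*targets):
        signature = tuple(category in K for K in targets)
        classes[signature].add(category)
    # Sort for a deterministic order of C1, ..., Cm.
    return sorted((frozenset(c) for c in classes.values()), key=sorted)

def constraint_matrix(targets, partition):
    """a[i][j] = 1 if Cj is a subset of Ki, else 0."""
    return [[1 if C <= K else 0 for C in partition] for K in targets]

targets = [{"router", "server"}, {"storage", "server"}, {"storage", "router"}]
P = coarsest_partition(targets)
A = constraint_matrix(targets, P)
```

For the three targets of the example, `P` lists {router}, {server}, {storage} in that order and `A` holds the coefficients of the three equations.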
The inference model
Definition: Given a subset S of {1,2,…,m}, the sum-expression
$\sum_{j \in S} x_j$
is an F-invariant if it takes on the same value for every solution
x of (1).
An F-invariant sum is the result of the sum-query with target
$\bigcup_{j \in S} C_j$
The inference model
Definitions: Given the partition P = {C1,…,Cm} and a query q
with target K, consider the two sets:
$S = \{\, j : C_j \subseteq K \,\}$
the support of q
$\bar{S} = \{\, j : C_j \cap K \neq \emptyset \text{ and } C_j - K \neq \emptyset \,\}$
the cosupport of q
The sum
$\sum_{j \in S \cup \bar{S}} x_j$
is called the sum-expression associated to q.
The inference model. An example
q1:
select sum(SALES) from Retail
where PRODUCT = 'router'
   or PRODUCT = 'server';
K1 = {router, server}
r1 = 60.000

q2:
select sum(SALES) from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'server';
K2 = {storage, server}
r2 = 120.000

q3:
select sum(SALES) from Retail
where PRODUCT = 'storage'
   or PRODUCT = 'router';
K3 = {storage, router}
r3 = 120.000

C1 = {router}
C2 = {server}
C3 = {storage}

x1 + x2 = r1
x2 + x3 = r2
x1 + x3 = r3

q:
select sum(SALES) from Retail
where PRODUCT = 'storage';
K = {storage}

The support of q is {3}, the cosupport is empty, and the sum-expression
associated to q is trivially
x3
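The support and the cosupport of a query can be computed directly from their definitions. A minimal sketch, assuming the partition is given as a list of sets indexed from 1; the function names are hypothetical.

```python
def support(partition, K):
    """Indices j (1-based) with Cj contained in K."""
    return {j for j, C in enumerate(partition, start=1) if C <= K}

def cosupport(partition, K):
    """Indices j (1-based) with Cj overlapping K without being contained in it."""
    return {j for j, C in enumerate(partition, start=1) if C & K and C - K}

P = [{"router"}, {"server"}, {"storage"}]
K = {"storage"}
S = support(P, K)        # indices whose classes lie inside K
S_bar = cosupport(P, K)  # indices whose classes straddle the border of K
```

A non-empty cosupport arises only when some class of P is split by K, for instance P = [{a, b}, {c}] with K = {a, c}.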
Problems definitions
1) Given a sum-expression $\sum_{j \in S} x_j$, decide whether it is an F-invariant.
2) Given a sum-expression $\sum_{j \in S} x_j$ that is not an F-invariant,
find a nonempty subset S' of S such that $\sum_{j \in S'} x_j$ is an F-invariant.
Problem (2)
Let S be a subset of {1,…,m} and let s be the characteristic
vector of S. Then
$$s(i) = \begin{cases} 1 & \text{if } i \in S \\ 0 & \text{if } i \notin S \end{cases} \qquad i = 1,\dots,m$$
Problem (2)
We can rewrite system (1) as: $A x = r$, $x \in F^m$
An m-dimensional vector f is a linear combination of the rows
of A if
$f = \sum_{i=1,\dots,n} \alpha_i\, a_i$
where $\alpha_i \in \mathbb{R}$ and
$a_i$ is the i-th row of A,
for $i = 1,\dots,n$
Problem (2)
Definition: A subset S of {1,2,…,m} is said to be algebraic if
its characteristic vector can be expressed as a linear
combination of the rows of the matrix A.
If F is R or Z, then $\sum_{j \in S} x_j$ is F-invariant if and
only if S is algebraic.
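Whether a subset S is algebraic can be tested by comparing ranks: the characteristic vector of S lies in the row space of A exactly when appending it to A does not increase the rank. The sketch below uses exact rational arithmetic; `rank` and `is_algebraic` are hypothetical helper names, and indices in S are 1-based as in the slides.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix over the rationals, by Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    rk = 0
    for col in range(len(M[0]) if M else 0):
        pivot = next((i for i in range(rk, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[rk], M[pivot] = M[pivot], M[rk]
        for i in range(rk + 1, len(M)):
            factor = M[i][col] / M[rk][col]
            M[i] = [a - factor * b for a, b in zip(M[i], M[rk])]
        rk += 1
    return rk

def is_algebraic(A, S):
    """True if the characteristic vector of S lies in the row space of A."""
    m = len(A[0])
    s = [1 if j in S else 0 for j in range(1, m + 1)]
    return rank(A + [s]) == rank(A)

# Only two of the three example queries are available here.
A = [[1, 1, 0],
     [0, 1, 1]]
```

With only q1 and q2 available, {1, 2} is algebraic (it is the first row), whereas {1} is not, so x1 alone cannot be inferred.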
The NAS Problem
Problem definition: Given a sum-expression
$\sum_{j \in S} x_j$
that is not R-invariant, find a non-empty algebraic subset of S
(NAS Problem).
NAS Problem: find a non-empty subset F of S such that the
characteristic vector of F is expressible as a linear combination of
the rows of A.
The NAS Problem
The subset sum problem (SSP):
Given a set S = {1,…,p} and a mapping
$a : S \to \mathbb{Z}$
such that
a(i) > 0 for i = 1,…,p-1 and
a(i) < 0 for i = p,
find a nonempty subset F of S such that
$\sum_{i \in F} a(i) = 0$
The NAS Problem
Let c be a q-dimensional vector, with q ≥ p, such that
c(1) = a(1)
c(2) = a(2)
…
c(p) = a(p)
and
$c(j) \in \mathbb{R}$ for $p < j \le q$
Let M = (I, c) be the q × (q+1) matrix obtained from c, where I is the q × q identity matrix.
The NAS Problem
Example: let S={1, 2, 3, 4} and
a(1) = 1
a(2) = 2
a(3) = 5
a(4)= -7
The subset F = { 2, 3, 4} of S is a solution of the SSP since
a(2) + a(3) + a(4) = 2 + 5 – 7 = 0.
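For small instances the SSP can be checked by exhaustive search. The sketch below is only a brute-force illustration (exponential in p), with `subset_sum_zero` a hypothetical name; the mapping a is passed as a dictionary.

```python
from itertools import combinations

def subset_sum_zero(a):
    """Return a nonempty subset F of the keys of `a` whose values sum to 0, or None."""
    items = sorted(a)
    for size in range(1, len(items) + 1):
        for F in combinations(items, size):
            if sum(a[i] for i in F) == 0:
                return set(F)
    return None

a = {1: 1, 2: 2, 3: 5, 4: -7}
F = subset_sum_zero(a)
```

On the example instance the search returns the subset {2, 3, 4} given above.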
The NAS Problem
If we choose q = 5, the vector c is (1, 2, 5, -7, α) and the
matrix M is
1 0 0 0 0 |  1
0 1 0 0 0 |  2
0 0 1 0 0 |  5
0 0 0 1 0 | -7
0 0 0 0 1 |  α
The NAS Problem
The vector $\hat{c} = (-c, 1)$ is a solution of the equation
M y = 0
that is, of the system
y1 + 1 y6 = 0
y2 + 2 y6 = 0
y3 + 5 y6 = 0
y4 - 7 y6 = 0
y5 + α y6 = 0
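The claim that (-c, 1) solves M y = 0 can be verified numerically. A minimal sketch: `build_M` and `mat_vec` are hypothetical helpers, and the value chosen for α is an arbitrary assumption, since any real number works.

```python
def build_M(c):
    """Return the q x (q+1) matrix (I, c): the identity with c appended as last column."""
    q = len(c)
    return [[1 if i == j else 0 for j in range(q)] + [c[i]] for i in range(q)]

def mat_vec(M, y):
    """Matrix-vector product M y."""
    return [sum(m * v for m, v in zip(row, y)) for row in M]

alpha = 3                        # assumption: any real value may be used here
c = [1, 2, 5, -7, alpha]
M = build_M(c)
c_hat = [-v for v in c] + [1]    # the vector (-c, 1)
residual = mat_vec(M, c_hat)     # one entry per row of M
```

Row i of the product is -c(i) + c(i), so every entry of `residual` vanishes.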
The NAS Problem
Theorem: Given the matrix M and the set S = {1,…,p}, the
SSP has a solution if and only if there exists a nonempty algebraic
subset of S.
Proof
The (q+1)-dimensional vector $\hat{c} = (-c, 1)$ spans the null space
of M, that is, the set of solutions of
M y = 0
and the null space of M has dimension equal to one.
The NAS Problem
If $F \subseteq S$ is an algebraic set, then its characteristic vector f is
expressible as a linear combination of the rows of M. Since f and $\hat{c} = (-c, 1)$
are orthogonal,
$\sum_{i=1,\dots,q+1} f(i)\, \hat{c}(i) = 0$
that is
$0 = \sum_{i \in F} \hat{c}(i) = -\sum_{i \in F} a(i)$
qed.
The NAS Problem
Example: let S = {1, 2, 3} and
a(1) = 2
a(2) = 2
a(3) = -4
then
c0 = (2, 2, -4)
c1 = (-1, -1, 1)
c2 = (-1, -1, 1)
c3 = (2, 2, -2, 1, 1)
let
c = (c0, c1, c2, c3)
The NAS Problem
Then M would be
1 0 0 0 0 0 0 0 0 0 0 0 0 0 |  2   c0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 |  2   c0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 | -4   c0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 | -1   c1
0 0 0 0 1 0 0 0 0 0 0 0 0 0 | -1   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 0 0 0 0 0 0 | -1   c2
0 0 0 0 0 0 0 1 0 0 0 0 0 0 | -1   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 1 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 0 1 0 0 | -2   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3
The NAS Problem
Step (1)
1 0 0 1 1 0 0 0 0 0 0 0 0 0 |  0   c0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 |  2   c0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 | -4   c0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 | -1   c1
0 0 0 0 1 0 0 0 0 0 0 0 0 0 | -1   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 0 0 0 0 0 0 | -1   c2
0 0 0 0 0 0 0 1 0 0 0 0 0 0 | -1   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 1 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 0 1 0 0 | -2   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3
The NAS Problem
Step (3)
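1 0 0 1 1 0 0 0 0 0 0 0 0 0 |  0   c0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 |  2   c0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 | -4   c0
0 0 0 1 0 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 1 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 0 0 0 0 0 0 | -1   c2
0 0 0 0 0 0 0 1 0 0 0 0 0 0 | -1   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 1 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 0 1 0 0 | -2   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3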
The NAS Problem
Step (4)
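1 0 0 1 1 0 0 0 0 0 0 0 0 0 |  0   c0
0 1 0 0 0 0 1 1 0 0 0 0 0 0 |  0   c0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 | -4   c0
0 0 0 1 0 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 1 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 1 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 1 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 0 1 0 0 | -2   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3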
The NAS Problem
Step (5)
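1 0 0 1 1 0 0 0 0 0 0 0 0 0 |  0   c0
0 1 0 0 0 0 1 1 0 0 0 0 0 0 |  0   c0
0 0 1 0 0 0 0 0 0 1 1 0 0 0 |  0   c0
0 0 0 1 0 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 1 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 1 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 1 0 0 0 |  2   c3
0 0 0 0 0 0 0 0 0 0 0 1 0 0 | -2   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3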
The NAS Problem
Step (6)
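1 0 0 1 1 0 0 0 0 0 0 0 0 0 |  0   c0
0 1 0 0 0 0 1 1 0 0 0 0 0 0 |  0   c0
0 0 1 0 0 0 0 0 0 1 1 0 0 0 |  0   c0
0 0 0 1 0 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 1 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 1 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 1 0 0 |  0   c3
0 0 0 0 0 0 0 0 0 0 1 1 0 0 |  0   c3
0 0 0 0 0 0 0 0 0 0 0 1 0 0 | -2   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3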
The NAS Problem
Final step
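1 0 0 1 1 0 0 0 0 0 0 0 0 0 |  0   c0
0 1 0 0 0 0 1 1 0 0 0 0 0 0 |  0   c0
0 0 1 0 0 0 0 0 0 1 1 0 0 0 |  0   c0
0 0 0 1 0 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 1 1 0 0 0 0 0 0 0 0 |  0   c1
0 0 0 0 0 1 0 0 0 0 0 0 0 0 |  1   c1
0 0 0 0 0 0 1 0 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 1 1 0 0 0 0 0 |  0   c2
0 0 0 0 0 0 0 0 1 0 0 0 0 0 |  1   c2
0 0 0 0 0 0 0 0 0 1 0 1 0 0 |  0   c3
0 0 0 0 0 0 0 0 0 0 1 1 0 0 |  0   c3
0 0 0 0 0 0 0 0 0 0 0 1 1 1 |  0   c3
0 0 0 0 0 0 0 0 0 0 0 0 1 0 |  1   c3
0 0 0 0 0 0 0 0 0 0 0 0 0 1 |  1   c3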
The NAS Problem
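For a(i) > 1 and i = 1,…,p-1:
$$c_i = \left( -\left\lfloor \tfrac{a(i)}{2} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2} \right\rfloor, \left\lfloor \tfrac{a(i)}{2} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2^2} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2^2} \right\rfloor, \left\lfloor \tfrac{a(i)}{2^2} \right\rfloor, \dots, -\left\lfloor \tfrac{a(i)}{2^{k_i}} \right\rfloor, -\left\lfloor \tfrac{a(i)}{2^{k_i}} \right\rfloor, \left\lfloor \tfrac{a(i)}{2^{k_i}} \right\rfloor \right)$$
where $k_i = \lfloor \log_2 a(i) \rfloor$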
The NAS Problem
a(i) = 7
$k_i = \lfloor \log_2 7 \rfloor = 2$
ci = ( -3, -3, 3, -1, -1, 1 )
a(i) = 8
$k_i = \lfloor \log_2 8 \rfloor = 3$
ci = ( -4, -4, 4, -2, -2, 2, -1, -1, 1 )
The NAS Problem
B = max{ |a(i)| : i = 1,…,p }
The SSP has input dimension equal to O( p × log2(B) ).
$k_i \le \log_2(B)$
The dimension of the matrix M is q × (q+1), where
$q \le (p + 1) \times 3 \log_2(B) = O(p \times \log_2(B))$
Solving problem (1)
$A x = r$, $x \in F^m$
Is $\sum_{j \in S} x_j$ an F-invariant?
We consider the case where
A is the vertex-edge incidence matrix of a graph,
F is the set of reals and
S is a singleton.
[Figure: example graph with vertices labelled r1,…,r6 and edges labelled x1,…,x8.]
Solving problem (1)
Consider the homogeneous system associated to system (1)
$A y = 0$, $y \in \mathbb{R}^m$   (2)
We call a solution y of system (2) a circulation.
[Figure: example of a circulation; each edge carries the value 0, +α or -α.]
Solving problem (1)
Definition: given a circulation y, its support is the set
$C = \{\, e : y_e \neq 0 \,\}$
[Figure: a circulation with edge values +α, -α and 0; its support is the set of edges with non-zero value.]
Solving problem (1)
Theorem 1: The unknown $x_e$ is an R-invariant if and only if,
for every circulation y with support C, $e \notin C$.
Proof:
Let x* be a particular solution of (1). Then every solution x of (1) can be written as
x = x* + y
for some circulation y.
So if $y_e = 0$ for every circulation y, then $x_e = x^*_e$ for every solution x of (1).
Conversely, if $x_e$ is invariant then
$x_e - x^*_e = 0 = y_e$
for every solution x of (1). Therefore $y_e = 0$ for every circulation y.
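Theorem 1 suggests a direct, if inefficient, test: compute a basis of the circulations and check which coordinates vanish in all of them. The sketch below is an assumption-laden illustration, not the method developed in the talk; `null_space` and `invariant_edges` are hypothetical names, edges are indexed from 0, and the example graph (a 4-cycle with one pendant edge) is invented for the purpose.

```python
from fractions import Fraction

def null_space(A):
    """Basis of {y : A y = 0} over the rationals, via reduced row echelon form."""
    M = [[Fraction(v) for v in row] for row in A]
    n_cols = len(M[0])
    pivots, r = [], 0
    for col in range(n_cols):
        p = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        M[r] = [v / M[r][col] for v in M[r]]
        for i in range(len(M)):
            if i != r and M[i][col] != 0:
                f = M[i][col]
                M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        pivots.append(col)
        r += 1
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        y = [Fraction(0)] * n_cols
        y[free] = Fraction(1)
        for row, pc in zip(M, pivots):
            y[pc] = -row[free]
        basis.append(y)
    return basis

def invariant_edges(A):
    """0-based indices e such that y[e] == 0 for every circulation y."""
    basis = null_space(A)
    return [e for e in range(len(A[0])) if all(y[e] == 0 for y in basis)]

# Vertex-edge incidence matrix of a 4-cycle v1-v2-v3-v4 plus the pendant edge v1-v5.
# Columns: (v1,v2), (v2,v3), (v3,v4), (v4,v1), (v1,v5).
A = [[1, 0, 0, 1, 1],
     [1, 1, 0, 0, 0],
     [0, 1, 1, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 0, 1]]
```

In this graph the only circulation alternates sign around the even cycle, so only the pendant edge (index 4) is reported as invariant.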
Solving problem (1)
Definition: A circulation y with support C is minimal if
there is no circulation with support C' such that $C' \subset C$.
[Figure: example circulations with edge values such as +α, -α, -2α and +3α.]
Solving problem (1)
The supports of minimal circulations are called circuits; they
are the even cycles and the L-oddsets of the graph.
[Figure: examples of circuits, with edge values ±α and ±2α.]
Solving problem (1)
Given a circulation y, then
$y = \sum_{i=1,\dots,p} \alpha_i\, y_i$
where
$\alpha_i \in \mathbb{R}$,
B = {y1,…, yp} is a basis of N, and
each yi is a circuit of G.
Solving problem (1)
[Figure: a circulation with edge values +α, -α, +2α, +β and -β.]
Solving problem (1)
Theorem 2: The unknown $x_e$ is an R-invariant if and only
if, for every circuit $y_i$ with support C, $e \notin C$.
Proof:
$y_e = \sum_{i=1,\dots,p} \alpha_i\, y_{i,e} = 0$
Solving problem (1)
An odd edge is an edge of G belonging to every odd
cycle of G.
A bridge is an edge of G whose removal disconnects G.
Solving problem (1)
Theorem 3: The unknown $x_e$ is an R-invariant if and only if e
is an odd edge or is a bridge whose removal disconnects a bipartite
component of G.
Proof:
1) If e belongs to all odd cycles of G, then G cannot contain
an L-oddset.
2) If e is a bridge, then it cannot belong to an even cycle.
Solving problem (1)
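The case when e is an odd edge.
Suppose, for contradiction, that D is an even cycle containing e, and
let C be an odd cycle of G (so C also contains e).
$D \,\triangle\, C$ is a set of edge-disjoint cycles not containing e.
$|D \,\triangle\, C| = |D| + |C| - 2\,|D \cap C|$
$|D \,\triangle\, C|$ is odd, so $D \,\triangle\, C$ must contain at least one odd
cycle, which does not contain e (contradiction).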
Solving problem (1)
The case when e is a bridge disconnecting a bipartite
component.
[Figure: the bridge e joins a non-bipartite component to a bipartite component.]
Solving problem (1)
Let H be the graph with
E(H) = { e : e is a bridge of G }
V(H) = { v : v is a connected component of G - E(H) }
[Figure: a graph G and the corresponding graph H.]
Solving problem (1)
Steps 1 to 8 [figure sequence; images not recoverable]
Solving problem (1)
A DFS traversal of a graph gives a partition of the edges of G into
tree edges
back edges
Each back edge e generates a cycle C(e).
The cycle C(e) is called a fundamental cycle with respect to
the tree T.
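The fundamental cycles can be collected during the DFS itself. The sketch below assumes a simple connected undirected graph given as adjacency lists; `fundamental_cycles` is a hypothetical name, and the example graph (a triangle and a square sharing vertex 3) is invented for illustration.

```python
def fundamental_cycles(adj, root):
    """DFS from `root`; return (tree_edges, cycles), where `cycles` maps each
    back edge (u, v) to the list of edges of its fundamental cycle C(e)."""
    parent = {root: None}
    depth = {root: 0}
    tree_edges, cycles = [], {}

    def dfs(u):
        for v in adj[u]:
            if v not in parent:                            # tree edge
                parent[v], depth[v] = u, depth[u] + 1
                tree_edges.append((u, v))
                dfs(v)
            elif v != parent[u] and depth[v] < depth[u]:   # back edge to an ancestor
                path, w = [], u
                while w != v:                              # climb the tree from u up to v
                    path.append((parent[w], w))
                    w = parent[w]
                cycles[(u, v)] = path + [(u, v)]

    dfs(root)
    return tree_edges, cycles

# Triangle 1-2-3 and square 3-4-5-6 sharing vertex 3.
adj = {1: [2, 3], 2: [1, 3], 3: [2, 1, 4, 6], 4: [3, 5], 5: [4, 6], 6: [5, 3]}
tree_edges, cycles = fundamental_cycles(adj, 1)
lengths = {e: len(c) for e, c in cycles.items()}
```

The parity of each fundamental cycle is the parity of its edge list, so `lengths` separates the odd triangle from the even square.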
Solving problem (1)
Proposition: every cycle of G can be obtained as the
symmetric difference of one or more fundamental cycles.
If e is an odd edge then
1) it must belong to every fundamental odd cycle of G
2) no fundamental even cycle of G contains e
Solving problem (1)
A back edge e belongs to every fundamental odd cycle of
G if and only if C(e) is the only fundamental odd cycle.
For every tree edge e we count the number of odd and
even fundamental cycles containing e.
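The counting step can be sketched as follows. The function only applies the two necessary conditions stated for odd edges, so its output is a set of candidates rather than a proof; `odd_edge_candidates` is a hypothetical name, the cycles are passed as edge lists keyed by back edge, and the case of a graph without odd cycles is deliberately left out.

```python
from collections import Counter

def odd_edge_candidates(cycles):
    """Edges lying on every odd fundamental cycle and on no even one.
    `cycles` maps each back edge to the edge list of its fundamental cycle."""
    odd, even = Counter(), Counter()
    n_odd = 0
    for cycle in cycles.values():
        is_odd = len(cycle) % 2 == 1
        n_odd += is_odd
        for e in cycle:
            (odd if is_odd else even)[e] += 1
    if n_odd == 0:
        return set()   # assumption: the bipartite case is handled separately
    return {e for e in odd if odd[e] == n_odd and even[e] == 0}

# Fundamental cycles of a triangle 1-2-3 and a square 3-4-5-6 sharing vertex 3.
cycles = {
    (3, 1): [(2, 3), (1, 2), (3, 1)],
    (6, 3): [(5, 6), (4, 5), (3, 4), (6, 3)],
}
candidates = odd_edge_candidates(cycles)
```

In this example the three triangle edges are the candidates; with two triangles sharing an edge, only the shared edge would remain.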