IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-8, NO. 6, NOVEMBER 1982
574
Auditing and Inference Control in
Statistical Databases
FRANCIS Y. CHIN, MEMBER, IEEE, AND GULTEKIN OZSOYOGLU, MEMBER, IEEE
Abstract-A statistical database (SDB) may be defined as an ordinary
database with the capability of providing statistical information to
user queries. The security problem for the SDB is to limit the use of
the SDB so that only statistical information is available and no sequence
of queries is sufficient to infer protected information about any
individual. When such information is obtained, the SDB is said to be
compromised.
Inference control mechanisms are internal protection mechanisms
applied to SDB's. Many researchers have studied different protection
mechanisms to prevent an SDB from being compromised. However,
most of these mechanisms are either ineffective or inefficient or are
only applicable to large SDB's. Auditing in SDB's is initially proposed
in the form of investigating log trails manually. In this paper, we
present a practical technique for managing the past history of user's
queries, discuss how the sequence of all the answered queries of the
SDB can be reduced and stored in finite storage, and describe how this
storage scheme can provide an effective way of checking compromise.
We believe that this will help us develop a more practical and efficient
tool for protection in a small SDB than the previously known
mechanisms. Further, we extend the idea to checking compromise of a
set of queries in a more efficient way than one query at a time. We
also show that the problem of maximizing the amount of information
to the users without compromising the SDB is NP-complete.
Index Terms-Auditing, inference control, security, statistical databases.
I. INTRODUCTION
THE PROBLEM of enhancing the security of statistical
databases (SDB's) has been of growing concern in recent
years [15], [17], [23], [29], [33]. An SDB has been defined
as one which returns statistical information, such as frequency
counts of records satisfying some given criteria, as opposed to
a database which returns details of an entity, for example,
name and address of an employee. Statistical databases have
wide applicability in areas such as medical research, health
planning, and political planning.
The security problem for an SDB is to limit its use so that
only statistical information is available and no sequence of
queries is sufficient to derive confidential information about
any individual. When such information is obtained, the database is said to be compromised.
Manuscript received March 5, 1981; revised February 17, 1982. This
work was supported in part by the National Sciences and Engineering
Research Council under Grant A4319. A preliminary version of this
paper was presented at the ACM '81 Annual Conference.
F. Y. Chin is with the Department of Computer Science, University
of Alberta, Edmonton, Alta., Canada.
G. Ozsoyoglu is with the Department of Computer and Information
Science, Cleveland State University, Cleveland, OH 44115.
Inference control mechanisms are internal protection mechanisms applied to SDB's. SDB protection mechanisms can be
classified as follows [1]:
1) controlling the number of records satisfying the query
(query set) [4], [12], [14], [21], [22], [31];
2) limiting excessive overlap between query sets [8], [16],
[28];
3) partitioning the SDB [5], [18], [35];
4) modifying query responses and data, which includes output perturbation, data distortion, and random sampling [1],
[3], [10], [13], [24], [27], [32];
5) employing security constraints at the conceptual data
model level [6], [25], [26].
No one proposed protection mechanism is suitable for all
SDB's. Protection mechanism 1 is shown to be compromisable
[12], and protection mechanism 2 may not be feasible to
implement. Mechanism 3 may be overly restrictive and limits
the usefulness of the SDB. Mechanism 4, which employs output perturbation, data distortion, or random sampling may be
effective for large SDB's but sacrifices provision of precise
answers to user queries. Mechanism 5 may be applicable
when the implementation of a conceptual model is feasible
and the needs of the users are not very diverse. However, the
overhead may be considerable. The proposed design of the
conceptual data model is yet to be implemented, tested, and
evaluated. All of these five mechanisms are only good for
large SDB's and none of them will work for small SDB's. One
may argue that statistics are for large sets of data. However,
this may not be true for many applications. For example, in
medical research, an experiment may record the effects of
certain drugs on a small number of individuals (say, 100
individuals or less). Government regulations and company
practices normally limit the sample size of the experiment.
Different statistical analyses on various subgroups of the
individuals have to be performed, e.g., the individuals can be
classified according to sex, age, weight, height, profession,
salary, living conditions, diet, race, marital status, medical
history, education, etc. Obviously, some of this information
about each individual is strictly confidential. On the other
hand, very precise statistical information for many different
subgroups of individuals is needed to draw meaningful conclusions. Unfortunately, none of the existing protection mechanisms can meet all of these requirements.
Auditing in SDB's is also discussed in [22], [30]. Logs are
maintained to record all the requests made by users along with
the data involved. Logs are checked manually and periodically
for any misuse of the data. Auditing is also mentioned in [9].
0098-5589/82/1100-0574$00.75 © 1982 IEEE
CHIN AND OZSOYOGLU: AUDITING AND INFERENCE CONTROL
It has long been believed that auditing is an effective tool for
protection. The task of auditing may be delegated to the
database system so that the database system:
1) keeps track of the history of answered queries and
changes in the SDB, and
2) checks for possible compromise by every new query.
Obviously, auditing may serve as a solution to the SDB
security problem for small SDB's. It is also one of the better
protection mechanisms because it has the following features.
1) Absolute Security:¹ By checking the past history of all
the answered queries, auditing allows the SDB to answer a
query only when it is secure to do so.
2) Maximum Information: Given the previous querying
history of the SDB, auditing can provide the maximum information to the users. This includes accurate answers and as
many query answers to the user as the security of the SDB
permits.
3) Flexibility: It is more flexible to use because protection
can be tailored to different sets of queries of users' choice.
Auditing may also become feasible for large SDB's when it is
used together with the concept of compartmentalization [20].
Compartmentalization partitions individuals in the database
into groups, so that individuals of a group have the same
protection requirements. (Actually, mechanism 5 in [6], [26]
achieves this compartmentalization using the semantic information about individuals in the database as a tool to define
security atom populations.) Clearly, compartmentalization
reduces the size of the database, and thus, may make the
auditing of each group of individuals in a compartment
feasible.
Intuitively it seems infeasible to implement auditing because
it is necessary to store and process the accumulated information about the sequence of previously answered queries. Fortunately, there are certain important properties about the
history of the answered queries that can be used to simplify
the task of auditing. First, the order of the answered queries is
not important (assuming a static SDB). Thus the "set" of
answered queries can be used to replace the "sequence" of
answered queries. Second, there are usually redundancies in
the set of answered queries. Let us consider the following set
of queries as an example: q1 = COUNT (all persons), q2 =
COUNT (male persons), q3 = COUNT (employed male persons),
q4 = COUNT (female persons), q5 = COUNT (unemployed male
persons), q6 = COUNT (unemployed female persons), q7 =
COUNT (unemployed persons). Obviously, q4, q5, and q7 are
redundant because q4 can be derived from q1 - q2; similarly,
q5 = q2 - q3 and q7 = q6 + q2 - q3. There can be many
other redundant queries, too, e.g., q8 = COUNT (employed
persons), q9 = COUNT (employed female persons), etc. As the
set of answered queries grows, so will the redundancies.
Moreover, no matter how long the history of the answered
queries is, there is only a finite set of nonredundant answered
queries since the information in the database is finite. Third,
the efficiency of checking for compromisability of a new
query depends on how the set of nonredundant answered
queries is represented. Consider the previous example: the set
of nonredundant answered queries can be {q1, q2, q3, q6} or
{q3, q5, q6, q9}. Even though the information conveyed in
these two representations is the same, it will be clear that the
latter representation is superior to the former representation.
¹Strictly speaking, there is no such thing as absolute security because
there are many unknowns in the system, e.g., users' knowledge. Absolute security is formally defined to mean that no individual information can
be inferred solely from the history of the answered queries.
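The redundancy claims in this example can be checked mechanically. The sketch below is illustrative only: it assumes every person falls into exactly one of four cells (employed male, unemployed male, employed female, unemployed female), writes each COUNT query as a 0/1 vector over those cells, and calls a query redundant when it does not raise the rank of the set of answered queries. The names `rank`, `QUERIES`, and `is_redundant` are hypothetical helpers, not part of the paper.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of integer vectors, by exact Gaussian elimination."""
    m = [[Fraction(v) for v in r] for r in rows]
    rk = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        pivot = next((i for i in range(rk, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][col] != 0:
                f = m[i][col] / m[rk][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

# Assumed cell order: (employed male, unemployed male,
#                      employed female, unemployed female)
QUERIES = {
    "q1": (1, 1, 1, 1),  # all persons
    "q2": (1, 1, 0, 0),  # male persons
    "q3": (1, 0, 0, 0),  # employed male persons
    "q4": (0, 0, 1, 1),  # female persons
    "q5": (0, 1, 0, 0),  # unemployed male persons
    "q6": (0, 0, 0, 1),  # unemployed female persons
    "q7": (0, 1, 0, 1),  # unemployed persons
    "q8": (1, 0, 1, 0),  # employed persons
    "q9": (0, 0, 1, 0),  # employed female persons
}

def is_redundant(name, answered):
    """True if query `name` is a linear combination of the answered queries."""
    basis = [QUERIES[a] for a in answered]
    return rank(basis + [QUERIES[name]]) == rank(basis)

if __name__ == "__main__":
    for name in ("q4", "q5", "q7", "q8", "q9"):
        print(name, is_redundant(name, ["q1", "q2", "q3", "q6"]))
```

Under these assumptions both {q1, q2, q3, q6} and {q3, q5, q6, q9} have rank 4, so each set carries the same information as all nine queries together.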
The goal of this paper is to present a set of time and storage
efficient procedures, called Audit Expert, for auditing SDB's.
This Audit Expert, unlike the concept of experts in [34], does
not have semantic knowledge embedded in it. However, it is
intelligent in the sense that it can check the new query for
compromise, and change its auditing strategy for efficiency
purposes. The Audit Expert has the following features.
1) It works with no or very little intervention of the DBA.
2) It is isolated and independent from the DBMS and can
be "added on" to an existing DBMS without major modifications.
3) It is time and storage efficient.
Section II introduces the basic definitions and preliminaries.
Security results and checking procedures are described in
Section III. Section IV considers efficient procedures for processing batched queries. It is also shown in this section that the
problem of maximizing the number of answered queries is NP-complete. Section V extends the auditing to a more general
environment. Section VI presents the conclusions and the discussion of future research.
II. DEFINITIONS AND PRELIMINARIES
An SDB consists of n individuals $r_i$, $1 \le i \le n$. For notational
simplicity, each individual $r_i$ is assumed to have a single
protected numerical attribute value $x_i$. Generalization to
individuals with more than one protected attribute value is
straightforward. A query $q$ specifies a set of individuals,
called query set $S(q)$, and associates with each individual $r_i$ a
nonnegative integer $a_i$ with the property that $a_i \ge 1$ if $r_i \in S(q)$
and $a_i = 0$ otherwise. The response to $q$ is the weighted
sum $\sum_{i=1}^{n} a_i x_i$ if $q$ is answerable (i.e., $q$ does not lead to
compromise); otherwise the response is undefined. In this
paper, we assume that the SDB can only answer SUM queries,
i.e., $\sum_{\{i \mid r_i \in S(q)\}} x_i$. This is a special case of the above model
with $a_i = 1$. The generalization for $a_i > 1$, for $r_i \in S(q)$, is
straightforward.
Each answered query reveals a linear equation $\sum_{i=1}^{n} a_i x_i = d$
with $a_i \in \{0, 1\}$ for some constant $d$. Thus, from now on, the
terms query and equation will be used interchangeably. Given
a set of answered queries, or equivalently, a set of equations,
users can form new equations by addition and/or scalar multiplication of the equations corresponding to the set of
answered queries. The users' knowledge set is then defined
as the set of linear equations obtained from linear combinations of equations in the set of answered queries. The database is said to be compromised if there exists an equation with
a single variable of the form, xi = c in the users' knowledge
set. Since we are only interested in the security of the SDB,
and not in knowing the actual attribute values of individuals,
we need only to keep track of the values of the ai's for the
answered queries and can ignore their responses. In other
words, as far as the security problem is concerned, the set of
answered queries AQ is defined as the set of vectors
$\{(a_{i1}, a_{i2}, \ldots, a_{in}) \mid \sum_{j=1}^{n} a_{ij} x_j \text{ is a response to a user's query}\}$.
The user's knowledge space KS is the vector space spanned by the
set of vectors in AQ. Formally, KS has the following
properties.
1) If $q \in AQ$, then $q \in KS$.
2) If $q \in KS$, then $bq \in KS$, where b is a real number.
3) If $q_1, q_2 \in KS$, then $q_1 + q_2 \in KS$.
4) Nothing else is in KS.

Procedure CHECK(q);
  if q ∈ KS then inform the SDB to answer q
  else if the SDB is secure when q is added to AQ
    then begin
      inform the SDB to answer q;
      modify KS to include q
    end
    else inform the SDB to ignore q
    endif
  endif
Fig. 1. Procedure CHECK to check for compromise when a new query q is received.
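Properties 2) and 3) are what make compromise possible: a user may scale and add answered queries freely. The following sketch is a hypothetical illustration, not an example from the paper. It assumes three individuals and three answered SUM queries over pairs, and shows a linear combination of them that isolates a single individual; `linear_combination` is an assumed helper name.

```python
from fractions import Fraction

def linear_combination(coeffs, vectors):
    """Return sum_i coeffs[i] * vectors[i], computed exactly."""
    n = len(vectors[0])
    return [sum(Fraction(c) * v[j] for c, v in zip(coeffs, vectors))
            for j in range(n)]

# Assumed answered queries: x1 + x2, x2 + x3, x1 + x3
answered = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]

# (1/2)(x1 + x2) - (1/2)(x2 + x3) + (1/2)(x1 + x3) = x1
coeffs = [Fraction(1, 2), Fraction(-1, 2), Fraction(1, 2)]
derived = linear_combination(coeffs, answered)

if __name__ == "__main__":
    print(derived)  # a vector with a single nonzero entry means compromise
```

Because the derived vector has the form (1, 0, 0), the value x1 lies in the users' knowledge and the database in this illustration would be compromised.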
Example 1: Assume we have six individuals $r_i$, $1 \le i \le 6$, in
the database, classified by sex (male, female), marital status
(unmarried, married), and profession (doctor, engineer). As the
query vectors used in the examples indicate, $r_5$ is the only female;
$r_1$ and $r_4$ are doctors; $r_2$ and $r_6$ are unmarried engineers; and
$r_3$ and $r_5$ are married engineers.
Assume annual salary $x_i$ of each $r_i$ is protected. Let the queries
  q1 = (1, 1, 1, 1, 0, 1)  (SUM salary of male persons)
  q2 = (0, 1, 0, 0, 0, 1)  (SUM salary of unmarried engineers)
be answered, i.e., AQ = {q1, q2}. We then have
KS = {(y, (y + z), y, y, 0, (y + z)) | y, z real numbers}.
We say a query $q = (a_1, \ldots, a_n)$ is redundant if q is a linear
combination of (or linearly dependent on) the set of vectors in
AQ, that is, $q \in KS$. We also say the SDB, or $r_i$, is compromised if there exists a vector of the form $(0, \ldots, 0, a_i, 0, \ldots, 0)$
in KS with $a_i = 1$. Conversely, we say the SDB is
secure if none of the $r_i$'s is compromised.
The task of the Audit Expert is to keep a storage efficient
representation of KS and, when it receives a new query, to
execute the procedure CHECK in Fig. 1.
We describe in the following section the Audit Expert with
its representation of KS and its checking procedure.
III. AUDIT EXPERT
KS can be represented by a maximal set of nonredundant
vectors in AQ, that forms the set of basis vectors in KS. Let
the dimension of KS be k, i.e., there are k basis vectors,
$\{(a_{i1}, \ldots, a_{in}),\ i = 1, 2, \ldots, k\}$.
KS can be represented by a k × n matrix of the form
$$\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kn} \end{bmatrix}$$
where $a_{ij} = 0$ or 1, $1 \le i \le k$, $1 \le j \le n$.
Since the rows of the above matrix are linearly independent
vectors, by elementary row operations, the matrix can be
transformed to matrix $B_k$ with the property that there exist k
columns each of which has exactly one nonzero element.
Without loss of generality, $B_k$ is of the form
$$B_k = [\, I_k \mid B' \,]$$
where $I_k$ = k × k identity matrix and $B'$ = k × (n - k) matrix.
Example 2: Consider the database and queries in Example 1.
We have
$$\begin{bmatrix} q_1 \\ q_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \quad \text{and} \quad B_2 = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix}.$$
The following theorems explain why KS is represented in the
form of $B_k$. From now on, let $q = (a_1, a_2, \ldots, a_n)$ where
$a_i = 1$ or 0.
Theorem 1: $q \in KS$ iff $q = \sum_{i=1}^{k} a_i b_i$.
Proof: Trivial because $b_1, \ldots, b_k$ are the basis vectors of
KS and $b_i$ resembles the ith unit vector. □
Theorem 2: The SDB is secure iff there does not exist a row
$b_i$ in $B_k$ such that $b_{ij} = 0$ for all j, $k < j \le n$, where $B_k = (b_{ij})$.
Proof:
"Only if" Part: It follows directly from the definition.
"If" Part: We want to show that there does not exist a
vector of the form $(0, \ldots, 0, 1, 0, \ldots, 0)$ which is a linear
combination of $\{b_1, \ldots, b_k\}$. The "1" cannot be at the ith
position with $i \le k$ since none of the $b_i$'s is of that form.
Besides, the "1" cannot be at a position larger than k because
any linear combination of the $b_i$'s will have at least one nonzero element at a position less than or equal to k. □
Theorems 1 and 2 state that if KS is represented in the form
of matrix $B_k$, then it is a simple task to check whether $q \in KS$
and whether the SDB is secure.
From Theorem 2, it is also obvious that if k = n, the SDB
cannot be secure because $B_n$ is an identity matrix, and all the
$r_i$'s will then be compromisable. On the other hand, by considering the following example, we see that k can be as large
as (n - 1) while maintaining SDB security:
$$B_{n-1} = [\, I_{n-1} \mid \mathbf{1} \,]$$
where $\mathbf{1}$ is the column vector of all 1's.
Based on this, we now analyze the worst case time complexity of procedure CHECK(q).
1) From Theorem 1, checking whether $q \in KS$ takes no
more than O(kn) steps, where O-notation is described in [2].
2) If $q \notin KS$, define $q' = q - \sum_{i=1}^{k} a_i b_i$ where $a_i$ denotes
the ith element of q. Let $q' = (a'_1, \ldots, a'_n)$. $q'$ has the property that $a'_j = 0$ for $1 \le j \le k$. Without loss of generality,
suppose $a'_{k+1} \ne 0$. Add $q'$ to $B_k$ to form a new (k + 1) × n
matrix having $q'$ as the (k + 1)st row. Normalize the new
matrix making $a'_{k+1} = 1$ and $b_{i,k+1} = 0$ for $1 \le i \le k$. Then we
have $B_{k+1}$. From Theorem 2, checking whether the SDB is
secure takes no more than O(kn) steps. Thus, the total time
for this step takes no more than O(kn) steps.
Since k can be as large as n - 1, we have the following
theorem.
Theorem 3: The Audit Expert takes no more than $O(n^2)$ to
process a new query.
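Steps 1) and 2), combined with procedure CHECK, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the class name `AuditExpert` and its methods are hypothetical, exact rationals from `fractions` replace whatever arithmetic an actual system would use, and instead of permuting columns so that the pivots occupy the first k positions, each basis row simply records its own pivot column. Theorem 2 still applies in that form: the SDB is compromised exactly when some reduced row has a single nonzero entry.

```python
from fractions import Fraction

class AuditExpert:
    """Keeps a fully reduced basis of KS; each row has a 1 at its pivot
    column and every other row is 0 there (the B_k form, up to column order)."""

    def __init__(self, n):
        self.n = n
        self.basis = []  # list of (pivot_column, row)

    def _residual(self, q):
        # q' = q - sum_i a_i b_i, where a_i is q's entry at row i's pivot
        r = [Fraction(v) for v in q]
        for p, b in self.basis:
            if r[p] != 0:
                c = r[p]
                r = [x - c * y for x, y in zip(r, b)]
        return r

    def check(self, q):
        """Return True if q may be answered, False if it must be refused."""
        r = self._residual(q)
        if all(v == 0 for v in r):
            return True  # q is already in KS (Theorem 1): redundant, safe
        p = next(j for j, v in enumerate(r) if v != 0)
        new_row = [v / r[p] for v in r]  # normalize so the new pivot is 1
        candidate = []
        for pc, b in self.basis:
            c = b[p]  # clear the new pivot column in the old rows
            candidate.append((pc, [x - c * y for x, y in zip(b, new_row)]))
        candidate.append((p, new_row))
        # Theorem 2: a row with a single nonzero entry reveals one individual
        if any(sum(1 for v in row if v != 0) == 1 for _, row in candidate):
            return False  # refuse; KS is left unchanged
        self.basis = candidate
        return True

if __name__ == "__main__":
    ae = AuditExpert(6)
    for q in [(1, 1, 1, 1, 0, 1), (0, 1, 0, 0, 0, 1), (0, 0, 1, 0, 1, 0)]:
        print(q, ae.check(q))
    print(ae.check((0, 0, 1, 0, 0, 0)))  # would isolate an individual
```

Each call touches at most k rows of length n, which matches the O(kn) bound per query stated above.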
Example 3: Consider the database and queries in Examples
1 and 2. Let a new query $q_3$ be (0, 0, 1, 0, 1, 0) (i.e., salaries
of married engineers). From Theorem 1, we have $q_3 \notin KS$
since
$$\sum_{i=1}^{2} q_{3i}\, b_i = 0 \cdot (1, 0, 1, 1, 0, 0) + 0 \cdot (0, 1, 0, 0, 0, 1) = (0, 0, 0, 0, 0, 0) \ne q_3.$$
Thus, we have
$$\begin{bmatrix} B_2 \\ q_3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \end{bmatrix} \quad \text{and} \quad B_3 = \begin{bmatrix} 1 & 0 & 0 & 1 & -1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \end{bmatrix}.$$
IV. ANSWERING A SET OF QUERIES
In this section, we examine the problem of handling a batch
of user queries $q_i$, $1 \le i \le t$. Section IV-A discusses an efficient method of checking for compromise caused by a batch
of user queries. Section IV-B considers the problem of
optimizing the available information to the users without
compromising the SDB. Unfortunately, this optimization
problem is intractable. In the following notations $a_{ij}$ and $b_{ij}$
refer to the jth element of the ith query and of the ith basis
vector, respectively. In the discussion, we assume that $b_{ij} = 0$
for $1 \le i, j \le k$, $i \ne j$, $b_{ii} \ne 0$, and $b_{ii}$ and $b_{ij}$, $k < j \le n$, are all
integers. Dealing with integers will simplify the presentation
without loss of generality.
A. Checking for Compromise
The most straightforward method of processing a batch of
user queries, $q_i$, $1 \le i \le t$, is to separately check whether each
$q_i \in KS$. Let us call this method algorithm $L_0$. Clearly algorithm $L_0$ takes O(tkn) steps. A better method is to decide
with a single check whether the whole batch of queries is in
KS. In order to perform this check, a new query Q is defined
as $Q = q_1 + K q_2 + K^2 q_3 + \cdots + K^{t-1} q_t$ where K is a constant
sufficiently large so that the queries $q_i$ do not interfere with
each other. Consequently, instead of checking the whole
batch of queries one by one, we need only check Q for
compromise. If the probability that $Q \in KS$ is high then such
a check improves the efficiency of the Audit Expert
significantly.
First define
k = the dimension of the KS-matrix, $B_k$
$b_{max} = \max_{b_{ij} \in B_k} |b_{ij}|$
$K > (1 + k\, b_{max}) \prod_{i=1}^{k} |b_{ii}|$.
Theorem 4: $Q \in KS$ iff $q_i \in KS$ for all $1 \le i \le t$.
Proof:
"If" Part: The proof follows directly from the closure
property of KS.
"Only if" Part: Since Q can be expressed as
$q_1 + K(q_2 + \cdots + K(q_{t-2} + K(q_{t-1} + K q_t)) \cdots)$, it is sufficient to
show that $q_i \in KS$ for all $1 \le i \le t$ by only showing the case t = 2,
$Q = q_1 + K q_2 = (a_{11} + K a_{21}, \ldots, a_{1n} + K a_{2n})$.
From Theorem 1 and the fact that $b_{ij}$ are integers, we have
$$Q = q_1 + K q_2 = \sum_{i=1}^{k} (a_{1i}/b_{ii})\, b_i + K \sum_{i=1}^{k} (a_{2i}/b_{ii})\, b_i. \quad (1)$$
We are going to prove by contradiction that
$$q_1 = \sum_{i=1}^{k} (a_{1i}/b_{ii})\, b_i \quad \text{and} \quad q_2 = \sum_{i=1}^{k} (a_{2i}/b_{ii})\, b_i.$$
Without losing generality, assume the first equality does not
hold at the jth position, i.e., $a_{1j} - \sum_{i=1}^{k} (a_{1i}/b_{ii})\, b_{ij} \ne 0$. It is
easy to see that $a_{2j} - \sum_{i=1}^{k} (a_{2i}/b_{ii})\, b_{ij} \ne 0$. From (1) and
K > 0, we have
$$K = \frac{\left| a_{1j} - \sum_{i=1}^{k} (a_{1i}/b_{ii})\, b_{ij} \right|}{\left| a_{2j} - \sum_{i=1}^{k} (a_{2i}/b_{ii})\, b_{ij} \right|} = \frac{\left| z\, a_{1j} - \sum_{i=1}^{k} a_{1i}\, y_i\, b_{ij} \right|}{\left| z\, a_{2j} - \sum_{i=1}^{k} a_{2i}\, y_i\, b_{ij} \right|} \quad (2)$$
where $z = \prod_{i=1}^{k} |b_{ii}|$ and $y_i = z / b_{ii}$. Since
$$\left| z\, a_{1j} - \sum_{i=1}^{k} a_{1i}\, y_i\, b_{ij} \right| \le (1 + k\, b_{max}) \prod_{i=1}^{k} |b_{ii}|$$
and $\left| z\, a_{2j} - \sum_{i=1}^{k} a_{2i}\, y_i\, b_{ij} \right|$ is a nonzero integer,
equation (2) implies $K \le (1 + k\, b_{max}) \prod_{i=1}^{k} |b_{ii}|$, contradicting the construction that $K > (1 + k\, b_{max}) \prod_{i=1}^{k} |b_{ii}|$. Thus,
$q_1, q_2 \in KS$. □
We now have shown from the above discussion and Theorem
4 that, instead of checking a batch of t queries one by one by
calling procedure CHECK($q_i$) t times, one can form a new
query Q which includes all these t queries and check for
$Q \in KS$. If $Q \in KS$ then all $q_i$, $1 \le i \le t$, can be answered
without compromise. Otherwise, for each $q_i$, $1 \le i \le t$,
CHECK($q_i$) is called. Let us call this algorithm $L_1$.
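The single-check step of algorithm $L_1$ can be sketched as below. The sketch assumes the KS-matrix is supplied as integer rows whose pivots occupy the first k columns (the form assumed in this section), picks K as the smallest integer above the bound $(1 + k\, b_{max}) \prod |b_{ii}|$, and forms Q by Horner's rule. The function names are illustrative assumptions, and the matrix used in the usage lines is the $B_3$ of Example 3.

```python
from fractions import Fraction
from math import prod

def in_ks(q, B):
    """Theorem 1 in integer form: q is in KS iff q = sum_i (q_i / b_ii) * b_i."""
    k = len(B)
    combo = [sum(Fraction(q[i], B[i][i]) * B[i][j] for i in range(k))
             for j in range(len(q))]
    return combo == list(q)

def batch_constant(B):
    """Smallest integer K with K > (1 + k * b_max) * prod |b_ii|."""
    k = len(B)
    b_max = max(abs(v) for row in B for v in row)
    return (1 + k * b_max) * prod(abs(B[i][i]) for i in range(k)) + 1

def combine(queries, K):
    """Q = q_1 + K q_2 + ... + K^(t-1) q_t, evaluated by Horner's rule."""
    Q = [0] * len(queries[0])
    for q in reversed(queries):
        Q = [K * x + a for x, a in zip(Q, q)]
    return Q

def batch_in_ks(queries, B):
    """Single check for the whole batch (Theorem 4)."""
    return in_ks(combine(queries, batch_constant(B)), B)

if __name__ == "__main__":
    B3 = [[1, 0, 0, 1, -1, 0],
          [0, 1, 0, 0, 0, 1],
          [0, 0, 1, 0, 1, 0]]
    batch = [(1, 1, 1, 1, 0, 1), (1, 0, 1, 1, 0, 0), (0, 1, 1, 0, 1, 1)]
    print(batch_constant(B3), combine(batch, batch_constant(B3)))
    print(batch_in_ks(batch, B3))
```

If `batch_in_ks` returns False, the batch falls back to checking each query individually, as described above.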
Example 4: Consider the database and queries in Examples
1-3. Let the new batch of queries be {q4, q5, q6} where
  q4 = (1, 1, 1, 1, 0, 1)  (SUM salary of male persons)
  q5 = (1, 0, 1, 1, 0, 0)  (SUM salary of unmarried doctors and married male persons)
  q6 = (0, 1, 1, 0, 1, 1)  (SUM salary of engineers).
We have $b_{max} = 1$, $K = 5$ (since $K > 1 + 3$) and
$$Q = q_4 + K q_5 + K^2 q_6 = (6, 26, 31, 6, 25, 26).$$
Since $\sum_{i=1}^{3} Q_i\, b_i = (6, 26, 31, 6, 25, 26) = Q$, we have $Q \in KS$
and from Theorem 4 we conclude that $q_4, q_5, q_6 \in KS$.
Next we investigate the expected time improvement of $L_1$
over $L_0$.
1) Expected Improvements: Let p be the probability that a
new query is in KS. We assume that p is estimated or monitored by the Audit Expert. Now if $2^n$ is sufficiently large, one
may also assume that
$p^t$ = probability that all t queries are in KS.
Let us first derive expected cost equation $C_0(t)$ of $L_0$. Let
$q' = q - \sum_{i=1}^{k} (a_i / b_{ii})\, b_i$. By Theorem 1, $q'$ = zero vector iff
$q \in KS$. However, since we would like to deal with integers
(see Section IV) and $B_k$ is an integer matrix, we multiply $q'$ by
$z = \prod_{i=1}^{k} b_{ii}$ and obtain integer vector $q''$ as
$$q'' = z\, q' = z\, q - \sum_{i=1}^{k} a_i\, y_i\, b_i$$
where $y_i = (1/b_{ii}) \prod_{l=1}^{k} b_{ll}$. Clearly, $q''$ is a zero
vector iff $q'$ is. Moreover, the first k elements of $q''$ are always
zero (i.e., $q''_i = 0$, $1 \le i \le k$, where $q'' = (q''_1, q''_2, \ldots, q''_n)$).
Thus to compute $q''$ it suffices to compute $q''_i$, $k + 1 \le i \le n$.
Computing $z \cdot q$ does not need any multiplications since $a_i$
(an element of q) is 0 or 1. Similarly, to compute $a_i \cdot y_i \cdot b_i$,
arithmetic operations are needed only to compute $y_i \cdot b_i$.
Assuming z and $y_i$, $1 \le i \le k$, are maintained by the Audit
Expert, and the last (n - k) terms of $q''$ are computed using
$$q'' = (\cdots((z \cdot q - a_1 y_1 b_1) - a_2 y_2 b_2) \cdots) - a_k y_k b_k,$$
we need k(n - k) multiplications and k(n - k) subtractions to
compute $q''$. Thus checking $q \in KS$ takes 2k(n - k) operations.
If $q \notin KS$ then whether to answer q or not is decided as
follows: add $q''$ to $B_k$ to form a new (k + 1) × n matrix $B_{k+1}$ and
modify $B_{k+1}$ by
$$b_i = q''_{k+1} \cdot b_i - b_{i(k+1)} \cdot q'', \quad 1 \le i \le k. \quad (3)$$
Now if there exists a row j of $B_{k+1}$ with only $b_{jj}$ nonzero, then
q leads to compromise and is not answered, otherwise q is
answered. Since $b_{ij}$ is zero, $i \ne j$, $1 \le j \le k + 1$, $B_{k+1}$ is computed by 2k(n - k) multiplications and k(n - k) subtractions,
i.e., 3k(n - k) operations. Thus,
$$C_0(t = 1) = 2k(n - k) + p_1 \cdot 3k(n - k)$$
where $p_1$ is the probability that $q \notin KS$ and q does not lead to
compromise. Clearly, $p_1 \le \bar{p}$, where $\bar{p} = 1 - p$. Similarly, assuming
$p_1$ constant, we have for t = 2
$$C_0(t = 2) = C_0(t = 1) + (1 + \tfrac{3}{2} p_1)\, 2k'(n - k')$$
where $k' = k + p_1$ is the expected value of k after the first
query in the batch is checked for compromise. Thus,
$$C_0(t) = \sum_{i=0}^{t-1} 2(k + i p_1)\,[n - (k + i p_1)]\,(1 + \tfrac{3}{2} p_1)$$
with the constraint that $k + (t - 1) p_1 \le n$. With simple
manipulations,
$$C_0(t) = 2(1 + \tfrac{3}{2} p_1)\, w(p_1)$$
where
$$w(p_1) = kt(n - k) + p_1\, t(t - 1)(n/2 - k) - \tfrac{1}{3}\, p_1^2\, t(t - 1)(t - \tfrac{1}{2}).$$
Let us now derive $C_1(t)$, expected cost equation for $L_1$. Assuming $b_{max}$ is maintained by the system the cost of finding
K may be ignored. The cost of forming
$Q = q_1 + K q_2 + \cdots + K^{t-1} q_t = (\cdots((K q_t + q_{t-1}) K + q_{t-2}) \cdots) K + q_1$
is n(t - 1) multiplications and n(t - 1) additions, i.e., 2n(t - 1) operations. Since checking for $Q \in KS$ takes 2k(n - k) operations,
one has
$$C_1(t) = 2n(t - 1) + 2k(n - k) + (1 - p^t)\, C_0(t).$$
To compare $C_0(t)$ and $C_1(t)$,² we define the gain G(t) as
$$G(t) = C_0(t) - C_1(t) = 2p^t\, w(p_1) + 3p^t\, p_1\, w(p_1) - [2n(t - 1) + 2k(n - k)]. \quad (4)$$
Clearly, G(t) is maximized when p = 1. Also, for any n, k, and
t values, there is a p value that makes the gain G(t) zero. Let
us tabulate these p values. Since $p_1 \le \bar{p}$, replacing $p_1$ by $\bar{p}$ in
G(t) introduces errors in the first and second terms of (4).
Let $\bar{p} = p_1 + \epsilon$ where $\epsilon > 0$. Then we have
$$G(t) = 2p^t\, w(\bar{p}) + 3p^t\, \bar{p}\, w(\bar{p}) - [2n(t - 1) + 2k(n - k)] + E_1 + E_2$$
where the introduced errors $E_1$, $E_2$ in the first and second
terms of (4) are
$$E_1 = 2p^t \left[ \tfrac{1}{3}(2\epsilon\bar{p} - \epsilon^2)\, t(t - 1)(t - \tfrac{1}{2}) - \epsilon\, t(t - 1)(n/2 - k) \right]$$
$$E_2 = 3p^t \left[ -\epsilon\, kt(n - k) + (-2\bar{p}\epsilon + \epsilon^2)\, t(t - 1)(n/2 - k) + (\bar{p}^2\epsilon + \tfrac{1}{3}\epsilon^3 - \bar{p}\epsilon^2)\, t(t - 1)(t - \tfrac{1}{2}) \right].$$
Since $2\epsilon\bar{p} - \epsilon^2 > 0$ and $\epsilon > 0$, by deleting the positive terms
we have
$$E_1 > \begin{cases} 0 & \text{if } k \ge n/2 \\ -2p^t\, \bar{p}\, t(t - 1)(n/2 - k) & \text{otherwise.} \end{cases}$$
²One referee pointed out to us that when k becomes sufficiently large,
$L_1$ may need multiple precision arithmetic, and thus number of operations used in $L_0$ and $L_1$ may not be comparable.
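The expected-cost model can be explored numerically. The sketch below assumes the cost expressions $C_0(t) = 2(1 + \tfrac{3}{2} p_1)\, w(p_1)$ and $C_1(t) = 2n(t - 1) + 2k(n - k) + (1 - p^t)\, C_0(t)$, with $w(p_1)$ the closed form of $\sum_{i=0}^{t-1} (k + i p_1)(n - k - i p_1)$; the function names are hypothetical, and the parameter values in the usage lines are arbitrary illustrations, not figures from the paper.

```python
def w(p1, n, k, t):
    """Closed form of sum_{i=0}^{t-1} (k + i*p1) * (n - k - i*p1)."""
    return (k * t * (n - k)
            + p1 * t * (t - 1) * (n / 2 - k)
            - (p1 ** 2) * t * (t - 1) * (t - 0.5) / 3)

def w_sum(p1, n, k, t):
    """Same quantity, summed term by term (used to cross-check w)."""
    return sum((k + i * p1) * (n - (k + i * p1)) for i in range(t))

def cost_L0(p1, n, k, t):
    """Expected operations when the t queries are checked one by one."""
    return 2 * (1 + 1.5 * p1) * w(p1, n, k, t)

def cost_L1(p, p1, n, k, t):
    """Expected operations when the batch is first checked as a single Q."""
    return (2 * n * (t - 1) + 2 * k * (n - k)
            + (1 - p ** t) * cost_L0(p1, n, k, t))

def gain(p, p1, n, k, t):
    """G(t) = C0(t) - C1(t); positive when batching is expected to pay off."""
    return cost_L0(p1, n, k, t) - cost_L1(p, p1, n, k, t)

if __name__ == "__main__":
    # Arbitrary illustrative parameters (assumptions, not from the paper)
    for t in (2, 4, 8, 16):
        print(t, round(gain(0.9, 0.05, 50, 20, t), 1))
```

Scanning t for fixed n, k, and p in this way is one possible method of locating the batch size at which the gain peaks.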
Similarly, with $\bar{p} = 1 - p$ and $\bar{p} = p_1 + \epsilon$, since $-1 < -2\bar{p}\epsilon + \epsilon^2 < 0$ and $\bar{p}^2\epsilon - \bar{p}\epsilon^2 > 0$ we
have
$$E_2 > \begin{cases} -3p^t\, \bar{p}\, kt(n - k) & \text{if } k \ge n/2 \\ -3p^t \left[ \bar{p}\, kt(n - k) + t(t - 1)(n/2 - k) \right] & \text{otherwise.} \end{cases}$$
Thus,
$$G(t) > 2p^t\, w(\bar{p}) + 3p^t\, \bar{p}\, w(\bar{p}) - [2n(t - 1) + 2k(n - k)] + E_1' + E_2'$$
where $E_1'$ and $E_2'$ denote the lower bounds on $E_1$ and $E_2$.
For any given n, k, and t values, p values that make the RHS
of above inequality zero will give a lower bound on p probabilities for which $L_1$ is superior to $L_0$ in the expected case.
Table I tabulates these probabilities for n = 50 and n = 100.
Clearly, for sufficiently large p, as t increases from t = 2,
G(t) is expected to increase and then decrease since the terms
with $p^t$ will diminish. Thus, a typical curve of G(t) for $t \ge 2$
looks as in Fig. 2. Table II lists t values (i.e., t* in Fig. 2),
that maximize the gain G(t) (with ±1 variations in t* due to
errors $E_1$ and $E_2$) for n = 50 and n = 100. A possible
scenario for the Audit Expert is to keep tables containing
t = t* that maximizes the gain for varying n, k, and p values,
and to adjust the batch size accordingly. One may envision
algorithms that further split the batch of t queries if $Q \notin KS$.
However, as it is seen from Table II, t values that maximize the
gain G(t) are very small. Thus, any expected case speed gains
due to more complicated algorithms are bound to be small,
and are not investigated further.

TABLE I
p PROBABILITIES ABOVE WHICH THE GAIN G(t) IS POSITIVE FOR n = 50, 100 AND VARYING k, t

n = 50
k \ t     2     4     6     8    10    12    16    20    24    30
10      .78   .81   .85   .89   .92   .95   .97
20      .75   .76   .80   .83   .85   .87   .90   .92   .93   .95
30      .75   .75   .79   .82   .85   .86   .89   .91   .92   .94
40      .76   .78   .82   .84   .87   .88   .91   .92   .93   .95
45      .80   .82   .86   .88   .90   .91   .93   .95    --    --

n = 100
k \ t     2     4     6     8    10    12    16    20    24    30
10      .78   .81   .86   .90   .94   .97
20      .74   .76   .80   .83   .86   .88   .91   .93   .95   .98
40      .73   .74   .77   .81   .83   .85   .88   .90   .91   .93
60      .73   .73   .77   .80   .83   .84   .87   .89   .91   .92
80      .74   .75   .78   .81   .83   .85   .88   .90   .91   .92
90      .76   .77   .81   .84   .86   .88   .90   .91   .92   .94

Fig. 2. Typical G(t) for $t \ge 2$ and sufficiently large p.

B. Maximizing the Number of Answered Queries
As long as the SDB is not compromised, it is desirable to
provide the users with as much information as possible. To
this end, after receiving a batch of queries, the Audit Expert
may try to answer the maximum number of queries without
compromising the database. Consider the following example:
the database has four records and has answered two queries,
q1 = (1, 1, 1, 1) and q2 = (1, 1, 0, 0). Assume the batch of
new queries is q3 = (0, 1, 1, 0), q4 = (0, 0, 1, 1), q5 =
(1, 0, 0, 1), and q6 = (1, 0, 1, 0). The system can always
answer q4 because $q_4 \in KS$ since q4 = q1 - q2. If the system
has chosen to answer q6, no other queries can be answered
without compromising the SDB. However, the SDB would
answer the maximum number of queries by choosing to
answer q3 and q5 instead of q6. Unfortunately, the problem
of answering the maximum number of users' queries is NP-hard
even under a very restricted situation.
Define the "minimum edge-deletion bipartite³ subgraph"
problem [19] and the "maximum query-answered auditing"
problem as follows.
MEBS: Given an undirected graph G = (V, E) and a positive
integer k, does G have a bipartite subgraph formed by deleting
k or fewer edges?
MQ: Given a set of individuals and their attribute values, a
set of answered queries, a batch of new queries B and a positive integer k, does there exist a set of answerable queries
$B' \subseteq B$ formed by deleting k or fewer queries in B?
The MEBS problem has been shown NP-complete in [19].
We shall prove that the MQ problem is NP-complete from the
reduction of MEBS problem. In fact, we present the stronger
result that the MQ problem remains NP-complete even when
the set of answered queries is null and every new query in the
batch involves exactly two individuals. From now on, we consider just this restricted MQ problem (RMQ problem).
Theorem 5: MEBS problem ∝ RMQ problem.⁴
³A bipartite graph is defined as a graph, the vertices of which can be
divided into two disjoint subsets such that no vertex in a subset is
adjacent to vertices in the same subset.
⁴The notation ∝ means "is polynomially reducible to."
Before we describe the construction used in the reduction,
we discuss the RMQ problem to enhance understanding. A set
of queries, each involving attribute values of exactly two
individuals, can be characterized by a query graph [4]. An
undirected graph G = (V, E) is called a query graph for a database if V is the set of individuals $r_i$, $1 \le i \le n$; and $(r_i, r_j)$ is in
E if and only if there exists a query q involving the attribute
values of $r_i$ and $r_j$ (i.e., $x_i$ and $x_j$). It is also shown in [4] that
if every query involves attribute values of exactly two individuals, then a necessary and sufficient condition for a secure
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-8, NO. 6, NOVEMBER 1982
580
TABLE 1I
BATCH SIZE t THAT MAXIMIZES THE GAIN G(t) FOR n
50, 100 AND VARYING k, p
n=100
n=50
p
P
0.80
0.85
0.90
10
4
5
8
10
4
5
8
15
20
4
6
8
0.95
16
17
15'
4
5
8
16
501
4
6
9
18
20
4
5
8
16
70
4
6
8
17
30
4
5
8
16
80
4
5
8
16
40;
-
5
7
14
90
3
5
7
14
45
-
4
6
11
k
5
0.80
0.85
0.90
0.95
k
-
4
7
13
SDB is the nonexistence of odd cycles in the corresponding query graph. Thus, in order to protect the SDB, the corresponding query graph for the set of answered queries should be bipartite. If the query graph for the batch of new queries does not turn out to be bipartite, some queries must be deleted from the batch (or equivalently not answered) so as to make the resultant query graph bipartite. Unfortunately, the problem of deleting the minimum number of edges (queries) from the query graph (batch of new queries) in order to make the resultant graph bipartite is NP-complete.

Proof of Theorem 5: Formally, we transform MEBS to RMQ. Let $G = (V, E)$ be the undirected graph. We shall construct a set of individuals and a batch of new queries $B$ such that $G$ has a bipartite subgraph by deleting $k$ or fewer edges iff $B$ has a subset of answerable queries of size at least $|B| - k$. The construction of RMQ replaces each vertex in $V$ by an individual in the database and each edge in $E$ by a new query, based on $G$ as its query graph. The above discussion based on the result in [4] assures us that this is indeed the required transformation.

V. EXTENSIONS

In previous sections, we have discussed the basic principles behind the Audit Expert. There are still a few modifications which can improve its performance.

A. Time and Storage Improvement

We have discussed how the efficiency of the Audit Expert depends on the representation of KS. Each row in the KS-matrix represents a nonredundant answered query. A good representation of the KS-matrix should have the least possible overlap between the query sets corresponding to the rows of the KS-matrix. That is, it is desirable to have as small a number of nonzero entries in the KS-matrix as possible. This is analogous to having a set of "orthogonal" vectors as the basis for a vector space.

One may also observe that individuals with similar characteristics tend to be together in the answered queries. Accordingly, the KS-matrix tends to have identical columns, and it is reasonable to represent all those identical columns by a single one in order to save storage and to speed up the checking process. Consider the example in the introduction. Let $\{x_1, x_5\}$, $\{x_2, x_7\}$, $\{x_4, x_8\}$, and $\{x_3, x_6, x_9\}$ be the attribute value groups of employed males, unemployed males, employed females, and unemployed females, respectively. Assume the KS is represented by the set of queries $\{q_3, q_5, q_6, q_9\}$, each of which corresponds to a group of individuals. Thus, instead of storing the KS-matrix as

$$
\begin{array}{c|ccccccccc}
 & x_1 & x_2 & x_3 & x_4 & x_5 & x_6 & x_7 & x_8 & x_9 \\ \hline
q_3 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
q_5 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
q_6 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
q_9 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0
\end{array}
$$

we can group the identical columns together and have

$$
\begin{array}{c|cccc}
 & G_1 & G_2 & G_3 & G_4 \\ \hline
q_3 & 1 & 0 & 0 & 0 \\
q_5 & 0 & 1 & 0 & 0 \\
q_6 & 0 & 0 & 1 & 0 \\
q_9 & 0 & 0 & 0 & 1
\end{array}
$$

where $G_1 = \{x_1, x_5\}$, $G_2 = \{x_2, x_7\}$, $G_3 = \{x_3, x_6, x_9\}$, and $G_4 = \{x_4, x_8\}$.

Let us call $G_i$ a basic group. Obviously, $\bigcup_{i=1}^{m} G_i = \{x_1, \ldots, x_n\}$ and $G_i \cap G_j = \emptyset$ for all $i \neq j$. Moreover, all the individuals in a basic group always appear together in all the answered queries. Basically, procedure CHECK(q) is executed exactly as before. Besides the fact that the reduced KS-matrix is smaller in size, it also allows us to claim that the SDB is secure if none of the basic groups is a singleton set. However, there is an overhead for representing the KS-matrix in terms of the basic groups. If the new query splits up some basic groups, extra columns are needed for the split groups in the KS-matrix. Fortunately, as we will show, the overhead in checking and splitting basic groups takes no more than $O(n)$ time and storage.

Procedure SPLIT(q) checks whether the new query $q$ splits up any basic groups; if so, new basic groups are created. It labels (or flags) and identifies the basic group for every element in the query set $S(q)$. Assume individual $r_i$ is an element of $S(q)$, and the corresponding attribute value $x_i$ belongs to $G_k$. Procedure SPLIT(q) checks whether all individuals $r_i$ corresponding to $x_i$ in $G_k$ are in $S(q)$. If not, $G_k$ is split into two basic groups, $G_k \cap \{x_i \mid r_i \in S(q)\}$ and $G_k \cap \{x_i \mid r_i \notin S(q)\}$. Since all basic groups are disjoint, we have the following result.
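The grouping of identical KS-matrix columns into basic groups can be illustrated with a minimal Python sketch. The function names `basic_groups` and `reduce_matrix` and the list-of-rows matrix layout are assumptions made for this illustration; they are not part of the Audit Expert as described in the paper.

```python
from collections import defaultdict

def basic_groups(ks_matrix):
    """Group 1-based column indices of a 0/1 KS-matrix by identical columns.

    Hypothetical helper: each returned list plays the role of one basic group.
    """
    by_column = defaultdict(list)
    for j, column in enumerate(zip(*ks_matrix), start=1):
        by_column[column].append(j)
    # dicts keep insertion order, so groups appear in order of first column
    return list(by_column.values())

def reduce_matrix(ks_matrix, groups):
    """Keep one representative column per basic group."""
    return [[row[g[0] - 1] for g in groups] for row in ks_matrix]

# KS-matrix of the running example, rows q3, q5, q6, q9 over x1..x9
KS = [
    [1, 0, 0, 0, 1, 0, 0, 0, 0],  # q3
    [0, 1, 0, 0, 0, 0, 1, 0, 0],  # q5
    [0, 0, 1, 0, 0, 1, 0, 0, 1],  # q6
    [0, 0, 0, 1, 0, 0, 0, 1, 0],  # q9
]
groups = basic_groups(KS)
reduced = reduce_matrix(KS, groups)
```

For the example matrix, the sketch yields the four groups of the text, and the reduced matrix is the 4 x 4 identity.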
Procedure SPLIT(q)
Input:  m basic groups G_i, i = 1, 2, ..., m, stored as linked lists;
        array T(1::n), where T(i) is the index of the basic group to which x_i belongs;
        q, the new query in the form (a_1, a_2, ..., a_n), where a_i ∈ {0, 1} and initially is unlabeled.
Output: The new set of basic groups G_i, i = 1, ..., m'.

     begin i ← 0; m' ← m;
(1)    while i ← i + 1 ≤ n do
(2)      if a_i = 1 and unlabeled
(3)      then begin
               k ← T(i);
               G_{m'+1} ← ∅;
(4)            for all x_j ∈ G_k do
                 if a_j = 1
                 then label a_j
                 else /* split the basic group */
                   begin
                     G_k ← G_k − {x_j};
                     G_{m'+1} ← G_{m'+1} ∪ {x_j};
                     T(j) ← m' + 1;
                   end
(5)            if G_{m'+1} ≠ ∅ then m' ← m' + 1 endif
             end
       endwhile
     end.

Fig. 3. Procedure SPLIT.
Theorem 6: Procedure SPLIT requires $O(n)$ time and storage to create a new set of basic groups.
Proof: Steps 3-5 of the procedure SPLIT shown in Fig. 3 are executed only when $a_i$ is unlabeled. Since step 4 always labels all the elements in $G_k \cap \{x_i \mid r_i \in S(q)\}$ at the same time, either all or none of the elements in $G_k \cap \{x_i \mid r_i \in S(q)\}$ are labeled. Because all the basic groups are disjoint, step 4 will not label any element more than once and will not be executed more than $n$ times. Moreover, since $G_k$ and $G_{m'+1}$ are stored as linked lists, updating each one at step 4 can be done in constant time. Thus, steps 3 and 5 are executed at most $n$ times, and the total time is $O(n)$. Since the $T$ array and the linked lists for the $G_i$ are of size $n$, the storage requirement is also $O(n)$.
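Procedure SPLIT can be sketched in Python as follows. This is only an illustration under stated assumptions: basic groups are held in a dict of sets rather than linked lists, individuals and groups are numbered from 1, and the name `split` and its signature are hypothetical. Each group is still scanned at most once, so the work stays linear in n (expected time, because of hashing).

```python
def split(groups, T, q):
    """Refine basic groups against a new query q (illustrative sketch).

    groups: dict mapping group index (1..m) to a set of individual indices
    T:      dict mapping individual index i to the index of its basic group
    q:      list of 0/1 values; q[i - 1] is a_i
    Mutates groups and T in place and returns the new number of groups m'.
    """
    n = len(q)
    labeled = [False] * (n + 1)
    m = len(groups)
    for i in range(1, n + 1):
        if q[i - 1] == 1 and not labeled[i]:
            k = T[i]
            new_group = set()
            for j in list(groups[k]):      # copy: groups[k] is modified below
                if q[j - 1] == 1:
                    labeled[j] = True      # x_j stays in G_k
                else:                      # split the basic group
                    groups[k].discard(j)
                    new_group.add(j)
                    T[j] = m + 1
            if new_group:
                m += 1
                groups[m] = new_group
    return m

# Basic groups of the running example and a query over {x1, x3, x6, x9}
groups = {1: {1, 5}, 2: {2, 7}, 3: {3, 6, 9}, 4: {4, 8}}
T = {i: k for k, members in groups.items() for i in members}
m_new = split(groups, T, [1, 0, 1, 0, 0, 1, 0, 0, 1])
```

In this example the query contains all of $G_3$ but only $x_1$ from $G_1$, so $G_1$ is split into $\{x_1\}$ and a new group $\{x_5\}$, while $G_3$ is left intact.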
B. Protection on Some Particular Queries
On many occasions, the SDB, besides preventing the
individual's information from compromise, may at the same
time be required to protect the answer for a certain set of
individuals, say $S'$. As a consequence, the answer for the
query $\hat{q}$, with $S(\hat{q}) = S'$, should be prohibited. This protection is easy to achieve if the whole set of individuals $S(\hat{q})$ is
grouped together and considered as a single individual. However, this protection scheme disallows any information about
any proper subsets of $S(\hat{q})$, and is overly restrictive. On the
other hand, if information about the subsets of $S(\hat{q})$ is revealed without any precaution, then users may manipulate this
information to obtain the answer to $\hat{q}$. The known techniques
seem to be unable to cope with this problem effectively and
efficiently.
The problem of revealing information about subsets of $S(\hat{q})$ and at the same time protecting the answer to $\hat{q}$ can be solved under the present implementation of the Audit Expert with a minor modification to the procedure CHECK. Assume KS is represented by a $k \times n$ matrix $B_k$, $\hat{q}$ is the query to be protected, $\hat{q} \notin KS$, and $q$ is the query to be checked for compromise and answered if the SDB is secure. If $q \in KS$, then the answer to $\hat{q}$ is protected. Assume $q \notin KS$; the procedure checks whether $q$ leads to compromise by obtaining the $(k + 1) \times n$ matrix $B_{k+1}$ with new basis vectors $b_i$, $1 \le i \le k + 1$, as described in step 2) in Section III.
Theorem 7: The answer to $\hat{q}$ is protected, i.e., $\hat{q} \notin KS$, if and only if $\hat{q} \neq \sum_{i=1}^{k+1} \hat{a}_i b_i$.
Proof: Similar to the proof of Theorem 1. □
In order to check for compromise of protected attribute values, $B_{k+1}$ and the $b_i$ vectors, $1 \le i \le k + 1$, are computed. From the above theorem, checking whether $\hat{q} \in KS$ takes no more than $O(kn)$ steps. Thus the total time needed for CHECK(q) after incorporating an additional check of whether $\hat{q}$ is secure still takes $O(n^2)$ time.
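The idea behind the protected-query check can be illustrated with a short Python sketch. It does not reproduce the paper's CHECK procedure or its $O(n^2)$ bound: it simply tests, by plain Gaussian elimination over the rationals, whether the protected query vector would fall in the span of the answered queries once a new query is added. The names `rank`, `in_span`, and `may_answer` are hypothetical, and the separate check for compromise of individual values is omitted.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of integer vectors (Gaussian elimination, exact)."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    n_cols = len(m[0]) if m else 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def in_span(basis, vector):
    """True if vector is a linear combination of the basis rows."""
    return rank(basis + [vector]) == rank(basis)

def may_answer(basis, q, protected):
    """Answer q only if the protected query stays outside the enlarged KS."""
    return not in_span(basis + [q], protected)

answered = [[1, 1, 0, 0]]          # one answered SUM query over x1, x2
protected = [1, 1, 1, 1]           # protected query over all four individuals
refuse = may_answer(answered, [0, 0, 1, 1], protected)   # would reveal it
allow = may_answer(answered, [0, 0, 1, 0], protected)    # does not reveal it
```

Answering the query over $\{x_3, x_4\}$ would let a user add it to the already answered query and obtain the protected sum, so it is refused; the query over $\{x_3\}$ alone leaves the protected sum outside the knowledge space.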
C. Protection under a Dynamic Environment
So far we have considered a static database system without
changes such as insertions, deletions, or updates of individuals.
Below we consider these changes and show that the Audit
Expert also works very well in a dynamic environment.
1) Insertions: This is taken care of easily by adding another column of all zeros in the KS-matrix. The security of
any other individual is not affected.
2) Deletions: There are two cases depending on whether or
not the information about the deleted individual needs protection. If the attribute value of the deleted individual does not
require protection and revealing it to the users does not lead
to any other compromise, then the column which corresponds
to that deleted individual in the KS-matrix can be eliminated
permanently. Otherwise, there are no changes to the KS-matrix, and everything is processed as usual except that the
deleted individual will never be involved in any other queries.
Consequently, some queries which were answered may not be
answerable after the deletion.
3) Updates: If the value x of an individual in the database
gets changed to x', it is equivalent to an insertion of x' followed immediately by a deletion of x or vice versa. If the
value of x needs protection, the new KS-matrix should have
two columns, one for x and the other for x'. In practice, these
two columns can be merged into one and, as a result, the KS-matrix remains unchanged. However, this implementation of
updates does not protect the change between x and x'. For
certain information, such as the increase in the salary of an
employee or the increase in profit of a company, it is desirable
to protect the amount of change too. One way to implement
this is to add to the KS-matrix a new column which corresponds to the change between x and x'. Every time x' is
referenced in a query, the old value x and the change (x - x')
will be involved. Since the old value x and the change are
protected, x' will be protected at the same time.
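The column operations described for insertions, deletions, and updates can be sketched in Python on a list-of-rows KS-matrix. Everything here is an illustrative assumption: columns are 0-based, the helper names are hypothetical, the matrix is assumed to have at least one row, and the change column is taken to hold $d = x' - x$, so that a reference to $x'$ is encoded as $x + d$ (the sign convention for the change is a choice made for this sketch).

```python
def add_zero_column(ks_matrix):
    """Append an all-zero column to every row; return the new column index."""
    for row in ks_matrix:
        row.append(0)
    return len(ks_matrix[0]) - 1

def drop_column(ks_matrix, j):
    """Deletion of an individual whose value needs no protection."""
    for row in ks_matrix:
        del row[j]

def query_vector(members, n_cols, change_col):
    """0/1 vector for a SUM query over the current values of `members`.

    change_col maps an individual's column to the column holding its
    protected change; referencing x' involves both x and the change.
    """
    vec = [0] * n_cols
    for j in members:
        vec[j] = 1
        if j in change_col:
            vec[change_col[j]] = 1
    return vec

ks = [[1, 1, 0]]                          # one answered query over x0, x1
new_person = add_zero_column(ks)          # insertion: column 3
delta_col = add_zero_column(ks)           # protected change of x0: column 4
change_col = {0: delta_col}
q = query_vector({0, 2}, len(ks[0]), change_col)

ks_after_delete = [row[:] for row in ks]
drop_column(ks_after_delete, 1)           # x1 deleted, value not protected
```

A new query that references the updated individual therefore touches both its old column and its change column, which is what keeps the updated value protected as long as those two are.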
D. User Preknowledge
In some cases, it is possible that users have additional
knowledge about the database. For example, assume the
response to q' is known to the users even though q' is never
asked. This knowledge can be taken care of by simply adding
q' to the knowledge space KS and modifying the KS-matrix.
VI. CONCLUSION
We have discussed the basic principles behind an auditing mechanism, called the Audit Expert, in SDB's for SUM queries. We have described two procedures, CHECK to check for compromise and SPLIT to maintain a storage-efficient representation. To answer a batch of queries efficiently, a fast compromisability check is described. It is also shown that maximizing the set of answerable queries in a given batch of queries is NP-complete.
The Audit Expert does not have semantic knowledge embedded in it, but it is intelligent in the sense that it can change its auditing strategy by using different algorithms in different conditions. A discussion of these algorithms and the mechanisms for other types of queries is deferred to another paper.
ACKNOWLEDGMENT
The authors would like to thank P. Higham for her careful reading of the manuscript and the referees for their very useful comments that improved the presentation of the paper.
REFERENCES
[1] J. D. Achugbue and F. Y. Chin, "The effectiveness of output modification by rounding for protection of statistical databases," INFOR, vol. 17, no. 3, pp. 209-218, 1979.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Reading, MA: Addison-Wesley, 1974.
[3] L. L. Beck, "A security mechanism for statistical database," Dep. Comput. Sci., Southern Methodist Univ., 1979; see also ACM TODS, Sept. 1980.
[4] F. Y. Chin, "Security in statistical databases for queries with small counts," ACM Trans. Database Syst., vol. 3, no. 1, pp. 92-104, 1978.
[5] F. Y. Chin and G. Ozsoyoglu, "Security in partitioned dynamic statistical databases," in Proc. IEEE 3rd Int. Conf. Comput. Software and Applications, Nov. 1979.
[6] -, "Statistical database design," ACM Trans. Database Syst., vol. 6, no. 1, pp. 113-130, 1981.
[7] -, "Security of statistical databases," in Advances in Computer Security Management. New York: Hayden, 1980.
[8] G. Davida, D. Linton, G. Szelag, and D. Wells, "Security of statistical databases," Dep. Elec. Eng. & Comput. Sci., Univ. of Wisconsin, Milwaukee, Rep. TR-CS-76-14, 1976.
[9] R. DeMillo, D. Dobkin, and R. J. Lipton, "Combinatorial inference," in Foundations of Secure Computation, DeMillo et al., Eds. New York: Academic, 1978, pp. 27-38 (presented at a 3 day workshop, Georgia Inst. Technol., Atlanta, Oct. 1977).
[10] R. DeMillo, D. Dobkin, and R. J. Lipton, "Even databases that lie can be compromised," IEEE Trans. Software Eng., vol. SE-4, no. 1, pp. 73-75, 1978.
[11] D. E. Denning, "Are statistical databases secure?," in Proc. AFIPS NCC, vol. 47, 1978.
[12] D. E. Denning, P. J. Denning, and M. D. Schwartz, "The tracker: A threat to statistical database security," ACM Trans. Database Syst., vol. 4, no. 1, pp. 76-96, 1979.
[13] D. E. Denning, "Secure statistical databases with random sample queries," ACM TODS, Sept. 1980.
[14] D. E. Denning and J. Schlorer, "A fast procedure for finding a tracker in a statistical database," ACM Trans. Database Syst., vol. 5, no. 1, pp. 88-102, 1980.
[15] D. Dobkin, R. J. Lipton, and S. P. Reiss, "Aspects of the database security problem," in Proc. Conf. Theoretical Comput. Sci., Waterloo, Canada, 1977, pp. 262-274.
[16] D. Dobkin, A. K. Jones, and R. J. Lipton, "Secure databases: Protection against user inference," ACM Trans. Database Syst., vol. 4, no. 1, pp. 97-106, 1979.
[17] I. P. Fellegi, "On the question of statistical confidentiality," J. Amer. Statist. Ass., vol. 67, pp. 7-18, 1972.
[18] I. P. Fellegi and J. L. Phillips, "Statistical confidentiality: Some theory and applications to data dissemination," Ann. Econ. Soc. Measurement, vol. 3, no. 2, pp. 399-409, 1972.
[19] M. Garey, D. Johnson, and L. Stockmeyer, "Some simplified NP-complete graph problems," J. Theory Comput. Sci., vol. 1, pp. 237-267, 1976.
[20] D. K. Hsiao, D. S. Kerr, and S. E. Madnick, "Privacy and security of data communication and databases," in VLDB Proc., 1978, pp. 56-67.
[21] L. J. Hoffman and W. F. Miller, "Getting a personal dossier from a statistical data bank," Datamation, vol. 16, no. 5, pp. 74-75, 1970.
[22] L. J. Hoffman, Modern Methods for Computer Security and Privacy. Englewood Cliffs, NJ: Prentice-Hall, 1977.
[23] J. B. Kam and J. D. Ullman, "A model of statistical databases and their security," ACM Trans. Database Syst., vol. 2, no. 1, pp. 1-10, 1977.
[24] M. S. Nargundkar and W. Saveland, "Random rounding to prevent statistical disclosure," in Proc. Amer. Statist. Ass., Soc. Statist. Sec., 1972, pp. 382-385.
[25] G. Ozsoyoglu and F. Y. Chin, "Enhancing the security of statistical databases with a question-answering system and a kernel design," Dep. Comput. Sci., Univ. Alberta, Tech. Rep., 1980.
[26] G. Ozsoyoglu, "Secure statistical database design," Ph.D. dissertation, Dep. Comput. Sci., Univ. Alberta, 1980.
[27] S. P. Reiss, "Medians and database security," in Foundations of Secure Computation. New York: Academic, 1978, pp. 57-92.
[28] S. P. Reiss, "Security in databases: A combinatorial study," J. Ass. Comput. Mach., vol. 26, no. 1, pp. 45-57, 1979.
[29] J. Schlorer, "Identification and retrieval of personal records from a statistical data bank," Methods Inform. Med., vol. 14, no. 1, pp. 7-15, 1975.
[30] -, "Confidentiality of statistical records: A threat monitoring scheme for on-line dialogue," Methods Inform. Med., vol. 15, no. 1, pp. 36-42, 1976.
[31] -, "Union tracker and open statistical databases," TB-IMSD 1/78, Inst. Med. Statist. Dok., Univ. Giessen, 1979.
[32] -, "Security of statistical databases: Multidimensional transformation," ACM Trans. Database Syst., vol. 6, no. 1, pp. 85-112, 1981.
[33] M. D. Schwartz, D. E. Denning, and P. J. Denning, "Linear queries in statistical databases," ACM Trans. Database Syst., vol. 4, no. 2, pp. 156-167, 1979.
[34] M. Stonebraker and K. Keller, "Embedding expert knowledge and hypothetical data bases into a data base system," in ACM SIGMOD Proc., 1980, pp. 58-66.
[35] C. T. Yu and F. Y. Chin, "A study on the protection of statistical databases," in Proc. ACM SIGMOD Int. Conf. Management of Data, 1977, pp. 169-181.
Francis Y. Chin (S'71-M'76), for a photograph and biography, see p.
234 of the May 1982 issue of this TRANSACTIONS.
Gultekin Ozsoyoglu (S'79-M'80), for a photograph and biography, see
p. 234 of the May 1982 issue of this TRANSACTIONS.