IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. SE-8, NO. 6, NOVEMBER 1982

Auditing and Inference Control in Statistical Databases

FRANCIS Y. CHIN, MEMBER, IEEE, AND GULTEKIN OZSOYOGLU, MEMBER, IEEE

Abstract-A statistical database (SDB) may be defined as an ordinary database with the capability of providing statistical information to user queries. The security problem for the SDB is to limit the use of the SDB so that only statistical information is available and no sequence of queries is sufficient to infer protected information about any individual. When such information is obtained, the SDB is said to be compromised. Inference control mechanisms are internal protection mechanisms applied to SDB's. Many researchers have studied different protection mechanisms to prevent an SDB from being compromised. However, most of these mechanisms are either ineffective or inefficient or are only applicable to large SDB's. Auditing in SDB's was initially proposed in the form of investigating log trails manually. In this paper, we present a practical technique for managing the past history of users' queries, discuss how the sequence of all the answered queries of the SDB can be reduced and stored in finite storage, and describe how this storage scheme can provide an effective way of checking compromise. We believe that this will help us develop a more practical and efficient tool for protection in a small SDB than the previously known mechanisms. Further, we extend the idea to checking compromise of a set of queries in a more efficient way than one query at a time. We also show that the problem of maximizing the amount of information to the users without compromising the SDB is NP-complete.

Index Terms-Auditing, inference control, security, statistical databases.

I. INTRODUCTION

THE PROBLEM of enhancing the security of statistical databases (SDB's) has been of growing concern in recent years [15], [17], [23], [29], [33].
An SDB has been defined as one which returns statistical information, such as frequency counts of records satisfying some given criteria, as opposed to a database which returns details of an entity, for example, name and address of an employee. Statistical databases have wide applicability in areas such as medical research, health planning, and political planning. The security problem for an SDB is to limit its use so that only statistical information is available and no sequence of queries is sufficient to derive confidential information about any individual. When such information is obtained, the database is said to be compromised.

Manuscript received March 5, 1981; revised February 17, 1982. This work was supported in part by the National Sciences and Engineering Research Council under Grant A4319. A preliminary version of this paper was presented at the ACM '81 Annual Conference.
F. Y. Chin is with the Department of Computer Science, University of Alberta, Edmonton, Alta., Canada.
G. Ozsoyoglu is with the Department of Computer and Information Science, Cleveland State University, Cleveland, OH 44115.

Inference control mechanisms are internal protection mechanisms applied to SDB's. SDB protection mechanisms can be classified as follows [1]:
1) controlling the number of records satisfying the query (query set) [4], [12], [14], [21], [22], [31];
2) limiting excessive overlap between query sets [8], [16], [28];
3) partitioning the SDB [5], [18], [35];
4) modifying query responses and data, which includes output perturbation, data distortion, and random sampling [1], [3], [10], [13], [24], [27], [32];
5) employing security constraints at the conceptual data model level [6], [25], [26].

No one proposed protection mechanism is suitable for all SDB's. Protection mechanism 1 is shown to be compromisable [12], and protection mechanism 2 may not be feasible to implement. Mechanism 3 may be overly restrictive and limits the usefulness of the SDB.
Mechanism 4, which employs output perturbation, data distortion, or random sampling, may be effective for large SDB's but sacrifices provision of precise answers to user queries. Mechanism 5 may be applicable when the implementation of a conceptual model is feasible and the needs of the users are not very diverse. However, the overhead may be considerable. The proposed design of the conceptual data model is yet to be implemented, tested, and evaluated.

All of these five mechanisms are only good for large SDB's and none of them will work for small SDB's. One may argue that statistics are for large sets of data. However, this may not be true for many applications. For example, in medical research, an experiment may record the effects of certain drugs on a small number of individuals (say, 100 individuals or less). Government regulations and company practices normally limit the sample size of the experiment. Different statistical analyses on various subgroups of the individuals have to be performed, e.g., the individuals can be classified according to sex, age, weight, height, profession, salary, living conditions, diet, race, marital status, medical history, education, etc. Obviously, some of this information about each individual is strictly confidential. On the other hand, very precise statistical information for many different subgroups of individuals is needed to draw meaningful conclusions. Unfortunately, none of the existing protection mechanisms can meet all of these requirements.

Auditing in SDB's is also discussed in [22], [30]. Logs are maintained to record all the requests made by users along with the data involved. Logs are checked manually and periodically for any misuse of the data. Auditing is also mentioned in [9].

0098-5589/82/1100-0574$00.75 © 1982 IEEE

It has long been believed that auditing is an effective tool for protection.
The task of auditing may be delegated to the database system so that the database system: 1) keeps track of the history of answered queries and changes in the SDB, and 2) checks for possible compromise by every new query. Obviously, auditing may serve as a solution to the SDB security problem for small SDB's. It is also one of the better protection mechanisms because it has the following features.
1) Absolute Security:¹ By checking the past history of all the answered queries, auditing allows the SDB to answer a query only when it is secure to do so.
2) Maximum Information: Given the previous querying history of the SDB, auditing can provide the maximum information to the users. This includes accurate answers and as many query answers to the user as the security of the SDB permits.
3) Flexibility: It is more flexible to use because protection can be tailored to different sets of queries of users' choice.

Auditing may also become feasible for large SDB's when it is used together with the concept of compartmentalization [20]. Compartmentalization partitions individuals in the database into groups, so that individuals of a group have the same protection requirements. (Actually, mechanism 5 in [6], [26] achieves this compartmentalization using the semantic information about individuals in the database as a tool to define security atom populations.) Clearly, compartmentalization reduces the size of the database, and thus may make the auditing of each group of individuals in a compartment feasible.

Intuitively it seems infeasible to implement auditing because it is necessary to store and process the accumulated information about the sequence of previously answered queries. Fortunately, there are certain important properties about the history of the answered queries that can be used to simplify the task of auditing. First, the order of the answered queries is not important (assuming a static SDB).
Thus the "set" of answered queries can be used to replace the "sequence" of answered queries. Second, there are usually redundancies in the set of answered queries. Let us consider the following set of queries as an example:

$q_1$ = COUNT (all persons),
$q_2$ = COUNT (male persons),
$q_3$ = COUNT (employed male persons),
$q_4$ = COUNT (female persons),
$q_5$ = COUNT (unemployed male persons),
$q_6$ = COUNT (unemployed female persons),
$q_7$ = COUNT (unemployed persons).

Obviously, $q_4$, $q_5$, and $q_7$ are redundant because $q_4$ can be derived from $q_1 - q_2$; similarly, $q_5 = q_2 - q_3$ and $q_7 = q_6 + q_2 - q_3$. There can be many other redundant queries, too, e.g., $q_8$ = COUNT (employed persons), $q_9$ = COUNT (employed female persons), etc. As the set of answered queries grows, so will the redundancies. Moreover, no matter how long the history of the answered queries is, there is only a finite set of nonredundant answered queries since the information in the database is finite. Third, the efficiency of checking for compromisability of a new query depends on how the set of nonredundant answered queries is represented. Consider the previous example: the set of nonredundant answered queries can be $\{q_1, q_2, q_3, q_6\}$ or $\{q_3, q_5, q_6, q_9\}$. Even though the information conveyed in these two representations is the same, it will be clear that the latter representation is superior to the former.

¹Strictly speaking, there is no such thing as absolute security because there are many unknowns in the system, e.g., users' knowledge. Formally, absolute security means that no individual information can be inferred solely from the history of the answered queries.

The goal of this paper is to present a set of time and storage efficient procedures, called Audit Expert, for auditing SDB's. This Audit Expert, unlike the concept of experts in [34], does not have semantic knowledge embedded in it.
However, it is intelligent in the sense that it can check the new query for compromise, and change its auditing strategy for efficiency purposes. The Audit Expert has the following features.
1) It works with no or very little intervention of the DBA.
2) It is isolated and independent from the DBMS and can be "added on" to an existing DBMS without major modifications.
3) It is time and storage efficient.

Section II introduces the basic definitions and preliminaries. Security results and checking procedures are described in Section III. Section IV considers efficient procedures for processing batched queries. It is also shown in this section that the problem of maximizing the number of answered queries is NP-complete. Section V extends the auditing to a more general environment. Section VI presents the conclusions and the discussion of future research.

II. DEFINITIONS AND PRELIMINARIES

An SDB consists of $n$ individuals $r_i$, $1 \le i \le n$. For notational simplicity, each individual $r_i$ is assumed to have a single protected numerical attribute value $x_i$. Generalization to individuals with more than one protected attribute value is straightforward. A query $q$ specifies a set of individuals, called query set $S(q)$, and associates with each individual $r_i$ a nonnegative integer $a_i$ with the property that $a_i \ge 1$ if $r_i \in S(q)$ and $a_i = 0$ otherwise. The response to $q$ is the weighted sum $\sum_{i=1}^{n} a_i x_i$ if $q$ is answerable (i.e., $q$ does not lead to compromise); otherwise the response is undefined. In this paper, we assume that the SDB can only answer SUM queries, i.e., $\sum_{\{i \mid r_i \in S(q)\}} x_i$. This is a special case of the above model with $a_i = 1$. The generalization for $a_i > 1$, for $r_i \in S(q)$, is straightforward.

Each answered query reveals a linear equation $\sum_{i=1}^{n} a_i x_i = d$ with $a_i \in \{0, 1\}$ for some constant $d$. Thus, from now on, the terms query and equation will be used interchangeably.
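To make the correspondence between a SUM query and its linear equation concrete, here is a minimal Python sketch. The helper names `query_vector` and `answer`, and the attribute values in `x`, are hypothetical and chosen only for illustration.

```python
def query_vector(n, query_set):
    """Return the 0/1 coefficient vector (a_1, ..., a_n) of a SUM query.

    query_set holds the 1-based indices i of the individuals r_i in S(q).
    """
    return [1 if i in query_set else 0 for i in range(1, n + 1)]


def answer(vector, x):
    """Return the response d = sum of a_i * x_i for the protected values x."""
    return sum(a * xi for a, xi in zip(vector, x))


# Hypothetical protected values x_1 .. x_6 (not taken from the paper).
x = [30, 42, 55, 38, 61, 47]

q = query_vector(6, {2, 6})   # query set S(q) = {r_2, r_6}
d = answer(q, x)              # the user learns the equation x_2 + x_6 = d
```

Only the vector `q` matters for the security analysis that follows; the response `d` can be discarded once it has been returned to the user.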
Given a set of answered queries, or equivalently, a set of equations, users can form new equations by addition and/or scalar multiplication of the equations corresponding to the set of answered queries. The users' knowledge set is then defined as the set of linear equations obtained from linear combinations of equations in the set of answered queries. The database is said to be compromised if there exists an equation with a single variable of the form $x_i = c$ in the users' knowledge set.

Since we are only interested in the security of the SDB, and not in knowing the actual attribute values of individuals, we need only to keep track of the values of the $a_i$'s for the answered queries and can ignore their responses. In other words, as far as the security problem is concerned, the set of answered queries $AQ$ is defined as the set of vectors

$$AQ = \{(a_{i1}, a_{i2}, \ldots, a_{in}) \mid \textstyle\sum_{j=1}^{n} a_{ij} x_j \text{ is a response to a user's query}\}.$$

The user's knowledge space $KS$ is the vector space spanned by the set of vectors in $AQ$. Formally, $KS$ has the following properties.
1) If $q \in AQ$, then $q \in KS$.
2) If $q \in KS$, then $bq \in KS$; $b$ is a real number.
3) If $q_1, q_2 \in KS$, then $q_1 + q_2 \in KS$.
4) Nothing else is in $KS$.

    Procedure CHECK(q);
      if q ∈ KS then inform the SDB to answer q
      else if the SDB is secure when q is added to AQ
        then begin
          inform the SDB to answer q;
          modify KS to include q
        end
        else inform the SDB to ignore q
      endif
      endif

Fig. 1. Procedure CHECK to check for compromise when a new query q is received.

Example 1: Assume we have six individuals $r_i$, $1 \le i \le 6$, in the database as shown below.

[Table: the six individuals $r_1, \ldots, r_6$ classified by sex (male, female), marital status (unmarried, married), and profession (doctor, engineer).]

Assume annual salary $x_i$ of each $r_i$ is protected. Let the queries

$q_1 = (1, 1, 1, 1, 0, 1)$ (SUM salary of male persons)
$q_2 = (0, 1, 0, 0, 0, 1)$ (SUM salary of unmarried engineers)

be answered, i.e., $AQ = \{q_1, q_2\}$. We then have

$$KS = \{(y, (y + z), y, y, 0, (y + z)) \mid y, z \text{ real numbers}\}.$$

We say a query $q = (a_1, \ldots, a_n)$ is redundant if $q$ is a linear combination of (or linearly dependent on) the set of vectors in $AQ$, that is, $q \in KS$. We also say the SDB, or $r_i$, is compromised if there exists a vector of the form $(0, \ldots, 0, a_i, 0, \ldots, 0)$ in $KS$ with $a_i = 1$. Conversely, we say the SDB is secure if none of the $r_i$'s is compromised. The task of the Audit Expert is to keep a storage efficient representation of $KS$ and, when it receives a new query, to execute the procedure CHECK in Fig. 1. We describe in the following section the Audit Expert with its representation of $KS$ and its checking procedure.

III. AUDIT EXPERT

$KS$ can be represented by a maximal set of nonredundant vectors in $AQ$ that forms the set of basis vectors in $KS$. Let the dimension of $KS$ be $k$, i.e., there are $k$ basis vectors, $\{(a_{i1}, \ldots, a_{in}),\ i = 1, 2, \ldots, k\}$. $KS$ can be represented by a $k \times n$ matrix of the form

$$\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{k1} & \cdots & a_{kn} \end{bmatrix}$$

where $a_{ij} = 0$ or $1$, $1 \le i \le k$, $1 \le j \le n$. Since the rows of the above matrix are linearly independent vectors, by elementary row operations, the matrix can be transformed to matrix $B_k$ with the property that there exist $k$ columns each of which has exactly one nonzero element. Without loss of generality, $B_k$ is of the form

$$B_k = [\, I_k \mid B' \,]$$

where $I_k$ = $k \times k$ identity matrix and $B'$ = $k \times (n - k)$ matrix.

Example 2: Consider the database and queries in Example 1. We have

$$B_1 = \begin{bmatrix} 1 & 1 & 1 & 1 & 0 & 1 \end{bmatrix} \quad \text{and} \quad B_2 = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix}.$$

The following theorems explain why $KS$ is represented in the form of $B_k$. From now on, let $q = (a_1, a_2, \ldots, a_n)$ where $a_i = 1$ or $0$, and let $b_i$ denote the $i$th row of $B_k$.

Theorem 1: $q \in KS$ iff $q = \sum_{i=1}^{k} a_i b_i$.

Proof: Trivial because $b_1, \ldots, b_k$ are the basis vectors of $KS$ and $b_i$ resembles the $i$th unit vector.

Theorem 2: The SDB is secure iff there does not exist a row $b_i$ in $B_k$ such that $b_{ij} = 0$ for all $j$, $k < j \le n$, where $B_k = (b_{ij})$.

Proof: "Only if" Part: It follows directly from the definition.
"If" Part: We want to show that there does not exist a vector of the form $(0, \ldots, 0, 1, 0, \ldots, 0)$ which is a linear combination of $\{b_1, \ldots, b_k\}$. The "1" cannot be at the $i$th position with $i \le k$ since none of the $b_i$'s is of that form. Besides, the "1" cannot be at a position larger than $k$ because any nonzero linear combination of the $b_i$'s will have at least one nonzero element at a position less than or equal to $k$.

Theorems 1 and 2 state that if $KS$ is represented in the form of matrix $B_k$, then it is a simple task to check whether $q \in KS$ and whether the SDB is secure. From Theorem 2, it is also obvious that if $k = n$, the SDB cannot be secure because $B_n$ is an identity matrix, and all the $r_i$'s will then be compromisable. On the other hand, by considering the following example, we see that $k$ can be as large as $(n - 1)$ while maintaining SDB security:

$$B_{n-1} = [\, I_{n-1} \mid \mathbf{1} \,]$$

where $\mathbf{1}$ is a column of 1's.

Based on this, we now analyze the worst case time complexity of procedure CHECK($q$).

1) From Theorem 1, checking whether $q \in KS$ takes no more than $O(kn)$ steps, where $O$-notation is described in [2].

2) If $q \notin KS$, define $q' = q - \sum_{i=1}^{k} a_i b_i$ where $a_i$ denotes the $i$th element of $q$. Let $q' = (a'_1, \ldots, a'_n)$. $q'$ has the property that $a'_j = 0$ for $1 \le j \le k$. Without loss of generality, suppose $a'_{k+1} \ne 0$. Add $q'$ to $B_k$ to form a new $(k + 1) \times n$ matrix having $q'$ as the $(k + 1)$st row. Normalize the new matrix, making $a'_{k+1} = 1$ and $b_{i,k+1} = 0$ for $1 \le i \le k$. Then we have $B_{k+1}$. From Theorem 2, checking whether the SDB is secure takes no more than $O(kn)$ steps. Thus, the total time for this step takes no more than $O(kn)$ steps.

Since $k$ can be as large as $n - 1$, we have the following theorem.

Theorem 3: The Audit Expert takes no more than $O(n^2)$ time to process a new query.

Example 3: Consider the database and queries in Examples 1 and 2. Let a new query $q_3$ be $(0, 0, 1, 0, 1, 0)$ (i.e., SUM salary of married engineers).
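The checking step analyzed above, applied to queries such as this one, can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `AuditExpert`, its method names, and the use of `fractions.Fraction` for exact arithmetic are choices made for this sketch and do not come from the paper. Instead of permuting columns so that the identity block comes first, the sketch records an explicit pivot column for each basis row.

```python
from fractions import Fraction


class AuditExpert:
    """Keeps KS as a reduced basis: every row has a 1 in its own pivot
    column and a 0 in the pivot columns of all other rows."""

    def __init__(self, n):
        self.n = n
        self.rows = []     # basis rows b_i as lists of Fraction
        self.pivots = []   # pivots[i] is the pivot column of rows[i]

    def _residual(self, q):
        """Subtract from q its expansion over the basis (cf. Theorem 1)."""
        v = [Fraction(a) for a in q]
        for row, p in zip(self.rows, self.pivots):
            c = v[p]
            if c != 0:
                v = [vi - c * ri for vi, ri in zip(v, row)]
        return v

    def check(self, q):
        """Return True if q may be answered, False if it must be ignored."""
        residual = self._residual(q)
        if all(c == 0 for c in residual):
            return True                      # q is redundant: q is in KS
        # Normalize the residual and clear its pivot column in other rows.
        p = next(j for j, c in enumerate(residual) if c != 0)
        new_row = [c / residual[p] for c in residual]
        candidate = [[ri - row[p] * ni for ri, ni in zip(row, new_row)]
                     for row in self.rows]
        candidate.append(new_row)
        # Theorem 2: compromise iff some row has a single nonzero entry.
        if any(sum(1 for c in row if c != 0) == 1 for row in candidate):
            return False                     # answering q would compromise
        self.rows = candidate
        self.pivots.append(p)
        return True


audit = AuditExpert(6)
answered = [audit.check(q) for q in [(1, 1, 1, 1, 0, 1),    # q1
                                     (0, 1, 0, 0, 0, 1),    # q2
                                     (0, 0, 1, 0, 1, 0)]]   # q3
# Hypothetical extra query: together with q2 and q3 it would isolate x_6.
refused = audit.check((0, 1, 1, 0, 1, 0))
```

Each call costs $O(kn)$ arithmetic operations on the stored rows, in line with Theorem 3. Exact rationals are used so that the zero tests are reliable; Section IV of the paper works with integer matrices instead.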
From Theorem 1, we have $q_3 \notin KS$ since

$$\sum_{i=1}^{2} q_{3i} \cdot b_i = 0 \cdot (1, 0, 1, 1, 0, 0) + 0 \cdot (0, 1, 0, 0, 0, 1) = (0, 0, 0, 0, 0, 0) \ne q_3.$$

Thus, we have

$$\begin{bmatrix} B_2 \\ q_3 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \end{bmatrix} \quad \text{and} \quad B_3 = \begin{bmatrix} 1 & 0 & 0 & 1 & -1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 \end{bmatrix}.$$

No row of $B_3$ is zero in all of its last three positions, so by Theorem 2 the SDB remains secure and $q_3$ can be answered.

IV. ANSWERING A SET OF QUERIES

In this section, we examine the problem of handling a batch of user queries $q_i$, $1 \le i \le t$. Section IV-A discusses an efficient method of checking for compromise caused by a batch of user queries. Section IV-B considers the problem of optimizing the available information to the users without compromising the SDB. Unfortunately, this optimization problem is intractable. In the following, the notations $a_{ij}$ and $b_{ij}$ refer to the $j$th element of the $i$th query and of the $i$th basis vector, respectively. In the discussion, we assume that $b_{ij} = 0$ for $1 \le i, j \le k$, $i \ne j$, that $b_{ii} \ne 0$, and that $b_{ii}$ and $b_{ij}$, $k < j \le n$, are all integers. Dealing with integers will simplify the presentation without loss of generality.

A. Checking for Compromise

The most straightforward method of processing a batch of user queries $q_i$, $1 \le i \le t$, is to separately check whether each $q_i \in KS$. Let us call this method algorithm $L_0$. Clearly algorithm $L_0$ takes $O(tkn)$ steps. A better method is to decide with a single check whether the whole batch of queries is in $KS$. In order to perform this check, a new query $Q$ is defined as

$$Q = q_1 + Kq_2 + K^2 q_3 + \cdots + K^{t-1} q_t$$

where $K$ is a constant sufficiently large so that the queries $q_i$ do not interfere with each other. Consequently, instead of checking the whole batch of queries one by one, we need only check $Q$ for compromise. If the probability that $Q \in KS$ is high then such a check improves the efficiency of the Audit Expert significantly. First define

$k$ = the dimension of the KS-matrix $B_k$,
$b_{\max} = \max_{i,j} |b_{ij}|$,
$K > (1 + k \cdot b_{\max}) \cdot \prod_{i=1}^{k} |b_{ii}|$.

Theorem 4: $Q \in KS$ iff $q_i \in KS$ for all $1 \le i \le t$.

Proof: "If" Part: The proof follows directly from the closure property of $KS$.
"Only if" Part: Since $Q$ can be expressed as $q_1 + K(q_2 + K(q_3 + \cdots + K(q_{t-1} + Kq_t) \cdots))$, it is sufficient to show only the case $t = 2$, i.e., that $q_i \in KS$ for $1 \le i \le 2$ when

$$Q = q_1 + Kq_2 = (a_{11} + Ka_{21}, \ldots, a_{1n} + Ka_{2n}) \in KS.$$

From Theorem 1 and the fact that the $b_{ij}$ are integers, we have

$$Q = q_1 + Kq_2 = \sum_{i=1}^{k} (a_{1i}/b_{ii}) \cdot b_i + K \sum_{i=1}^{k} (a_{2i}/b_{ii}) \cdot b_i. \qquad (1)$$

We are going to prove by contradiction that

$$q_1 = \sum_{i=1}^{k} (a_{1i}/b_{ii}) \cdot b_i \quad \text{and} \quad q_2 = \sum_{i=1}^{k} (a_{2i}/b_{ii}) \cdot b_i.$$

Without losing generality, assume the first equality does not hold at the $j$th position, i.e., $a_{1j} - \sum_{i=1}^{k} (a_{1i}/b_{ii}) \cdot b_{ij} \ne 0$. It is easy to see that then $a_{2j} - \sum_{i=1}^{k} (a_{2i}/b_{ii}) \cdot b_{ij} \ne 0$. From (1) and $K > 0$, we have

$$K = \frac{\left| a_{1j} - \sum_{i=1}^{k} (a_{1i}/b_{ii}) \cdot b_{ij} \right|}{\left| a_{2j} - \sum_{i=1}^{k} (a_{2i}/b_{ii}) \cdot b_{ij} \right|} = \frac{\left| z\,a_{1j} - \sum_{i=1}^{k} a_{1i}\,y_i\,b_{ij} \right|}{\left| z\,a_{2j} - \sum_{i=1}^{k} a_{2i}\,y_i\,b_{ij} \right|} \qquad (2)$$

where

$$z = \prod_{i=1}^{k} b_{ii} \quad \text{and} \quad y_i = \prod_{l=1,\, l \ne i}^{k} b_{ll}.$$

Since

$$\left| z\,a_{1j} - \sum_{i=1}^{k} a_{1i}\,y_i\,b_{ij} \right| \le (1 + k \cdot b_{\max}) \prod_{i=1}^{k} |b_{ii}|$$

and $z\,a_{2j} - \sum_{i=1}^{k} a_{2i}\,y_i\,b_{ij}$ is a nonzero integer, equation (2) implies $K \le (1 + k \cdot b_{\max}) \prod_{i=1}^{k} |b_{ii}|$, contradicting the construction that $K > (1 + k \cdot b_{\max}) \cdot \prod_{i=1}^{k} |b_{ii}|$. Thus, $q_1, q_2 \in KS$.

We now have shown from the above discussion and Theorem 4 that, instead of checking a batch of $t$ queries one by one by calling procedure CHECK($q_i$) $t$ times, one can form a new query $Q$ which includes all these $t$ queries and check for $Q \in KS$. If $Q \in KS$ then all $q_i$, $1 \le i \le t$, can be answered without compromise. Otherwise, for each $q_i$, $1 \le i \le t$, CHECK($q_i$) is called. Let us call this algorithm $L_1$.

Example 4: Consider the database and queries in Examples 1-3. Let the new batch of queries be $\{q_4, q_5, q_6\}$ where

$q_4 = (1, 1, 1, 1, 0, 1)$ (SUM salary of male persons)
$q_5 = (1, 0, 1, 1, 0, 0)$ (SUM salary of unmarried doctors and married male persons)
$q_6 = (0, 1, 1, 0, 1, 1)$ (SUM salary of engineers).

We have $b_{\max} = 1$, $K = 5$ (since $K > 1 + 3$) and

$$Q = q_4 + Kq_5 + K^2 q_6 = (6, 26, 31, 6, 25, 26).$$

Since $\sum_{i=1}^{3} Q_i \cdot b_i = (6, 26, 31, 6, 25, 26) = Q$, we have $Q \in KS$ and from Theorem 4 we conclude that $q_4, q_5, q_6 \in KS$.

Next we investigate the expected time improvement of $L_1$ over $L_0$.

1) Expected Improvements: Let $p$ be the probability that a new query is in $KS$. We assume that $p$ is estimated or monitored by the Audit Expert. Now if $2^n$ is sufficiently large, one may also assume that

$p^t$ = probability that all $t$ queries are in $KS$.

Let us first derive the expected cost equation $C_0(t)$ of $L_0$. Let $q' = q - \sum_{i=1}^{k} (a_i/b_{ii}) \cdot b_i$. By Theorem 1, $q'$ = zero vector iff $q \in KS$. However, since we would like to deal with integers (see Section IV) and $B_k$ is an integer matrix, we multiply $q'$ by $z = \prod_{i=1}^{k} b_{ii}$ and obtain the integer vector $q''$ as

$$q'' = z \cdot q' = z \cdot q - \sum_{i=1}^{k} a_i \cdot y_i \cdot b_i$$

where $y_i = z/b_{ii} = \prod_{j=1,\, j \ne i}^{k} b_{jj}$. Clearly, $q''$ is a zero vector iff $q'$ is. Moreover, the first $k$ elements of $q''$ are always zero (i.e., $q''_i = 0$, $1 \le i \le k$, where $q'' = (q''_1, q''_2, \ldots, q''_n)$). Thus to compute $q''$ it suffices to compute $q''_i$, $k + 1 \le i \le n$. Computing $z \cdot q$ does not need any multiplications since $a_i$ (an element of $q$) is 0 or 1. Similarly, to compute $a_i \cdot y_i \cdot b_i$, arithmetic operations are needed only to compute $y_i \cdot b_i$. Assuming $z$ and $y_i$, $1 \le i \le k$, are maintained by the Audit Expert, and the last $(n - k)$ terms of $q''$ are computed using

$$q'' = (\cdots((z \cdot q - a_1 \cdot y_1 \cdot b_1) - a_2 \cdot y_2 \cdot b_2) \cdots) - a_k \cdot y_k \cdot b_k,$$

we need $k(n - k)$ multiplications and $k(n - k)$ subtractions to compute $q''$. Thus checking $q \in KS$ takes $2k(n - k)$ operations.

If $q \notin KS$ then whether to answer $q$ or not is decided as follows: add $q''$ to $B_k$ to form a new $(k + 1) \times n$ matrix $B_{k+1}$ and modify $B_{k+1}$ by

$$b_i = q''_{k+1} \cdot b_i - b_{i(k+1)} \cdot q'', \qquad 1 \le i \le k. \qquad (3)$$

Now if there exists a row $j$ of $B_{k+1}$ with only $b_{jj}$ nonzero, then $q$ leads to compromise and is not answered; otherwise $q$ is answered. Since $b_{ij}$ is zero, $i \ne j$, $1 \le j \le k + 1$, $B_{k+1}$ is computed by $2k(n - k)$ multiplications and $k(n - k)$ subtractions, i.e., $3k(n - k)$ operations. Thus,

$$C_0(t = 1) = 2k(n - k) + 3k(n - k) \cdot p_1$$

where $p_1$ is the probability that $q \notin KS$ and $q$ does not lead to compromise. Clearly, $p_1 \le \bar{p} = (1 - p)$. Similarly, assuming $p_1$ constant, we have for $t = 2$

$$C_0(t = 2) = C_0(t = 1) + \left(1 + \tfrac{3}{2} p_1\right) \cdot 2k'(n - k')$$

where $k' = k + p_1$ is the expected value of $k$ after the first query in the batch is checked for compromise. Thus,

$$C_0(t) = \sum_{i=0}^{t-1} 2(k + i p_1)\,[n - (k + i p_1)]\left(1 + \tfrac{3}{2} p_1\right)$$

with the constraint that $k + (t - 1) p_1 \le n$. With simple manipulations,

$$C_0(t) = 2\left(1 + \tfrac{3}{2} p_1\right) w(p_1)$$

where

$$w(p) = kt(n - k) + p\,t(t - 1)(n/2 - k) - \tfrac{1}{3} p^2\, t(t - 1)\left(t - \tfrac{1}{2}\right).$$

Let us now derive $C_1(t)$, the expected cost equation for $L_1$. Assuming $b_{\max}$ is maintained by the system, the cost of finding $K$ may be ignored. The cost of forming $Q = q_1 + Kq_2 + \cdots + K^{t-1} q_t = (\cdots((Kq_t + q_{t-1})K + q_{t-2}) \cdots)K + q_1$ is $n(t - 1)$ multiplications and $n(t - 1)$ additions, i.e., $2n(t - 1)$ operations. Since checking for $Q \in KS$ takes $2k(n - k)$ operations, one has

$$C_1(t) = 2n(t - 1) + 2k(n - k) + (1 - p^t)\,C_0(t).$$

To compare $C_0(t)$ and $C_1(t)$,² we define the gain $G(t)$ as

$$G(t) = C_0(t) - C_1(t) = 2p^t\,w(p_1) + 3p^t \cdot p_1 \cdot w(p_1) - [2n(t - 1) + 2k(n - k)]. \qquad (4)$$

²One referee pointed out to us that when $k$ becomes sufficiently large, $L_1$ may need multiple precision arithmetic, and thus the number of operations used in $L_0$ and $L_1$ may not be comparable.

Clearly, $G(t)$ is maximized when $p = 1$. Also, for any $n$, $k$, and $t$ values, there is a $p$ value that makes the gain $G(t)$ zero. Let us tabulate these $p$ values. Since $p_1 \le \bar{p}$, replacing $p_1$ by $\bar{p}$ in $G(t)$ introduces errors in the first and second terms of (4). Let $p_1 + \epsilon = \bar{p}$ where $\epsilon \ge 0$. Then we have

$$G(t) = 2p^t\,w(\bar{p}) + 3p^t \cdot \bar{p}\,w(\bar{p}) - [2n(t - 1) + 2k(n - k)] + E_1 + E_2$$

where the introduced errors $E_1$, $E_2$ in the first and second terms of (4) are

$$E_1 = 2p^t\left[\tfrac{1}{3}(2\epsilon\bar{p} - \epsilon^2)\,t(t - 1)\left(t - \tfrac{1}{2}\right) - \epsilon\,t(t - 1)(n/2 - k)\right]$$

$$E_2 = 3p^t\left[-\epsilon\,kt(n - k) + (-2\bar{p}\epsilon + \epsilon^2)\,t(t - 1)(n/2 - k) + \left(\bar{p}^2\epsilon + \tfrac{1}{3}\epsilon^3 - \bar{p}\epsilon^2\right) t(t - 1)\left(t - \tfrac{1}{2}\right)\right].$$

Since $2\epsilon\bar{p} - \epsilon^2 \ge 0$ and $\epsilon \ge 0$, by deleting the positive terms we have

$$E_1 \ge \begin{cases} 0 & \text{if } k \ge n/2 \\ 2p^t\,\bar{p}\,t(t - 1)(k - n/2) & \text{otherwise.} \end{cases}$$
- - IX CHIN AND OZSOYOGLU: AUDITING AND INFERENCE CONTROL P PROBABILITIES TABLE I ABOVE WHICH THE GAIN G(t) IS 100 AND VARYING k, t POSITIVE FOR n = 50, n=50 t 2 4 6 8 10 12 16 10 .78 .81 .85 .89 .92 .95 .97 \k 20 24 30 20 .75 .76 .80 .83 .85 .87 .90 .92 .93 .95 30 .75 .75 .79 .82 .85 ,86 .89 .91 .92 .94 40 .76 .78 .82 .84 .87 .88 .91 .92 .93 .95 45 .80 .82 .86 .88 .90 .91 .93 .95 -- -- 579 scenario for the Audit Expert is to keep tables containing t = t* that maximizes the gain for varying n, k, and p values, and to adjust the batch size accordingly. One may envision algorithms that further split the batch of t queries if Q E KS. However, as it is seen from Table II, t values that maximize the gain G(t) are very small. Thus, any expected case speed gains due to more complicated algorithms are bound to be small, and are not investigated further. B. Maximizing the Number of Answered Queries As long as the SDB is not compromised, it is desirable to provide the users with as much information as possible. To this end, after receiving a batch of queries, the Audit Expert try to answer the maximum number of queries without compromising the database. Consider the following example: the database has four records and has answered two queries, 0 .78 .81 .86 .90 .94 .9790.9 (1, 1, 1) and q2 = (1, 1,0,0). Assume the batch of 2(0 .74 .80 .76 .83 .86 .88 .91 .93 .95 .98 new queries is q3 = (0, 1,1,0), 5= 4 (0,0, 1, 1), 410 .73 .74 .77 .81 .85 .88 .83 .90 .91 .93 (1, 0, 0, 1), and i6 = (1, 0, 1, 0). The system can always 6(0 .73 .73 .77 .80 .84 .83 .87 .89 .91 .92 answer q4 because q4 EKS since i4 q - q2. If the system 810 .74 .75 .81 .78 .83 .85 .88 .90 .91 .92 has chosen to answer q6, no other queries can be answered without compromising the SDB. 
However, the SDI3 would 0 .76 .77 .81 .84 .86 .88 .90 .91 .92 .94 answer the maximum number of queries by choosing to answer q3 and q5 instead of 46- Unfortunately, the problem of answering the maximum number of users queries is NP-hard even under a very restricted situation. Define the "minimum edge-deletion bipartite3 subgraph" problem [19] and the "maximum query-answered auditing" I\ problem as follows. MEBS: Given an undirected graph G = (V, E) and a pogitive 2 * t integer k, does G have a bipartite subgraph formed by deleting Fig. 2. Typical G(t) for t > 2 and sufficiently large p. k or fewer edges? MQ: Given a set of individuals and their attribute values, a set of answered queries, a batch of new queries B and a posiSimilarly, since --1 <-2pe+e2 <0 and P22e -ie2>0 we tive integer k, does there exist a set of answerable queries have B' C B formed by deleting k or fewer queries in B? The MEBS problem has been shown NP-complete in [19]. ~~~~~n We shall prove that the MQ problem is NP-complete from the -3pt -tkt(n - k) if k ~~~~~~~2 reduction of MEBS problem. In fact, we present the stronger result that the MQ problem remains NP-complete even when E2 > the set of answered queries is null and every new query in the -3Pt [pkt(n k) + t(t 1) 2- k otherwis batch involves exactly two individuals. From now on, we consider just this restricted MQ problem (RMQ problem). Thus, Theorem S: MEBS problem a RMQ problem.4 Before we describe the construction used in the reduction, G(t)>2p w (p)+3p pw(p) we discuss the RMQ problem to enhance understanding. A set of queries, each involving attribute values of exactly two - [2n(t- 1) + 2k(n k)J +Ei +E. individuals, can be characterized by a query graph [4]. 
For any given n, k, and t values, the values of p that make the RHS of the above inequality zero give a lower bound on the probabilities p for which L1 is superior to L0 in the expected case. Table I tabulates these probabilities for n = 50 and n = 100. Clearly, for sufficiently large p, as t increases from t = 2, G(t) is expected to increase and then decrease since the terms with p^t will diminish. Thus, a typical curve of G(t) for t ≥ 2 looks as in Fig. 2. Table II lists the t values (i.e., t* in Fig. 2) that maximize the gain G(t) (with ±1 variations in t* due to errors E1 and E2) for n = 50 and n = 100. A possible

TABLE II
BATCH SIZE t THAT MAXIMIZES THE GAIN G(t) FOR n = 50, 100 AND VARYING k, p

n = 50
  k     p = 0.80   0.85   0.90   0.95
  5        -        4      7      13
  10       4        5      8      15
  15       4        5      8      16
  20       4        5      8      16
  30       4        5      8      16
  40       -        5      7      14
  45       -        4      6      11

n = 100
  k     p = 0.80   0.85   0.90   0.95
  10       4        5      8      16
  20       4        6      8      17
  50       4        6      9      18
  70       4        6      8      17
  80       4        5      8      16
  90       3        5      7      14

An undirected graph G = (V, E) is called a query graph for a database if V is the set of individuals r_i, 1 ≤ i ≤ n; and (r_i, r_j) is in E if and only if there exists a query q involving the attribute values of r_i and r_j (i.e., x_i and x_j). It is also shown in [4] that if every query involves attribute values of exactly two individuals, then a necessary and sufficient condition for a secure SDB is the nonexistence of odd cycles in the corresponding query graph. Thus, in order to protect the SDB, the corresponding query graph for the set of answered queries should be bipartite. If the query graph for the batch of new queries does not turn out to be bipartite, some queries must be deleted from the batch (or equivalently not answered) so as to make the resultant query graph bipartite. Unfortunately, the problem of deleting the minimum number of edges (queries) from the query graph (batch of new queries) in order to make the resultant graph bipartite is NP-complete.

3 A bipartite graph is defined as a graph, the vertices of which can be divided into two disjoint subsets such that no vertex in a subset is adjacent to vertices in the same subset.
4 The notation ∝ means "is polynomially reducible to."

Proof of Theorem 5: Formally, we transform MEBS to RMQ. Let G = (V, E) be the undirected graph. We shall construct a set of individuals and a batch of new queries B such that G has a bipartite subgraph by deleting k or fewer edges iff B has a subset of answerable queries of size at least |B| - k. The construction of RMQ replaces each vertex in V by an individual in the database and each edge in E by a new query, based on G as its query graph. The above discussion based on the result in [4] assures us that this is indeed the required transformation.

V. EXTENSIONS

In previous sections, we have discussed the basic principles behind the Audit Expert. There are still a few modifications which can improve its performance.

A. Time and Storage Improvement

We have discussed how the efficiency of the Audit Expert depends on the representation of KS. Each row in the KS-matrix represents a nonredundant answered query. A good representation of the KS-matrix should have the least possible overlap between the query sets corresponding to the rows of the KS-matrix. That is, it is desirable to have as small a number of nonzero entries in the KS-matrix as possible. This is analogous to having a set of "orthogonal" vectors as the basis for a vector space.

One may also observe that individuals with similar characteristics tend to be together in the answered queries. Accordingly, the KS-matrix tends to have identical columns, and it is reasonable to represent all those identical columns by a single one in order to save storage and to speed up the checking process. Consider the example in the introduction. Let {x1, x5}, {x2, x7}, {x4, x8}, and {x3, x6, x9} be the attribute value groups of employed males, unemployed males, employed females, and unemployed females, respectively. Assume the KS is represented by the set of queries {q3, q5, q6, q9}, each of which corresponds to a group of individuals. Thus, instead of storing the KS-matrix as

        x1  x2  x3  x4  x5  x6  x7  x8  x9
  q3     1   0   0   0   1   0   0   0   0
  q5     0   1   0   0   0   0   1   0   0
  q6     0   0   1   0   0   1   0   0   1
  q9     0   0   0   1   0   0   0   1   0

we can group the identical columns together and have

        G1  G2  G3  G4
  q3     1   0   0   0
  q5     0   1   0   0
  q6     0   0   1   0
  q9     0   0   0   1

where G1 = {x1, x5}, G2 = {x2, x7}, G3 = {x3, x6, x9}, and G4 = {x4, x8}.

Let us call G_i a basic group. Obviously, the union of G_1, ..., G_m is {x_1, ..., x_n}, and G_i ∩ G_j = ∅ for all i ≠ j. Moreover, all the individuals in a basic group always appear together in all the answered queries. Basically, procedure CHECK(q) is executed exactly as before. Besides the fact that the reduced KS-matrix is smaller in size, it also allows us to claim that the SDB is secure if none of the basic groups is a singleton set. However, there is an overhead for representing the KS-matrix in terms of the basic groups. If the new query splits up some basic groups, extra columns are needed for the split groups in the KS-matrix. Fortunately, as we will show, the overhead in checking and splitting basic groups takes no more than O(n) time and storage.

Procedure SPLIT(q) checks whether the new query q splits up any basic groups; if so, new basic groups are created. It labels (or flags) and identifies the basic group for every element in the query set S(q). Assume individual r_i is an element of S(q), and the corresponding attribute value x_i belongs to G_k. Procedure SPLIT(q) checks whether all individuals r_j corresponding to x_j in G_k are in S(q). If not, G_k is split into two basic groups, G_k ∩ {x_j | r_j ∈ S(q)} and G_k ∩ {x_j | r_j ∉ S(q)}. Since all basic groups are disjoint, we have the following result.

Procedure SPLIT(q)
Input:  m basic groups G_i, i = 1, 2, ..., m, stored as linked lists;
        array T(1::n), where T(i) is the index of the basic group to which x_i belongs;
        q, the new query in the form (a_1, a_2, ..., a_n), where each a_i ∈ {0, 1} and is initially unlabeled.
Output: The new set of basic groups G_i, i = 1, ..., m'.

     begin
(1)    i ← 0; m' ← m;
(2)    while i ← i + 1 ≤ n do
(3)      if a_i = 1 and unlabeled then
         begin
           k ← T(i); G_{m'+1} ← ∅;
(4)        for all x_j ∈ G_k do
             if a_j = 1 then label a_j
             else /* split the basic group */
             begin
               G_k ← G_k - {x_j};
               G_{m'+1} ← G_{m'+1} ∪ {x_j};
               T(j) ← m' + 1;
             end
(5)        if G_{m'+1} ≠ ∅ then m' ← m' + 1 endif
         end
       endwhile
     end.

Fig. 3. Procedure SPLIT.

Theorem 6: Procedure SPLIT requires O(n) time and storage to create a new set of basic groups.

Proof: Steps 3-5 of the procedure SPLIT shown in Fig. 3 are executed only when a_i is unlabeled. Since step 4 always labels all the elements in G_k ∩ {x_j | r_j ∈ S(q)} at the same time, either all or none of the elements in G_k ∩ {x_j | r_j ∈ S(q)} are labeled. Because all the basic groups are disjoint, step 4 will not label any element more than once and will not be executed more than n times. Moreover, since G_k and G_{m'+1} are stored as linked lists, updating each one at step 4 can be done in constant time. Thus, steps 3 and 5 are executed at most n times, and the total time is O(n). Since the T array and the linked lists for the G_i are of size n, the storage requirement is also O(n).

B. Protection on Some Particular Queries

On many occasions, the SDB, besides preventing the individual's information from compromise, may at the same time be required to protect the answer for a certain set of individuals, say S'. As a consequence, the answer for the query q̂, with S(q̂) = S', should be prohibited.
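Procedure SPLIT of Fig. 3 in Section V-A can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: indices are 0-based, Python sets stand in for the linked lists of the figure, and the function name `split` and its argument names are hypothetical rather than taken from the paper.

```python
def split(groups, T, a):
    """Sketch of procedure SPLIT (Fig. 3), using 0-based indices.

    groups: list of sets; groups[k] holds the indices j of the values x_j in G_k
            (sets are an assumption; the paper stores linked lists)
    T:      list where T[j] is the index of the basic group containing x_j
    a:      0/1 query vector (a_1, ..., a_n) of the new query
    groups and T are updated in place; groups is also returned.
    """
    n = len(a)
    labeled = [False] * n
    for i in range(n):
        # Step 3: only an unlabeled member of the query set triggers a scan.
        if a[i] != 1 or labeled[i]:
            continue
        k = T[i]
        outside = set()                # plays the role of G_{m'+1}
        # Step 4: scan G_k once (iterate over a copy because G_k shrinks).
        for j in list(groups[k]):
            if a[j] == 1:
                labeled[j] = True
            else:
                groups[k].discard(j)   # x_j leaves G_k ...
                outside.add(j)         # ... and joins the new group
        # Step 5: keep the new group only if the query really split G_k.
        if outside:
            groups.append(outside)
            for j in outside:
                T[j] = len(groups) - 1
    return groups
```

With the four basic groups of the running example, `groups = [{0, 4}, {1, 6}, {2, 5, 8}, {3, 7}]` and `T = [0, 1, 2, 3, 0, 2, 1, 3, 2]`, a hypothetical new query over x1, x2, and x7, i.e. `a = [1, 1, 0, 0, 0, 0, 1, 0, 0]`, leaves G2 intact and splits G1 into {x1} and {x5}. Because singleton groups now exist, the sufficient condition "no basic group is a singleton" no longer applies and the decision falls back to CHECK.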
This protection is easy to achieve if the whole set of individuals S(q̂) is grouped together and considered as a single individual. However, this protection scheme disallows any information about any proper subset of S(q̂), and is overly restrictive. On the other hand, if information about the subsets of S(q̂) is revealed without any precaution, then users may manipulate this information to obtain the answer to q̂. The known techniques seem to be unable to cope with this problem effectively and efficiently.

The problem of revealing information about subsets of S(q̂) and at the same time protecting the answer to q̂ can be solved under the present implementation of the Audit Expert with a minor modification to the procedure CHECK. Assume KS is represented by a k × n matrix B_k, q̂ is the query to be protected, q̂ ∉ KS, and q is the query to be checked for compromise and answered if the SDB is secure. If q ∈ KS, then the answer to q̂ is protected. Assume q ∉ KS; CHECK then determines whether q leads to compromise by obtaining the (k + 1) × n matrix B_{k+1} with new basis vectors b_i, 1 ≤ i ≤ k + 1, as described in step 2) in Section III.

Theorem 7: The answer to q̂ is protected, i.e., q̂ ∉ KS, if and only if q̂ ≠ Σ_{i=1}^{k+1} â_{i'} b_i.

Proof: Similar to the proof of Theorem 1.

In order to check for compromise of protected attribute values, B_{k+1} and the b_i vectors, 1 ≤ i ≤ k + 1, are computed. From the above theorem, checking whether q̂ ∈ KS takes no more than O(kn) steps. Thus the total time needed for CHECK(q), after incorporating an additional check of whether q̂ is secure, is still O(n^2).

C. Protection under a Dynamic Environment

So far we have considered a static database system without changes such as insertions, deletions, or updates of individuals. Below we consider these changes and show that the Audit Expert also works very well in a dynamic environment.

1) Insertions: This is taken care of easily by adding another column of all zeros in the KS-matrix.
The security of any other individual is not affected.

2) Deletions: There are two cases depending on whether or not the information about the deleted individual needs protection. If the attribute value of the deleted individual does not require protection and revealing it to the users does not lead to any other compromise, then the column which corresponds to that deleted individual in the KS-matrix can be eliminated permanently. Otherwise, there are no changes to the KS-matrix, and everything is processed as usual except that the deleted individual will never be involved in any other queries. Consequently, some queries which were answered may not be answerable after the deletion.

3) Updates: If the value x of an individual in the database gets changed to x', it is equivalent to an insertion of x' followed immediately by a deletion of x or vice versa. If the value of x needs protection, the new KS-matrix should have two columns, one for x and the other for x'. In practice, these two columns can be merged into one and, as a result, the KS-matrix remains unchanged. However, this implementation of updates does not protect the change between x and x'. For certain information, such as the increase in the salary of an employee or the increase in profit of a company, it is desirable to protect the amount of change too. One way to implement this is to add to the KS-matrix a new column which corresponds to the change between x and x'. Every time x' is referenced in a query, the old value x and the change (x - x') will be involved. Since the old value x and the change are protected, x' will be protected at the same time.

D. User Preknowledge

In some cases, it is possible that users have additional knowledge about the database. For example, assume the response to q' is known to the users even though q' is never asked. This knowledge can be taken care of by simply adding q' to the knowledge space KS and modifying the KS-matrix.
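The protected-query test of Section V-B and the preknowledge rule of Section V-D both reduce to asking whether a 0/1 query vector lies in the row space of the KS-matrix. The sketch below illustrates that question with plain Gaussian elimination over exact fractions. It is not the paper's CHECK procedure, which maintains a basis in a special form to reach the stated time bounds; the function names `rank`, `in_row_space`, and `may_answer` are hypothetical.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a list of equal-length numeric rows, by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    ncols = len(m[0]) if m else 0
    r = 0  # number of pivots found so far
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def in_row_space(ks_rows, q):
    """True if q is a linear combination of the answered queries in ks_rows."""
    return rank(ks_rows + [q]) == rank(ks_rows)


def may_answer(ks_rows, q, protected):
    """True if answering q keeps every protected query outside the knowledge space."""
    extended = ks_rows if in_row_space(ks_rows, q) else ks_rows + [q]
    return not any(in_row_space(extended, p) for p in protected)


# Hypothetical example with four individuals; the sum over all four is protected.
ks = [[1, 1, 0, 0]]                  # answered so far: x1 + x2
protected = [[1, 1, 1, 1]]
refused = not may_answer(ks, [0, 0, 1, 1], protected)   # x3 + x4 would complete the sum
allowed = may_answer(ks, [1, 0, 1, 0], protected)       # x1 + x3 reveals nothing about x4
```

User preknowledge fits the same picture: a response q' that users already know is simply appended to `ks` before any test is run. In the example, appending `[0, 0, 1, 1]` as preknowledge would already place the protected query in the row space.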
VI. CONCLUSION

We have discussed the basic principles behind an auditing mechanism, called the Audit Expert, in SDB's for SUM queries. We have described two procedures, CHECK to check for compromise and SPLIT to maintain a storage-efficient representation. To answer a batch of queries efficiently, a fast compromisability check is described. It is also shown that maximizing the set of answerable queries in a given batch of queries is NP-complete.

The Audit Expert does not have semantic knowledge embedded in it, but it is intelligent in the sense that it can change its auditing strategy by using different algorithms in different conditions. A discussion of these algorithms and the mechanisms for other types of queries is deferred to another paper.

ACKNOWLEDGMENT

The authors would like to thank P. Higham for her careful reading of the manuscript and the referees for their very useful comments that improved the presentation of the paper.

REFERENCES

[1] J. D. Achugbue and F. Y. Chin, "The effectiveness of output modification by rounding for protection of statistical databases," INFOR, vol. 17, no. 3, pp. 209-218, 1979.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms. Reading, MA: Addison-Wesley, 1974.
[3] L. L. Beck, "A security mechanism for statistical database," Dep. Comput. Sci., Southern Methodist Univ., 1979; see also ACM Trans. Database Syst., Sept. 1980.
[4] F. Y. Chin, "Security in statistical databases for queries with small counts," ACM Trans. Database Syst., vol. 3, no. 1, pp. 92-104, 1978.
[5] F. Y. Chin and G. Ozsoyoglu, "Security in partitioned dynamic statistical databases," in Proc. IEEE 3rd Int. Conf. Comput. Software and Applications, Nov. 1979.
[6] -, "Statistical database design," ACM Trans. Database Syst., vol. 6, no. 1, pp. 113-130, 1981.
[7] -, "Security of statistical databases," in Advances in Computer Security Management. New York: Hayden, 1980.
[8] G. Davida, D. Linton, G. Szelag, and D. Wells, "Security of statistical databases," Dep. Elec. Eng. & Comput. Sci., Univ. of Wisconsin, Milwaukee, Rep. TR-CS-76-14, 1976.
[9] R. DeMillo, D. Dobkin, and R. J. Lipton, "Combinatorial inference," in Foundations of Secure Computation, DeMillo et al., Eds. New York: Academic, 1978, pp. 27-38 (presented at a 3 day workshop, Georgia Inst. Technol., Atlanta, Oct. 1977).
[10] R. DeMillo, D. Dobkin, and R. J. Lipton, "Even databases that lie can be compromised," IEEE Trans. Software Eng., vol. SE-4, no. 1, pp. 73-75, 1978.
[11] D. E. Denning, "Are statistical databases secure?," in Proc. AFIPS NCC, vol. 47, 1978.
[12] D. E. Denning, P. J. Denning, and M. D. Schwartz, "The tracker: A threat to statistical database security," ACM Trans. Database Syst., vol. 4, no. 1, pp. 76-96, 1979.
[13] D. E. Denning, "Secure statistical databases with random sample queries," ACM Trans. Database Syst., Sept. 1980.
[14] D. E. Denning and J. Schlorer, "A fast procedure for finding a tracker in a statistical database," ACM Trans. Database Syst., vol. 5, no. 1, pp. 88-102, 1980.
[15] D. Dobkin, R. J. Lipton, and S. P. Reiss, "Aspects of the database security problem," in Proc. Conf. Theoretical Comput. Sci., Waterloo, Canada, 1977, pp. 262-274.
[16] D. Dobkin, A. K. Jones, and R. J. Lipton, "Secure databases: Protection against user inference," ACM Trans. Database Syst., vol. 4, no. 1, pp. 97-106, 1979.
[17] I. P. Fellegi, "On the question of statistical confidentiality," J. Amer. Statist. Ass., vol. 67, pp. 7-18, 1972.
[18] I. P. Fellegi and J. L. Phillips, "Statistical confidentiality: Some theory and applications to data dissemination," Ann. Econ. Soc. Measurement, vol. 3, no. 2, pp. 399-409, 1972.
[19] M. Garey, D. Johnson, and L. Stockmeyer, "Some simplified NP-complete graph problems," J. Theory Comput. Sci., vol. 1, pp. 237-267, 1976.
[20] D. K. Hsiao, D. S. Kerr, and S. E. Madnick, "Privacy and security of data communication and databases," in VLDB Proc., 1978, pp. 56-67.
[21] L. J. Hoffman and W. F. Miller, "Getting a personal dossier from a statistical data bank," Datamation, vol. 16, no. 5, pp. 74-75, 1970.
[22] L. J. Hoffman, Modern Methods for Computer Security and Privacy. Englewood Cliffs, NJ: Prentice-Hall, 1977.
[23] J. B. Kam and J. D. Ullman, "A model of statistical databases and their security," ACM Trans. Database Syst., vol. 2, no. 1, pp. 1-10, 1977.
[24] M. S. Nargundkar and W. Saveland, "Random rounding to prevent statistical disclosure," in Proc. Amer. Statist. Ass., Soc. Statist. Sec., 1972, pp. 382-385.
[25] G. Ozsoyoglu and F. Y. Chin, "Enhancing the security of statistical databases with a question-answering system and a kernel design," Dep. Comput. Sci., Univ. Alberta, Tech. Rep., 1980.
[26] G. Ozsoyoglu, "Secure statistical database design," Ph.D. dissertation, Dep. Comput. Sci., Univ. Alberta, 1980.
[27] S. P. Reiss, "Medians and database security," in Foundations of Secure Computation. New York: Academic, 1978, pp. 57-92.
[28] S. P. Reiss, "Security in databases: A combinatorial study," J. Ass. Comput. Mach., vol. 26, no. 1, pp. 45-57, 1979.
[29] J. Schlorer, "Identification and retrieval of personal records from a statistical data bank," Methods Inform. Med., vol. 14, no. 1, pp. 7-15, 1975.
[30] -, "Confidentiality of statistical records: A threat monitoring scheme for on-line dialogue," Methods Inform. Med., vol. 15, no. 1, pp. 36-42, 1976.
[31] -, "Union tracker and open statistical databases," TB-IMSD 1/78, Inst. Med. Statist. Dok., Univ. Giessen, 1979.
[32] -, "Security of statistical databases: Multidimensional transformation," ACM Trans. Database Syst., vol. 6, no. 1, pp. 85-112, 1981.
[33] D. E. Denning and P. J. Denning, "Linear queries in statistical databases," ACM Trans. Database Syst., vol. 4, no. 2, pp. 156-167, 1979.
[34] M. Stonebraker and K. Keller, "Embedding expert knowledge and hypothetical data bases into a data base system," in ACM SIGMOD Proc., 1980, pp. 58-66.
[35] C. T. Yu and F. Y. Chin, "A study on the protection of statistical databases," in Proc. ACM SIGMOD Int. Conf. Management of Data, 1977, pp. 169-181.

Francis Y. Chin (S'71-M'76), for a photograph and biography, see p. 234 of the May 1982 issue of this TRANSACTIONS.

Gultekin Ozsoyoglu (S'79-M'80), for a photograph and biography, see p. 234 of the May 1982 issue of this TRANSACTIONS.