Inferences Control in Data Bases

A. BEKHECHI
ID-Partners

ABSTRACT: Inferences are of major concern for database managers, particularly in statistical databases. This problem became apparent when it was discovered that a number of requests on a given set of data could lead to the determination of otherwise hidden data. The search for methods to protect databases from "tracker" type inferences led first to restrictive methods and then to perturbation methods. The intensive use of multidimensional statistical analysis (linear regression, principal components analysis) on databases has resulted in another important source of inferences that are difficult to detect. The objective of this paper is to present a solution for protecting databases against inferences using data analysis methods, specifically discriminant analysis. The solution is based on the discriminating power of each attribute in the protected database.

Keywords: Databases, Confidentiality, Security, Data Flow, Statistical Analysis.

1/ Introduction

Security in a database consists of two distinct levels. The first level concerns functional security, whereas the second level is concerned with data protection. Functional security includes database access rights as well as what will happen in the case of a computer breakdown or software error. In relational database management systems (RDBMS) this security level is well assured and is based on proven solutions already used in operating systems, such as passwords to control access and specific mechanisms like recovery.

The second security level is the cause of much concern among systems managers and systems developers because of the confidentiality of certain information managed by the RDBMS. This worry was first mentioned in the 1970s by HOFFMAN and MILLER [HOF 70]. The problem is even more serious in a statistical database (SDB) because of the existence of two types of bases: micro databases and macro databases. In the following sections, we will present the current state of the art and our work on this subject.

2/ State of the art

Inference control methods can be divided into two distinct categories: restrictive methods and perturbation methods. The phenomenon of inference is possible only when an "intruder" possesses prior knowledge of the database's contents [MCL 86,89]. Work concerning restrictive methods is well presented in the article by DENNING and SCHLORER [DES 83]. In the following section, inference control methods based on perturbation will be presented.

3/ Perturbation methods

Perturbation methods can be divided into three categories: the perturbation of selected records in the characterised set, the perturbation of "sensitive" data values, and the perturbation of the query results.

Data perturbation can take place in two different ways: constant over time or variable. The main "constant over time" perturbation methods are based on systematic rounding techniques in which the rounding intervals and the random modifications are fixed in time. These techniques offer very little resistance to attacks of the "tracker" type [DDS 79].

Perturbations which vary in time are mainly represented by random rounding and random interval methods. The method of adding a random difference of average zero, used in the sum query, is also important. These techniques give non-biased results. They are resistant to "trackers" but offer little resistance to inferences by averaging the results of successive equivalent queries in cases where their randomnesses are independent [CUC 85].

One method oriented towards micro databases is based on "swapping" [SCH 81, REI 84]. Swapping consists of exchanging certain values of "sensitive" data between the different records of the base. The only constraint which can prevent swapping lies in the preservation of certain statistical indicators of the sensitive data (average, variance, standard deviation, etc.).
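As a rough illustration of swapping, the following Python sketch shuffles the values of one sensitive field between records. It makes no attempt to reproduce the published swapping procedures; the function name `swap_sensitive`, the record layout and the sample salaries are all assumptions made for the example. Shuffling a single field leaves its average, variance and standard deviation unchanged, since the same values are simply reassigned to other records, but its relationship with the other fields is not preserved.

```python
import random
import statistics


def swap_sensitive(records, field, seed=0):
    """Return copies of `records` in which the values of `field`
    have been shuffled between records (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    values = [r[field] for r in records]
    rng.shuffle(values)                # reassign the same values to other records
    return [{**r, field: v} for r, v in zip(records, values)]


# Illustrative micro database: one record per individual.
records = [
    {"id": 1, "dept": "A", "salary": 1200},
    {"id": 2, "dept": "A", "salary": 1850},
    {"id": 3, "dept": "B", "salary": 2400},
    {"id": 4, "dept": "B", "salary": 3100},
    {"id": 5, "dept": "C", "salary": 1500},
]

swapped = swap_sensitive(records, "salary")

original_values = [r["salary"] for r in records]
swapped_values = [r["salary"] for r in swapped]

# The statistical indicators of the sensitive field are preserved.
print(statistics.mean(original_values), statistics.mean(swapped_values))
print(statistics.pvariance(original_values), statistics.pvariance(swapped_values))
```

A query on the average or variance of `salary` over the whole base returns the same result before and after the swap, while the salary attached to any given `id` can no longer be trusted.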
Swapping is attractive because a certain margin of error is tolerated in the consultation of statistical databases [MCL 89].

Another proposed perturbation method is based on the probabilistic perturbation of data. For each record in the base, a Gaussian white noise of average zero and standard deviation $\sigma$ is added to the real value of the sensitive data. The "noised" value is then physically stored. The statistical characteristics (average and standard deviation) of the noise guarantee the coherence of the results obtained by the queries. We note:

- $d = (d_1, d_2, \ldots, d_n)$ the vector of real values of the sensitive data;
- $d' = (d'_1, d'_2, \ldots, d'_n)$ the vector of "noised" values of the same data;
- $|C|$ the number of records of the characterised set.

Writing $S(d)$ and $S(d')$ for the result of the query on the real values and on the "noised" values, and applying Chebyshev's inequality, we obtain the following formula:

$$\mathrm{Prob}\left\{ \frac{|S(d') - S(d)|}{|C|} \ge \varepsilon \right\} \le \frac{\sigma^2}{|C|\,\varepsilon^2}$$

This formula states that the probability that the relative error (the result obtained minus the real result, divided by $|C|$) is greater than a given value $\varepsilon$ is bounded by a number which depends on the standard deviation, the relative error and the number of records in the characterised set. This solution is not practical when the database is too large.

We end this section by presenting a method based on a combination of perturbation methods and restriction methods [CHP 86]. This method consists of splitting the base into characterised sets constructed using the indexed data in the base. These different sets are called Basic Query Sets (BQS). The BQS are ordered in a tree-like structure. They stay accessible to the user and are called Sufficient Query Sets (SQS) if they have a certain cardinality. If not, they are called Insufficient Query Sets (IQS). A BQS is defined in this solution as the union of several SQS and IQS.

In conclusion, it can be said that research has long been concerned only with queries which use simple statistical operators.
However, the intensive use of data analysis in forecasting and decision support systems leads to multidimensional inferences. The work of PALLEY [PAL 86,87] on correlation and linear regression and the Diophantine equations of ROWE [ROW 84] show that there is a danger of these methods not retaining the confidentiality of individual data. For this reason, we propose a solution based on discrimination techniques to prevent the deduction of confidential information. The choice of discrimination methods is justified by their intensive use in fields as diverse as medicine (diagnostic help), meteorology (catastrophe forecasting) and marketing (client selection). In the following two sections, we will begin by presenting discrimination methods and some possibilities of inference and will conclude with the proposed solution.

4/ Discriminant Analysis

4.1/ Properties of Discriminant Analysis

Discriminant analysis calls for a set of data which is made up of a set of N quantitative variables, also called explanatory variables, and a qualitative variable of M modalities, which is the variable to be explained. This set of data is sorted into M classes (or sets), which are defined using the value of the qualitative variable being observed.

There are two distinct but complementary approaches based on discriminant analysis. The first is descriptive: it works by finding a linear combination of the quantitative variables which explains the grouping. The second consists of assigning a new observation, which is defined only by its quantitative variables, to one of the existing classes.

4.2/ Discrimination and Inference

To illustrate the phenomenon of inference by discrimination, we consider a medical database made up of P observations containing N quantitative variables (Atti) and a qualitative variable (Diagnostic).
The database can be described as follows:

Obs(Att1, Att2, Att3, ..., AttN, Diagnostic)

In applying discriminant analysis to this base, a discriminant function which is a linear combination of the Atti is obtained. The calculation of the different statistical indicators of the Atti (average, variance, standard deviation, etc.) gives the user important knowledge of the different initial classes. A very small variance of a quantitative variable in a given class will inform the user that its values are strongly concentrated.

The inferential aspect of discrimination is assured by a classification procedure defined using the discriminant function obtained in the descriptive phase. This procedure generates a classification function which is validated, in general, by multiple regression, which is itself a source of inference. Discrimination can also be used on qualitative variables [SAP 77].

5/ Proposed solution

Classic procedures of discrimination do not perform well in database environments. In order to overcome this drawback, methods of stepwise discrimination are used to search for the most discriminating variables. This search is carried out in several steps. At each step, the most discriminating variable is determined by its ability to discriminate, the criterion mainly used in stepwise methods [LAG 83]. The steps needed to determine this variable are as follows:

a/ search for the variable with the highest discriminating ability, that is, the variable which best expresses the discrimination used;
b/ the same search as above on the remaining variables.

The search is stopped when the increase in the discriminating ability of the selected variable, compared with that of the variable selected in the previous step, begins to slow down. At the end of the search there are two groups of variables: the first group contains variables with a high discriminating ability and the second group consists of variables with low discriminating ability.
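The search described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure of [LAG 83] or of any statistical package: the discriminating ability of a variable is measured here by a univariate ratio of between-class to within-class sums of squares, and the "slow down" rule is modelled by a hypothetical threshold `min_ratio` on the score of the current variable relative to the previously selected one. The function names and the small medical table are invented for the example.

```python
from statistics import mean


def fisher_ratio(values, labels):
    """Between-class over within-class sum of squares for one variable
    (assumed stand-in for the discriminating ability)."""
    overall = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return between / within if within > 0 else float("inf")


def stepwise_split(table, labels, min_ratio=0.5):
    """Split the variables of `table` (name -> list of values) into a
    group with high and a group with low discriminating ability."""
    remaining = {name: fisher_ratio(col, labels) for name, col in table.items()}
    high, previous = [], None
    while remaining:
        # a/ and b/: pick the most discriminating remaining variable
        best = max(remaining, key=remaining.get)
        score = remaining[best]
        # stop when the score drops sharply compared with the previous step
        if previous is not None and score < min_ratio * previous:
            break
        high.append(best)
        previous = score
        del remaining[best]
    low = sorted(remaining, key=remaining.get, reverse=True)
    return high, low


# Illustrative observations: three quantitative variables and a Diagnostic.
diagnostic = ["flu", "flu", "flu", "cold", "cold", "cold"]
table = {
    "temperature": [39.1, 39.4, 38.9, 37.0, 37.2, 36.9],
    "age": [30, 52, 41, 33, 50, 44],
    "pulse": [95, 99, 97, 78, 80, 76],
}

high, low = stepwise_split(table, diagnostic)
print("high discriminating ability:", high)
print("low discriminating ability:", low)
```

On this toy table the temperature and the pulse separate the two diagnostics well and fall into the first group, while the age falls into the second. Because the score used here is computed on each variable separately, the sketch ignores redundancy between variables, which a real stepwise method would take into account.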
Other types of tests, such as the "badly classed" test, can also be used to stop the procedure [STS 89]. From these two sub-groups a framework for database protection can be deduced. At this level perturbation methods can play an important role.

6/ SAS/STAT and Inference

In classic database management systems, the problem of inference control remains to be solved. The only existing protection consists of database access rights and logical view mechanisms [CHA 75]. These are managed by the database administrator (DBA). The implementation of the control techniques presented will require a substantial investment.

Because of its integrated modular approach, the SAS system already responds to the problems posed by inference in the database. The different interfaces between SAS and database systems (Oracle, DB2, Rdb, etc.) help to solve the problem of multidimensional inference by the use of the discrimination procedures CANDISC, DISCRIM and STEPDISC of SAS/STAT.

References

[CHA 75] D. CHAMBERLIN & J. GRAY & I. TRAIGER, "Views, authorization and locking in a relational database system", Proceedings of the National Computer Conference, 1975, pp 425-430

[CHP 86] Y.H. CHIN & W.L. PENG, "An Evaluation of Two New Inference Control Methods", Proceedings of the Third International Workshop on Statistical and Scientific Databases, pp 294-302

[CUC 85] K.L. CHONG & J.C. CHOI & J.L. CHUNG, "A Data Distortion by Probability Distribution", ACM TODS, Vol 10, n°3, Sep 1985, pp 395-411

[DDS 79] D.E. DENNING & P.J. DENNING & M.D. SCHWARTZ, "The tracker: A threat to statistical database security", ACM TODS, Vol 4, n°1, Mar 1979, pp 76-96

[DES 83] D.E. DENNING & J. SCHLORER, "Inference controls for statistical databases", IEEE Computer, Vol 16, n°7, Jul 1983, pp 69-82

[HOF 70] L.J. HOFFMAN & W.F. MILLER, "Getting a personal dossier from a statistical data bank", Datamation, Vol 16, n°5, May 1970, pp 74-75

[LAG 83] J. DE LAGARDE, "Initiation à l'Analyse des données", Ed. Dunod, Paris, 1983

[MCL 86] M. McLEISH, "Prior knowledge and the Security of a Dynamic Statistical Database", Proceedings of the Third International Workshop on Statistical and Scientific Databases, pp 303-305

[MCL 89] M. McLEISH, "Further Results on the Security of Partitioned Dynamic Statistical Databases", ACM TODS, Vol 14, n°1, Mar 1989, pp 98-113

[PAL 86] M.A. PALLEY, "Security of Statistical Databases Compromise Through Attribute Correlational Modeling", Second International Conference on Data Engineering, IEEE, Los Angeles, 1986, pp 67-74

[PAL 87] M.A. PALLEY & J.S. SIMONOFF, "The Use of Regression Methodology for the Compromise of Confidential Information in Statistical Databases", ACM TODS, Vol 12, n°4, Dec 1987, pp 593-608

[REI 84] S.P. REISS, "Practical Data Swapping: The First Step", ACM TODS, Vol 9, n°1, 1984, pp 20-37

[ROW 84] N. ROWE, "Diophantine Inferences from Statistical Aggregates on Few-valued Attributes", ACM-SIGMOD International Conference on Management of Data, Boston, 1984

[SAP 77] G. SAPORTA, "Une méthode et un programme d'analyse discriminante pas à pas sur variables qualitatives", Colloque IRIA: Analyse des données et informatique, Vol 1, pp 201-210

[STS 89] "Discriminant Analysis and Clustering", Statistical Science, Vol 4, n°1, pp 34-69

[TYW 84] J.F. TRAUB & Y. YEMINI & H. WOZNIAKOWSKI, "The statistical security of a statistical database", ACM TODS, Vol 9, n°4, Dec 1984, pp 672-679