Inferences Control in Data Bases
A. BEKHECHI
ID-Partners
ABSTRACT: Inferences are of major concern for database managers, particularly
in statistical data bases. This problem became apparent when it was discovered
that a number of requests on a given set of data could lead to the
determination of otherwise hidden data. The search for methods to protect
databases from inferences led first to "tracker" type methods (or restrictive
methods), then to perturbation methods. The intensive use of statistical data
analysis methods (linear regression and principal components analysis) on
databases has resulted in another important source of multidimensional
inferences that are difficult to detect in data bases.
The objective of this paper is to present a solution for protecting data bases
against inferences using data analysis methods, specifically discriminant
analysis. The solution is based on the discriminating power of each attribute
in the protected data base.
Keywords: Databases, Security, Confidentiality, Data Flow, Statistical
Analysis.
1/ Introduction
Security in a database consists of two distinct levels. The first level
concerns functional security whereas the second level is concerned with data
protection. Functional security includes database access rights as well as
what will happen in the case of a computer breakdown or software error.
In relational database management systems this security level is well
assured and is based on already proven solutions used in operating systems,
such as passwords to control access and specific mechanisms like recovery.
The second security level is the cause of much concern among systems managers
and systems developers because of the confidentiality of certain information
managed by the RDBMS. The seriousness of this problem was first mentioned in
the 1970's by HOFFMAN and MILLER [HOF 70]. This worry is even greater in a
statistical database (SDB) because of the existence of two types of bases:
micro databases and macro databases. In the following sections, we will
present the current state of the art and our work on this subject.
2/ State of the art
Inference control methods can be divided into two distinct categories:
restrictive methods and perturbation methods. The phenomenon of inference is
possible only when an "intruder" possesses a prior knowledge of the
database's contents [MCL 86,89]. Work concerning restrictive methods is well
presented in the article by DENNING and SCHLORER [DES 83]. In the following
section inference control methods based on perturbation will be presented.
3/ Perturbation methods
Perturbation methods can be divided into three categories: the perturbation
of selected records in the characterised set, the perturbation of the
"perturbation sensitive" data values and the perturbation of query results.
Data perturbation can take place in two different ways: constant over time
or variable.
The main "constant over time" perturbation methods are based on systematic
rounding techniques in which the systematic intervals and the random
modifications are fixed in time. These techniques offer very little
resistance to attacks of the "tracker" type [DDS 79]. Perturbations which
vary in time are mainly represented by random rounding and random interval
methods. The random difference method with zero average, used in the sum
query, is also important. These techniques give unbiased results. They are
resistant to "trackers" but, in cases where their randomnesses are
independent, offer little resistance to inferences made by averaging the
results of successive equivalent queries [CUC 85]. One method oriented
towards micro databases is based on "swapping" [SCH 81, REI 84].
Swapping consists of exchanging certain values of "sensitive" data between
the different records of the base. The only constraint which can prevent
swapping lies in the preservation of certain statistical indicators of the
sensitive data (average, variance, standard deviation, etc.). The interest
of this solution is that only a certain margin of error is tolerated in the
consultation of statistical databases [MCL 89]. Another proposed
perturbation method is based on the probabilistic perturbation of data. For
each record in the base, a Gaussian white noise of average zero and standard
deviation $\sigma$ is added to the real value of the sensitive data. The
"noised" value is then physically stored. The statistical characteristics
(average and standard deviation) of the noise guarantee the coherence of the
results obtained by the queries. We note:
- $d = (d_1, d_2, \ldots, d_n)$ the vector of real values of the sensitive
  data;
- $d' = (d'_1, d'_2, \ldots, d'_n)$ the vector of "noised" values of the same
  data;
- $|C|$ the number of records of the characterised set.
In applying Chebyshev's inequality we obtain the following formula:
$$\mathrm{Prob}\left( |S(d') - S(d)| \ge |C|\,\varepsilon \right) \le \frac{\sigma^2}{|C|\,\varepsilon^2}$$
This formula bounds the probability that the relative error (the result
obtained, $S(d')$, minus the real result, $S(d)$, divided by $|C|$) is
greater than the given value $\varepsilon$. This probability is inferior to
a number which depends on the standard deviation, the relative error and the
number of records in the characterised set.
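The bound above can be checked numerically. The following is only a minimal sketch under stated assumptions: the function names (`perturb`, `chebyshev_bound`, `empirical_rate`) and the salary values are made up for illustration, and the noise is redrawn many times solely to estimate the probability, whereas in the method described the noise is drawn once and the "noised" value is stored.

```python
import random

def perturb(values, sigma, rng):
    """Return a 'noised' copy: each value plus zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def chebyshev_bound(sigma, set_size, eps):
    """Upper bound on Prob(|S(d') - S(d)| >= set_size * eps)."""
    return sigma ** 2 / (set_size * eps ** 2)

def empirical_rate(values, sigma, eps, trials, seed=0):
    """Fraction of trials in which the error per record reaches eps."""
    rng = random.Random(seed)
    true_sum = sum(values)
    hits = 0
    for _ in range(trials):
        noisy_sum = sum(perturb(values, sigma, rng))
        if abs(noisy_sum - true_sum) >= len(values) * eps:
            hits += 1
    return hits / trials

# Hypothetical sensitive values for a characterised set of 50 records.
salaries = [float(1000 + 10 * i) for i in range(50)]
bound = chebyshev_bound(sigma=20.0, set_size=len(salaries), eps=5.0)
rate = empirical_rate(salaries, sigma=20.0, eps=5.0, trials=2000)
print(f"Chebyshev bound: {bound:.3f}, observed rate: {rate:.3f}")
```

With these assumed parameters the bound is 400 / (50 * 25) = 0.32. Chebyshev's inequality is distribution-free, so the observed rate for Gaussian noise is expected to sit well below it.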
This solution is not practical when the database is too large. We end this
section by presenting a method based on a combination of perturbation
methods and restriction methods [CHP 86]. This method consists of splitting
the base into sets constructed from the characteristics of the indexed data
in the base. These different sets are called Basic Query Sets (BQS). These
BQS are ordered in a tree-like structure. They stay accessible to the user
and are called Sufficient Query Sets (SQS) if their values have a certain
cardinality. If not, they are called Insufficient Query Sets (IQS). A BQS is
defined in this solution as the union of several SQS and IQS.
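The cardinality test that separates sufficient from insufficient query sets can be illustrated as follows. This is a rough sketch and not the method of [CHP 86]: the flat grouping by indexed attributes, the threshold `min_size`, the function names and the sample records are all assumptions, and the tree-like ordering of the sets is not modelled.

```python
from collections import defaultdict

def basic_query_sets(records, indexed_attrs):
    """Group records by their values on the indexed attributes."""
    sets = defaultdict(list)
    for rec in records:
        key = tuple(rec[a] for a in indexed_attrs)
        sets[key].append(rec)
    return dict(sets)

def classify(sets, min_size):
    """Split the groups into sufficient and insufficient query sets."""
    sqs = {k: v for k, v in sets.items() if len(v) >= min_size}
    iqs = {k: v for k, v in sets.items() if len(v) < min_size}
    return sqs, iqs

# Hypothetical records; "dept" and "sex" play the role of indexed data.
records = [
    {"dept": "A", "sex": "F", "salary": 100},
    {"dept": "A", "sex": "F", "salary": 120},
    {"dept": "A", "sex": "M", "salary": 110},
    {"dept": "B", "sex": "F", "salary": 130},
    {"dept": "B", "sex": "F", "salary": 90},
    {"dept": "B", "sex": "F", "salary": 95},
]
groups = basic_query_sets(records, ["dept", "sex"])
sqs, iqs = classify(groups, min_size=2)
print(sorted(sqs))  # groups large enough to be queried
print(sorted(iqs))  # groups too small to be queried
```

Here the single record with dept "A" and sex "M" ends up in an insufficient set, since any statistic computed on it alone would disclose that individual's value.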
In conclusion, it can be said that research has long been concerned only with
queries which use simple statistical operators. However, the intensive use
of data analysis in forecasting and decision support systems leads to
multidimensional inferences. The work of PALLEY [PAL 86,87] on correlation
and linear regression and the diophantine equations of ROWE [ROW 84] show
that there is a danger of these methods not retaining the confidentiality
of individual data. For this reason, we propose a solution based on
discrimination techniques to prevent the deduction of confidential
information. The choice of discrimination methods is justified by their
intensive use in fields as diverse as medicine (diagnostic aid), meteorology
(catastrophe forecasting) and marketing (client selection). In the following
two sections, we will begin by presenting discrimination methods and some
possibilities of inference and will conclude with the proposed solution.
4/ Discriminant Analysis
4.1/ Properties of Discriminant Analysis
Discriminant analysis calls for a set of data which is made up of a set of
N quantitative variables, also called explanatory variables, and a
qualitative variable of M modalities which is the variable to be explained.
The data set is sorted into M classes (or sets) which are defined using the
value of the qualitative variable being observed. There are two distinct but
complementary research methods based on discriminant analysis. The first
takes a descriptive viewpoint. It works by finding a linear combination of
quantitative variables which explains the grouping. The second method
consists of assigning a new observation, which is defined only by its
quantitative variables, to one of the existing classes.
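Both uses of discriminant analysis can be sketched for the simplest case of two classes and two quantitative variables. The code below uses Fisher's linear discriminant with a midpoint threshold, which is one common formulation and is assumed here rather than taken from the paper; the data and function names are invented for illustration.

```python
def mean_vector(rows):
    """Column-wise mean of a list of 2-element observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_scatter(classes, means):
    """Within-class scatter matrix (2x2), summed over the classes."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in zip(classes, means):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Weights w = W^-1 (mean_a - mean_b) and the midpoint threshold."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    w_mat = pooled_scatter([class_a, class_b], [ma, mb])
    det = w_mat[0][0] * w_mat[1][1] - w_mat[0][1] * w_mat[1][0]
    inv = [[w_mat[1][1] / det, -w_mat[0][1] / det],
           [-w_mat[1][0] / det, w_mat[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(ma[0] + mb[0]) / 2, (ma[1] + mb[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold

def assign(obs, w, threshold):
    """Assign a new observation to class 'A' or 'B' from its score."""
    score = w[0] * obs[0] + w[1] * obs[1]
    return "A" if score > threshold else "B"

# Hypothetical observations: two quantitative variables per record.
class_a = [(5.0, 1.0), (6.0, 2.0), (7.0, 1.5), (6.5, 2.5)]
class_b = [(1.0, 4.0), (2.0, 5.0), (1.5, 6.0), (2.5, 5.5)]
w, t = fisher_direction(class_a, class_b)
print(assign((6.0, 1.0), w, t), assign((1.0, 5.0), w, t))
```

`fisher_direction` corresponds to the descriptive step (the linear combination that explains the grouping) and `assign` to the second step (placing a new observation in an existing class).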
4.2/ Discrimination and Inference
To illustrate the phenomenon of inference by discrimination, we consider a
medical database made up of P observations containing N quantitative
variables (Atti) and a qualitative variable (Diagnostic). The database can
be described as follows: Obs(Att1, Att2, Att3, ..., AttN, Diagnostic). In
applying discriminant analysis to this base, a discriminant function which
is a linear combination of the Atti is obtained. The calculation of the
different statistical indicators of the Atti (average, variance, standard
deviation, etc.) gives the user an important knowledge of the different
initial classes. A very small variance of a quantitative variable in a given
class will inform the user that its values are strongly concentrated. The
inferential aspect of discrimination is assured by a classification
procedure defined using the discriminant function obtained in the
descriptive phase. This classification procedure generates a classification
function which is validated, in general, by multiple regression, which is
itself a source of inference. Discrimination can also be used on qualitative
variables [SAP 77].
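The way class statistics leak information can be shown on a toy version of the Obs(Att1, ..., Diagnostic) base. This is an illustrative sketch only: the rows, the diagnostic labels and the nearest-mean rule in `infer_diagnostic` are assumptions made for the example, with a single attribute standing in for the N attributes.

```python
from statistics import mean, pvariance

# Made-up rows of the form (Att1, Diagnostic); real bases hold N attributes.
rows = [
    (4.1, "flu"), (4.0, "flu"), (4.2, "flu"), (4.1, "flu"),
    (7.9, "asthma"), (6.0, "asthma"), (9.5, "asthma"), (7.0, "asthma"),
]

def class_profile(rows):
    """Mean and variance of Att1 inside each Diagnostic class."""
    by_class = {}
    for value, label in rows:
        by_class.setdefault(label, []).append(value)
    return {label: (mean(vals), pvariance(vals))
            for label, vals in by_class.items()}

def infer_diagnostic(value, profile):
    """Guess the class whose mean is nearest in standard-deviation units."""
    def distance(label):
        m, var = profile[label]
        return abs(value - m) / (var ** 0.5)
    return min(profile, key=distance)

profile = class_profile(rows)
for label, (m, var) in profile.items():
    print(f"{label}: mean={m:.2f} variance={var:.3f}")
print(infer_diagnostic(4.15, profile))
```

The "flu" class has a far smaller variance than the "asthma" class, so a user who knows only that an individual's Att1 is close to 4.1 can guess the confidential diagnostic with some confidence.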
5/ Proposed solution
Classic procedures of discrimination do not perform well in database
environments. In order to overcome this inconvenience, methods of stepwise
discrimination are used to search for the most discriminating variables.
This search is carried out in several steps. At each step, the most
discriminating variable is determined by its ability to discriminate,
measured with the criteria mainly used in stepwise methods [LAG 83].
The steps needed to determine this variable are as follows:
a/ search for the variable with the highest discriminating ability, that is
the variable which best expresses the discrimination used;
b/ the same search as above on the remaining variables.
The search is stopped when the increase in the discriminating ability of
the selected variable, compared with that of the variable selected in the
previous step, begins to slow down. At the end of the search there are two
groups of variables: the first group contains variables with a high
discriminating ability and the second group consists of variables with low
discriminating ability. Other types of tests, such as "badly classed", can
also be used to stop the procedure [STS 89]. From these two sub-groups a
framework for database protection can be deduced. At this level
perturbation methods can play an important role.
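The stepwise search and the resulting two groups of variables can be sketched as follows. The text does not name the criterion, so Wilks' lambda (determinant of the within-class scatter over determinant of the total scatter) is assumed here. The stopping threshold `min_gain`, the function names and the data are also hypothetical.

```python
def scatter(rows, cols):
    """Scatter matrix of the chosen columns around their own mean."""
    n = len(rows)
    means = [sum(r[c] for r in rows) / n for c in cols]
    return [[sum((r[a] - ma) * (r[b] - mb) for r in rows)
             for b, mb in zip(cols, means)]
            for a, ma in zip(cols, means)]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size, result = len(m), 1.0
    for i in range(size):
        pivot = max(range(i, size), key=lambda k: abs(m[k][i]))
        if abs(m[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            result = -result
        result *= m[i][i]
        for k in range(i + 1, size):
            factor = m[k][i] / m[i][i]
            for j in range(i, size):
                m[k][j] -= factor * m[i][j]
    return result

def wilks_lambda(classes, cols):
    """det(within scatter) / det(total scatter); smaller separates better."""
    within = None
    for rows in classes:
        s = scatter(rows, cols)
        within = s if within is None else [
            [x + y for x, y in zip(ra, rb)] for ra, rb in zip(within, s)]
    total = scatter([r for rows in classes for r in rows], cols)
    return det(within) / det(total)

def stepwise(classes, n_vars, min_gain=0.05):
    """Forward selection; stop when lambda improves by less than min_gain."""
    selected, current = [], 1.0
    remaining = list(range(n_vars))
    while remaining:
        best = min(remaining,
                   key=lambda v: wilks_lambda(classes, selected + [v]))
        new = wilks_lambda(classes, selected + [best])
        if current - new < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = new
    return selected, remaining

# Hypothetical data: variable 0 separates the classes, 1 and 2 do not.
class_a = [(1.0, 5.0, 3.0), (1.2, 4.0, 7.0), (0.8, 6.0, 5.0), (1.1, 5.5, 4.0)]
class_b = [(3.0, 5.2, 4.0), (3.2, 4.4, 6.0), (2.8, 6.1, 3.5), (3.1, 5.0, 5.5)]
high, low = stepwise([class_a, class_b], n_vars=3)
print("high discriminating ability:", high)
print("low discriminating ability:", low)
```

On this data the search keeps variable 0 and stops, leaving variables 1 and 2 in the low group. Under the framework suggested in the text, the high group is the natural candidate for perturbation, although how each group should be treated is left open here.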
6/ SAS/STAT and Inference
In classic database management systems, the problem of inference control
remains to be solved. The only existing protection consists of database
access rights and logical view mechanisms [CHA 75]. These are managed by
the Administrator of the database (DBA). The implementation of the control
techniques presented will require a significant investment. Because of its
integrated modular approach, the SAS system already responds to the
problems posed by inference in the database. The different interfaces
between SAS and database systems (Oracle, Db2, Rdb, etc.) solve the problem
of the multidimensional inference by the use of the discrimination
procedures CANDISC, DISCRIM and STEPDISC of SAS/STAT.
References
[CHA 75]
D.CHAMBERLIN & J.GRAY & I.TRAIGER
"Views, authorization and locking in relational database system"
Proceedings of the National Computer Conference, 1975, pp 425-430
[CHP 86]
Y.H.CHIN & W.L.PENG
"An Evaluation Of Two New Inference Control Methods"
Proceedings of the Third international workshop on statistical and
scientific databases, pp 294-302
[CUC 85]
K.L.CHONG & J.C.CHOI & J.L.CHUNG
"A Data Distortion by Probability Distribution"
ACM, TODS, Vol 10, n°3, Sep 1985, pp 395-411
[DDS 79]
D.E.DENNING & P.J.DENNING & M.D.SCHWARTZ
"The tracker: A threat to statistical database security"
ACM, TODS, Vol 4, n°1, Mar 1979, pp 76-96
[DES 83]
D.E.DENNING & J.SCHLORER
"Inference controls for statistical databases"
Computer IEEE, Vol 16, n°7, Jul 1983, pp 69-82
[HOF 70]
L.J.HOFFMAN & W.F.MILLER
"Getting a personal dossier from a statistical data bank"
Datamation, Vol 16, n°5, May 1970, pp 74-75
[LAG 83]
J.DE LAGARDE
"Initiation à l'Analyse des données"
Ed Dunod, 1983, Paris
[MCL 86]
M.Mc LEISH
"Prior knowledge and the Security of a Dynamic Statistical Database"
Proceedings of the Third international workshop on statistical and
scientific databases, pp 303-305
[MCL 89]
M.Mc LEISH
"Further Results on the Security of Partitioned Dynamic Statistical
Databases"
ACM, TODS, Vol 14, n°1, Mar 1989, pp 98-113
[PAL 86]
M.A.PALLEY
"Security of Statistical Databases Compromise Through Attribute
Correlational Modeling"
Second intern conf on data engineering, IEEE, Los Angeles, 1986, pp 67-74
[PAL 87]
M.A.PALLEY & J.S.SIMONOFF
"The Use of Regression Methodology for the Compromise of Confidential
Information in Statistical Databases"
ACM, TODS, Vol 12, n°4, Dec 1987, pp 593-608
[REI 84]
S.P.REISS
"Practical Data Swapping: The First Step"
ACM, TODS, Vol 9, n°1, 1984, pp 20-37
[ROW 84]
N.ROWE
"Diophantine Inferences from Statistical Aggregates on Few-valued
Attributes"
ACM-SIGMOD, International Conf on Management of Data, Boston 1984
[SAP 77]
G.SAPORTA
"Une méthode et un programme d'analyse discriminante pas à pas sur
variables qualitatives"
Colloque IRIA: Analyse des données et informatique, Vol 1, pp 201-210
[STS 89]
"Discriminant Analysis and Clustering"
Statistical Science, Vol 4, n°1, pp 34-69
[TYW 84]
J.F.TRAUB & Y.YEMINI & H.WOZNIAKOWSKI
"The statistical security of statistical database"
ACM, TODS, Vol 9, n°4, Dec 1984, pp 672-679