Transcript
Tel Aviv
April 29th, 2007
Disclosure Limitation from a Statistical Perspective
Natalie Shlomo
Dept. of Statistics, Hebrew University
Central Bureau of Statistics
1
Topics of Discussion
1. Introduction and Motivation
2. Disclosure risk – data utility decision problem
3. Assessing disclosure risk
4. Methods for masking statistical data
- microdata
- tabular data
5. Assessing information loss
2
Statistical Data
Sources of Statistical Data:
• Census – full enumeration of the population
• Administrative – data collected by Government Agencies for
other purposes, e.g. tax records, population register
• Survey – random sample out of the population. Each unit
in the sample is assigned a sampling weight. Often the
population is unknown.
SDC Approach – “Safe Data” versus “Safe Settings”
• Microdata Review Panels need to make informed
decisions on releasing microdata and mode of access
3
Assessing Disclosure Risk
Physical disclosure – disclosure from a breach of physical
security, e.g. stolen questionnaires or a computer hacker
Statistical disclosure – disclosure from statistical outputs
Disclosure risk scenarios - assumptions about information or IT tools
available to an intruder that increase the probability of disclosure,
e.g. matching to external files or spontaneous recognition
Key - combination of indirect identifying variables, such as sex, age,
occupation, place of residence, country of birth and year of
immigration, marital status, etc.
4
Types of Statistical Disclosure
Identity disclosure - an intruder identifies a data subject
Confidentiality pledges and codes of practice:
“…no statistics will be produced that are likely to identify an
individual unless specifically agreed with them”
Individual attribute disclosure - confidential information is
revealed and can be attributed to a data subject
Identity disclosure is a necessary condition for attribute
disclosure and should be avoided
Group attribute disclosure - learn about a group but not about a
single subject. May cause harm, e.g. all adults in a village collect
unemployment
5
The SDC Problem
R-U Confidentiality Map (Duncan et al., 2001)
Disclosure Risk: probability of re-identification
[Figure: R-U confidentiality map plotting disclosure risk against data
utility, with “No data”, “Released Data” and “Original Data” marked
below a “Maximum Tolerable Risk” threshold]
Data Utility: Quantitative measure of the statistical quality
6
Disclosure Risk Measures
• Frequency tables with full population counts:
- 1’s and 2’s in cells lead to disclosure
- 0’s may be disclosive if only one non-zero cell in
a row or column
Disclosure risk is quantified by the percentage of small cells, or the
probability that a high-risk cell is protected (taking into account
other measurement errors, e.g. imputation rates)
• Magnitude Tables:
Sensitivity measures based on the number of
contributing units and the distribution of the target variable in
the cell
7
Disclosure Risk Measures
• Microdata from surveys (and frequency tables):
Decisions typically based on check lists and ad-hoc
decision rules regarding low frequencies in
combinations of identifying key variables
In recent years, objective quantitative criteria have been
developed for measuring disclosure risk when the population is
unknown:
- Probability that a sample unique is a population unique
- Probability of a correct match between a record in
the microdata to an external file
8
On Definitions of Disclosure Risk
• In the statistics literature, we present examples of risk measures,
but lack formal definitions of when a file is safe
• In the computer science literature, there is a formal definition of
disclosure risk (e.g., Dinur, Dwork, Nissim (2004-5), Adam and
Wortmann (1989))
In some of the CS literature, any data must be released with
noise of magnitude √n. Adding noise of order √n hides
information on individuals and small groups, but yields meaningful
information about sums of O(n) units, for which noise of order
√n is natural
9
On Definitions of Disclosure Risk
The worst-case scenario of the CS approach, for example that the
intruder has all information on everyone in the data set except
the individual being snooped, simplifies definitions, and there is
no need to consider other, more realistic but more complicated
scenarios.
But would Statistics Bureaus and statisticians agree to adding noise
to any data?
Other approaches like query restriction or query auditing do not
lead to formal definitions.
10
On Definitions of Disclosure Risk
Collaboration between the CS and statistical communities is needed, where:
1. In the statistical community, there is a need for more formal
and clear definitions of disclosure risk
2. In the CS community, there is a need for statistical methods to
preserve the utility of the data
- allow sufficient statistics to be released without
perturbation
- methods for adding correlated noise
- sub-sampling and other methods for data masking
Can the formal notions from CS and the practical approach of
statisticians lead to a compromise that will allow us to set
practical but well defined standards for disclosure risk?
11
SDC Methods for Microdata
Data Masking Techniques:
Non-perturbative methods – preserve the integrity of the data
(impact on the variance)
Examples: recoding, local suppression, sub-sampling
Perturbative methods – alter the data (impact on bias)
Examples: adding noise, rounding, microaggregation,
record swapping, post-randomization method, synthetic
data
12
SDC Methods for Microdata
• Additive noise
A random vector (for example, from a normal distribution) is generated
(with zero mean) independently for each individual in the microdata
and added to the continuous variables to be perturbed.
Use correlated noise based on the target variables in order to ensure
equal means and covariance matrix; this also preserves linear balance
edits, i.e. X+Y=Z
Let x ~ iid(μ, σ²) and generate ε ~ iid(μ, σ²) independently.
Calculate:
x̃ = √(1 − δ²)·x + δ·ε
where δ controls the amount of noise. Then
E(x̃) = √(1 − δ²)·E(x) + δ·E(ε)
var(x̃) = (1 − δ²)·σ² + δ²·σ² = σ²
13
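As a quick numerical check of the variance identity above, here is a minimal sketch (the function name mask_with_noise and all parameter values are my own; for simplicity both data and noise are zero-mean, so the mean is preserved as well as the variance):

```python
import math
import random
import statistics

def mask_with_noise(values, delta, sigma, rng):
    """Perturb each value as x' = sqrt(1 - delta^2)*x + delta*eps with
    eps ~ N(0, sigma^2), so var(x') = (1-d^2)*s^2 + d^2*s^2 = s^2."""
    a = math.sqrt(1.0 - delta ** 2)
    return [a * x + delta * rng.gauss(0.0, sigma) for x in values]

rng = random.Random(42)
sigma, delta = 10.0, 0.4
original = [rng.gauss(0.0, sigma) for _ in range(20000)]
masked = mask_with_noise(original, delta, sigma, rng)

# The masked variable keeps (approximately) the original variance
print(statistics.pvariance(original), statistics.pvariance(masked))
```

With delta = 0.4 about 16% of each masked value's variance comes from noise, yet the total variance stays at σ², which is the point of using correlated rather than purely additive noise.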
SDC Methods for Microdata
• PRAM (Post-randomisation method)
Misclassify categorical variables according to a transition matrix P
and a random draw:
p_ij = p(perturbed category is j | original category is i)
For T* the vector of the perturbed frequencies,
T̂ = T*·P⁻¹
is an unbiased moment estimator of the original data.
Under the condition of invariance TP = T (the vector of the original
frequencies is an eigenvector of P), the perturbed file is an unbiased
estimate of the original file.
Expected values of marginal distributions are preserved. Can also ensure
exact marginal distributions by controlling the selection process for
changing records
Use control strata to ensure no silly combinations
14
SDC Methods for Microdata
PRAM Example: T′ = (25, 30, 50, 10)
- Generate a transition matrix with a minimum value on the diagonal
(p_d ≥ 0.8) and all other probabilities equal.

      | 0.8264  0.0579  0.0579  0.0579 |
      | 0.0427  0.8718  0.0427  0.0427 |
P =   | 0.0479  0.0479  0.8563  0.0479 |
      | 0.0598  0.0598  0.0598  0.8207 |

- Calculate the invariant matrix R and determine α such that the final
matrix R* = αR + (1 − α)I will have the desired diagonals

      | 0.8478  0.0496  0.0740  0.0287 |
      | 0.0413  0.8764  0.0598  0.0225 |
R* =  | 0.0370  0.0359  0.9058  0.0213 |
      | 0.0716  0.0674  0.1067  0.7543 |

Note that T′·R* = T′
15
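The invariance property can be verified directly from the matrix above. A sketch (the helper name pram is my own, not from the talk) that also simulates the misclassification draw itself:

```python
import random

T = [25, 30, 50, 10]
R_star = [
    [0.8478, 0.0496, 0.0740, 0.0287],
    [0.0413, 0.8764, 0.0598, 0.0225],
    [0.0370, 0.0359, 0.9058, 0.0213],
    [0.0716, 0.0674, 0.1067, 0.7543],
]

# Expected perturbed frequencies: (T R*)_j = sum_i T_i * R*_ij
expected = [sum(T[i] * R_star[i][j] for i in range(4)) for j in range(4)]
print(expected)  # close to the original (25, 30, 50, 10)

def pram(categories, P, rng):
    """Replace each record's category i by a draw from row i of P."""
    return [rng.choices(range(len(P)), weights=P[c])[0] for c in categories]

rng = random.Random(0)
data = [i for i, t in enumerate(T) for _ in range(t)]  # 115 records
perturbed = pram(data, R_star, rng)
```

Because R* is invariant, the perturbed frequencies are unbiased for the original ones without any post-hoc correction by P⁻¹.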
SDC Methods for Microdata
• Synthetic Data
- Fit data to model, e.g. using multiple imputation techniques
to develop posterior distribution of a population given the
sample data
- Can be implemented on parts of the data where a mixture is
obtained of real and synthetic data
- In practice, very difficult to capture all of the conditional
relationships between variables and within sub-groups
• Microaggregation
- Identify groups of records, e.g. of size 3, and replace values
by group mean (has been shown that it is easy to ‘unpick’
for one variable)
- Carry out on several variables at once using clustering
algorithms for reducing within variance
16
SDC Methods for Magnitude Tables
Cell Suppression
Region /
Industry     36210       36220    36300     36
A           2,608 (5)    11,358    4,871   18,837
B           2,562 (3)    11,631    3,652   17,845
C           2,608 (12)   11,956    3,054   17,618
D           Suppress     12,281    3,051   17,641
E           2,240 (2)     7,347    3,537   13,124
Total       12,327       54,573   18,165   85,065
17
SDC Methods for Magnitude Tables
Cell Suppression
Region /
Industry     36210       36220     36300      36
A           2,608 (5)    11,358     4,871    18,837
B           Secondary    11,631   Secondary  17,845
C           2,608 (12)   11,956     3,054    17,618
D           Suppress     12,281   Secondary  17,641
E           2,240 (2)     7,347     3,537    13,124
Total       12,327       54,573    18,165    85,065
18
SDC Methods for Magnitude Tables
Information Available to Table User
(1) T(1)+T(2) = 12,327−2,608−2,608−2,240 = 4,871    T(2) ≤ 4,871 from (1)
(2) T(1)+T(3) = 17,845−11,631 = 6,214
(3) T(3)+T(4) = 18,165−4,871−3,054−3,537 = 6,703    T(4) ≤ 6,703 from (3)
(4) T(2)+T(4) = 17,641−12,281 = 5,360    T(2) ≥ 5,360−6,703 < 0, so T(2) ≥ 0 from (4), (6)
etc…
(5) T(1)>0, (6) T(2)>0, (7) T(3)>0, (8) T(4)>0

Represent as a matrix equation and vector inequality  A·T = b,  T > 0,  where

     | 1 0 1 0 |       | T(1) |       | 6214 |
A =  | 0 1 0 1 |   T = | T(2) |   b = | 5360 |
     | 1 1 0 0 |       | T(3) |       | 4871 |
     | 0 0 1 1 |       | T(4) |       | 6703 |
19
SDC Methods for Magnitude Tables
Disclosure Protection
Determine upper and lower bounds for T(1), ….., T(4)
(feasibility intervals) using eight linear programming solutions.
1a. maximise T(1) subject to AT=b, T>0
1b. minimise T(1) subject to AT=b, T>0
2a. maximise T(2) subject to AT=b, T>0
etc…
There must be ‘feasible’ solutions and true values of T(X) will lie
within bounds.
Let bounds be T(1)L and T(1)U etc.
20
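Because the four constraints here have rank 3, every feasible solution is determined by T(1) alone: T(2) = 4871 − T(1), T(3) = 6214 − T(1), T(4) = 489 + T(1). A simple integer sweep can therefore stand in for the eight linear-programming runs. A sketch (function name my own):

```python
def feasibility_intervals():
    """Bounds of T(1)..T(4) over all non-negative solutions of A T = b,
    found by sweeping the single free parameter T(1)."""
    lo = [float("inf")] * 4
    hi = [float("-inf")] * 4
    for t1 in range(0, 4872):  # T(1) >= 0 and T(2) = 4871 - T(1) >= 0
        sol = [t1, 4871 - t1, 6214 - t1, 489 + t1]
        if all(v >= 0 for v in sol):
            for i, v in enumerate(sol):
                lo[i] = min(lo[i], v)
                hi[i] = max(hi[i], v)
    return list(zip(lo, hi))

print(feasibility_intervals())
# [(0, 4871), (0, 4871), (1343, 6214), (489, 5360)]
```

In a larger table with no such closed-form parametrisation one would call an LP solver for each bound, exactly as the slide describes.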
SDC Methods for Magnitude Tables
Disclosure Protection
            36210        36220    36300          36
A          2,608         11,358   4,871          18,837
B          [0 , 4871]    11,631   [1343 , 6214]  17,845
C          2,608         11,956   3,054          17,618
D          [0 , 4871]    12,281   [489 , 5360]   17,641
E          2,240          7,347   3,537          13,124
Total      12,327        54,573   18,165         85,065
21
SDC Methods for Magnitude Tables
Choice of Secondary Cells
Stipulate a requirement on T(X)L and T(X)U to ensure the interval is
sufficiently wide, with a fixed percentage,
e.g. [T(X)U − T(X)L] ≥ (p/100)·T(X) for all X
Employ sensitivity measure:
Require T(X)U > T(X) + (p/100)·T(X)
and by symmetry T(X)L < T(X) − (p/100)·T(X)
Sliding rule protection – only the width is predetermined and the interval
may be skewed
22
SDC Methods for Magnitude Tables
Choice of Secondary Cells
Many possible sets of suppressed cells exist (including all cells!). Define a
target function and minimise subject to constraints for
preserving protection intervals
Idea: Target function: Cost = information content of the cells,
Σ C(X) over suppressed cells X
Common choices of C(X):
a) C(X)=1      minimise the number of cells suppressed
b) C(X)=N(X)   minimise the number of contributors suppressed
c) C(X)=T(X)   minimise the total value suppressed (all cells must be
non-negative)
23
SDC Methods for Magnitude Tables
Choice of Secondary Cells
Hypercube method: simple but not optimal
In a k-dimensional table, choose a k-dimensional hypercube with
the sensitive cell in a corner. All 2^k corner points are suppressed
Criteria:
• A corner can’t be zero, since structural zeros may allow
recalculating other corners
• Feasibility intervals should be sufficiently wide (intervals are simpler
to calculate on a hypercube)
• Enumerate possible suppression candidates and choose the one with
minimal information loss (minimise the cost function)
• A priori, choose cells that were previously suppressed to
minimize information loss, by putting a large negative cost on
the suppressed cells
24
SDC Methods for Frequency Tables
Rounding
Round frequencies:
- deterministic   e.g. to the nearest 5
- random          e.g. to a close multiple of 5
(for these, interior cells and margins are usually rounded
independently, so tables don’t add up)
- controlled      e.g. to a multiple of 5, where margins = sum of
interior cells
Can implement rounding on only the small cells of the table; margins are
then added up from perturbed and non-perturbed cells
25
SDC Methods for Frequency Tables
Rounding
Example - complete census

        1    2    3    4   Total
A       0    1    0    0     1
B       5    2    2    4    13
C       6    1    0    3    10
D       4    7    0    4    15
Total  15   11    2   11    39
What types of disclosure risk are present in this table?
26
SDC Methods for Frequency Tables
Rounding
Deterministic rounding process to base 3 (original → rounded):

         1       2       3      4      Total
A       0→0    1→0    0→0    0→0     1→0
B       5→6    2→3    2→3    4→3    13→12
C       6→6    1→0    0→0    3→3    10→9
D       4→3    7→6    0→0    4→3    15→15
Total  15→15  11→12   2→3   11→12   39→39

The published table

        1    2    3    4   Total
A       0    0    0    0     0
B       6    3    3    3    12
C       6    0    0    3     9
D       3    6    0    3    15
Total  15   12    3   12    39
27
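The deterministic rounding above can be sketched in a few lines (names are my own; for integer entries and base 3 the residue is never exactly half the base, so no tie-breaking rule is needed):

```python
def round_to_base(x, b=3):
    """Deterministically round x to the nearest multiple of the base b."""
    return b * round(x / b)

original = {
    "A": [0, 1, 0, 0], "B": [5, 2, 2, 4],
    "C": [6, 1, 0, 3], "D": [4, 7, 0, 4],
}
rounded = {row: [round_to_base(x) for x in cells]
           for row, cells in original.items()}
# Margins are rounded independently of the interior cells:
rounded_totals = {row: round_to_base(sum(cells))
                  for row, cells in original.items()}

print(rounded["B"], rounded_totals["B"])  # [6, 3, 3, 3] 12, no longer additive
```

Row B illustrates the non-additivity: the rounded interior cells sum to 15 while the independently rounded margin is 12, which is what controlled rounding later repairs.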
SDC Methods for Frequency Tables
Rounding
Random rounding algorithm:
• Let Floor(x) be the largest multiple bk of the base b such that
bk ≤ x, for an entry x
• Define res(x) = x − Floor(x)
• x is rounded up to (Floor(x) + b) with probability res(x)/b
and rounded down to Floor(x) with probability (1 − res(x)/b)
• If x is already a multiple of b, it remains unchanged

The expected value of the rounded entry is the original entry since:
(x − Floor(x))·(1 − res(x)/b) + (x − (Floor(x) + b))·(res(x)/b) = 0

Each small cell is rounded independently in the table.
Can also control the selection process to ensure additive totals in
one dimension.
28
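A sketch of the algorithm above (function names my own); the unbiasedness can be checked exactly from the two branch probabilities rather than by simulation:

```python
import random

def floor_to_base(x, b=5):
    """Largest multiple of b not exceeding x."""
    return (x // b) * b

def random_round(x, b, rng):
    """Unbiased random rounding to base b: up with probability res(x)/b,
    down with probability 1 - res(x)/b; multiples of b are unchanged."""
    down = floor_to_base(x, b)
    res = x - down
    if res == 0:
        return x
    return down + b if rng.random() < res / b else down

def expected_rounded(x, b=5):
    """E[rounded x] = Floor(x)*(1 - res/b) + (Floor(x)+b)*(res/b)."""
    down = floor_to_base(x, b)
    res = x - down
    return down * (1 - res / b) + (down + b) * (res / b)

print(expected_rounded(7))  # expectation equals the original entry, 7
```

For example an entry of 7 with base 5 is published as 5 with probability 3/5 and as 10 with probability 2/5, and 5·(3/5) + 10·(2/5) = 7.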
SDC Methods for Frequency Tables
Rounding
A typical rounding scheme (margins rounded separately):
1 → 0 with probability 2/3, 3 with probability 1/3
2 → 0 with probability 1/3, 3 with probability 2/3
4 → 3 with probability 2/3, 6 with probability 1/3
…

An example of random rounding (original → rounded):

         1       2      3      4      Total
A       0→0    1→3    0→0    0→0     1→0
B       5→3    2→0    2→3    4→6    13→15
C       6→6    1→0    0→0    3→3    10→12
D       4→6    7→9    0→0    4→3    15→15
Total  15→15  11→9    2→3   11→12   39→39
29
SDC Methods for Frequency Tables
Rounding
Complete census in small area, after random rounding

Age / benefit   16-25  26-39  40-59  60-69  70-79  80+   Total
claimed           20     15     15      5      5    0      60
not claimed       25     10      5      0      0    0      45
Total             40     30     15      0      5    5     105
- Ones and twos disappear
- Doubt cast on zeroes so disclosure prevented
- Figures don’t add up, may allow table to be “unpicked”
30
SDC Methods for Frequency Tables
Rounding
Controlled Rounding – rounding in such a way that tables are
additive

Age / benefit   16-25  26-39  40-59  60-69  70-79  80+   Total
claimed           20     15     15     10      0    0      60
not claimed       20     15      5      0      5    0      45
Total             40     30     20     10      5    0     105

- Ones and twos disappear
- Doubt cast on zeros so disclosure prevented
- Tables are additive
- Zero-restricted – an entry that is already an integer multiple of
the base b is unchanged
31
Auditing Random Rounding
Example - Random Rounding to base 5

Original table:
         Under 30  30-60  Over 60  Total
Male         6       7       1      14
Female       5       4       0       9
Total       11      11       1      23

Rounded table:
         Under 30  30-60  Over 60  Total
Male         5      10       0      10
Female       5       5       0       5
Total       15      15       0      20

• Feasible interval generally 8 wide (e.g. a published 5 corresponds to
a true value between 1 and 9), except for column 3, which is 4 wide
(between 0 and 4)
• Column one and row one do not add up to the totals, nor the one-way
margins to the grand total
32
Auditing Random Rounding
Example – only the rounded one-way margins and total are released:

         Under 30  30-60  Over 60  Total
Male                                 10
Female                                5
Total       15      15       0      20

Restrict attention just to the one-way margins and the total.
33
Auditing Random Rounding
• Feasible intervals from the rounding alone:
Under 30: 15 → [11, 19]
30-60:    15 → [11, 19]
Over 60:   0 → [0, 4]
Male:     10 → [6, 14]
Female:    5 → [1, 9]
Total:    20 → [16, 24]
34
Auditing Random Rounding
After imposing additivity of the true table:
Under 30: 15 → [11, 12]
30-60:    15 → [11, 12]
Over 60:   0 → [0, 1]
Male:     10 → [13, 14]
Female:    5 → [8, 9]
Total:    20 → [22, 23]
• Feasible intervals are sometimes much narrower than
the rounding method suggests.
• In some cases, where frequencies are low, this can result in
potential disclosure.
35
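The narrowing above can be reproduced by brute-force enumeration: take the interval implied by each rounded margin and keep only the combinations that satisfy the additivity of the true table. A sketch (variable names my own):

```python
from itertools import product

# Rounded one-way margins and grand total, base 5:
rounded = {"u30": 15, "m3060": 15, "o60": 0, "male": 10, "female": 5, "total": 20}
names = ("u30", "m3060", "o60", "male", "female", "total")

# Each rounded value r constrains the true value to [max(0, r-4), r+4]
ranges = {k: range(max(0, v - 4), v + 5) for k, v in rounded.items()}

# Keep only candidates where the true margins are additive
feasible = [
    cand for cand in product(*(ranges[k] for k in names))
    if cand[0] + cand[1] + cand[2] == cand[5] and cand[3] + cand[4] == cand[5]
]

intervals = {
    name: (min(c[i] for c in feasible), max(c[i] for c in feasible))
    for i, name in enumerate(names)
}
print(intervals)
```

The enumeration confirms the audited intervals, e.g. the true Male total must be 13 or 14 even though its published rounded value is 10.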
Impact on Analysis
• Loss of information – combining categories
• Inflated or deflated variance:
IL = Var(θ̂(data_new)) / Var(θ̂(data_old))
• Bias and inconsistency in the data:
Bias(θ̂(data_new)) = E(θ̂(data_new)) − θ
Some SDC methods are transparent, and users can take them into
account, e.g. rounding. Other methods have hidden bias, and the
effects are difficult to assess, e.g. record swapping
36
Information Loss Measures
Types of Information loss measures depending on use of
data:
• Distortion to distributions and totals (bias) as measured by
distance metrics, entropy, average perturbation per cell, etc.
• Impact on variance of estimates
• Impact on measures of association based on chi-squared tests for
independence
• Impact on goodness-of-fit criteria, regression coefficients,
statistical analysis and inference
37
Information Loss Measures
Measures for Bias and Distortion
• Hellinger’s Distance
HD(D_orig, D_pert) = (1/|OA|) Σ_{k=1..|OA|} √( ½ Σ_{c∈k} ( √(D_pert^k(c)) − √(D_orig^k(c)) )² )
• Relative Absolute Distance
RAD(D_orig, D_pert) = (1/|OA|) Σ_{k=1..|OA|} Σ_{c∈k} |D_pert^k(c) − D_orig^k(c)| / D_orig^k(c)
• Average Absolute Distance per cell
AAD(D_orig, D_pert) = (1/|OA|) Σ_{k=1..|OA|} Σ_{c∈k} |D_pert^k(c) − D_orig^k(c)| / |k|
where |k| = Σ_c I(c ∈ k)

Method    SCA      CSCA     CRND
HD         5.272    5.279    5.416
RAD       76.804   76.824   84.641
AAD        0.629    0.630    1.021

SCA – small cell rounding
CRND – semi-controlled full rounding
38
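For a single table (|OA| = 1) the three measures above reduce to a few lines. A sketch (function names and the toy tables are my own; zero cells are skipped in RAD to avoid division by zero):

```python
import math

def hellinger(orig, pert):
    """Hellinger distance between two frequency tables (flat lists)."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for o, p in zip(orig, pert)))

def rad(orig, pert):
    """Relative absolute distance, summed over (non-zero) cells."""
    return sum(abs(p - o) / o for o, p in zip(orig, pert) if o > 0)

def aad(orig, pert):
    """Average absolute distance per cell."""
    return sum(abs(p - o) for o, p in zip(orig, pert)) / len(orig)

orig = [0, 5, 6, 4, 1, 2, 1, 7]
pert = [0, 6, 6, 3, 0, 3, 0, 6]  # e.g. after rounding to base 3
print(aad(orig, pert))  # 0.75
```

Averaging each measure over all output areas (OA) gives the slide's versions.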
Information Loss Measures
Measures for Bias and Distortion
Absolute difference per cell:
AD(N_orig, N_pert) = |N_pert(C) − N_orig(C)|
[Figure: AD plotted for 10 consecutive OA’s under record swapping:
R - random, R/I - random (no imputed), T - targeted]
39
Information Loss Measures
Impact on Measure of Association – Cramer’s V
On the two-way tables defined by OA * Age-Sex and
Economic Activity * Long-Term Illness calculate:
CV = √( χ² / ( n · min(R−1, C−1) ) )
The information loss measure:
RCV(D_orig, D_pert) = 100 × ( CV(D_orig) − CV(D_pert) ) / CV(D_orig)

Record swapping (SJ, original Cramer’s V = 0.2021):
Method      1%     10%    20%
Random      0.3%   2.8%   4.8%
Rand/Imp    0.3%   2.0%   3.8%
Targeted    0.1%   1.4%   3.3%

Rounding (SJ, original Cramer’s V = 0.2021):
Method   SCA     CSCA    CRND
RCV      -6.8%   -6.7%   -7.8%
40
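Cramer's V and the RCV measure above can be computed in plain Python. A sketch (function names and the small example tables are my own):

```python
def cramers_v(table):
    """Cramer's V = sqrt(chi^2 / (n * min(R-1, C-1))) for a 2-way table."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = sum(
        (table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(len(table)) for j in range(len(table[0]))
    )
    return (chi2 / (n * min(len(table) - 1, len(table[0]) - 1))) ** 0.5

def rcv(orig, pert):
    """Percent loss in association: 100*(CV(orig)-CV(pert))/CV(orig)."""
    return 100 * (cramers_v(orig) - cramers_v(pert)) / cramers_v(orig)

orig = [[20, 15], [10, 25]]
pert = [[18, 17], [12, 23]]  # hypothetical perturbed counts
print(cramers_v(orig), rcv(orig, pert))
```

A negative RCV, as in the rounding results above, means the perturbed table shows a stronger apparent association than the original.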
Disclosure Control Techniques
Record Swapping – Region SJ
[Figure: disclosure risk versus information loss for record swapping at
1%, 10% and 20% swapping rates (Random, Rand/Imp, Targeted): percent of
records unperturbed in small cells against average perturbation per cell]
Disclosure risk measure – percent of records in small cells of the tables
that were not perturbed or not imputed
Information loss measure – average absolute difference per cell:
AAD(D_orig, D_pert) = Σ_{c∈C} |D_orig(c) − D_pert(c)| / |C|
41