Statistical Disclosure Risk:
Separating Potential and Harm
Chris Skinner
Department of Statistics, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK.
Email: [email protected]
Summary
Statistical agencies are keen to devise ways to provide research access to data while protecting confidentiality. Although methods of statistical disclosure risk assessment are now well established in the statistical science literature,
the integration of these methods by agencies into a general scientific basis for their practice still proves difficult.
This paper seeks to review and clarify the role of statistical science in the conceptual foundations of disclosure risk
assessment in an agency’s decision making. Disclosure risk is broken down into disclosure potential, a measure of the
ability to achieve true disclosure, and disclosure harm. It is argued that statistical science is most suited to assessing
the former. A framework for this assessment is presented. The paper argues that the intruder’s decision making and
behaviour may be separated from this framework, provided appropriate account is taken of the nature of potential
intruder attacks in the definition of disclosure potential.
Key words: Confidentiality; disclosure limitation; identification; intruder.
1 Introduction
The challenge of devising ways to provide researchers with access to microdata and other statistical outputs while protecting
confidentiality continues to be the subject of intense interest and development around the world. Given the wide variety of
contexts in which this challenge arises, it seems unlikely that these developments will converge on any single solution. National Research Council (2005, p.68) recommends that “data produced or funded by government agencies should continue
to be made available for research through a variety of modes, including various modes of restricted access to confidential
data and unrestricted access to public-use data altered in a variety of ways to maintain confidentiality”.
The prototypical set-up is depicted in Figure 1. An agency undertakes a survey (or similar exercise) in which data are
collected from data subjects. The agency has a remit to use these data to produce outputs which will serve the statistical
needs of the agency’s users. The agency faces the potential problem of statistical disclosure, however, that one of the users
may be able to use the outputs to disclose information about one of the data subjects. A (hypothetical) user who ‘misuses’
the outputs in this way is called an intruder. To address this problem, the agency needs to decide what mode of access to
employ and, given this mode of access, what statistical disclosure limitation (SDL) methods to employ in the specification
of the statistical outputs. We include in the scope of SDL not only methods which modify data by restriction of detail,
perturbation (Willenborg and De Waal, 2001, Ch. 4, 5) or conversion into synthetic data (Reiter, 2009), but also methods
employed to restrict the outputs which a researcher can obtain from a remote access analysis server (Gomatam et al., 2005a)
or can take out from a research data centre (National Research Council, 2005, pp. 29-31).
Experience with different modes of access can lead to consensus on ’best practice’ for some features of these approaches.
Nevertheless, it is widely recognized that questions about when confidentiality is adequately protected by an SDL method
and how to come to that judgement remain hard ones to answer across most modes of access. The decision problem is often
posed in a ‘risk-utility’ framework (Duncan et al., 2011, Ch. 6). The extent to which confidentiality is protected is assessed
by measure(s) of what is called statistical disclosure risk. The extent to which the outputs satisfy the statistical purposes
for which they are produced is assessed by measure(s) of utility. An agency has to trade off these two criteria. The fact
that attempts to reduce disclosure risk often also lead to a decrease in utility is described by Doyle et al. (2001, p.1) as “a
fundamental tension at the heart of every statistical agency’s mission”.
Research into the assessment of disclosure risk has a long history in statistical science, going back at least to the 1970s,
e.g. Fellegi (1972) and Dalenius (1977). It also remains an active current area of research, e.g. at the interface with computer
science (Dwork et al., 2006; Abowd et al., 2009; Wasserman and Zhou, 2010). Why then does it remain difficult to use this
body of work to support decisions about statistical disclosure in practice? Cox et al. (2011) consider that the risk-utility
framework does provide a common language for thinking about confidentiality but they argue, through examples, that it is
much less useful in practice and conclude that “today there is not a science of data confidentiality”.

Figure 1: The Different Parties in the Prototypical Set-up where Statistical Disclosure may arise. [Diagram: Data Subjects provide Data to the Agency, which produces Outputs for Users; the Intruder is a (hypothetical) user who misuses the outputs.]
In this paper it will be argued that there is a place for statistical science in disclosure risk assessment, but that it is
important to separate out what can be achieved by statistical science and what aspects of decision making require other
inputs, such as policy judgements. To help obtain this separation, we propose to break down the notion of disclosure risk.
The term ‘risk’ has many meanings across scientific disciplines and we take as our starting point its use in risk management (Vaughan, 1997) as an uncertain event with potentially adverse consequences. The risk of statistical disclosure then
has two dimensions: the probability of statistical disclosure and a measure of the harmful consequences which may ensue if
statistical disclosure occurs. Consideration of both dimensions may be found in the statistical literature. Much attention is
devoted to the probability of disclosure (e.g. Reiter, 2005). A number of papers (see end of Section 2.2) suggest, however,
that disclosure risk should also embrace the notion of harm. A difficulty in formulating a definition for this two-dimensional
representation of disclosure risk, however, is to decide whether statistical disclosure should refer to true disclosure or, more
broadly, to any kind of attack and claimed disclosure, whether true or not. We argue in Section 2.2 that, in the context of
many official confidentiality statements, it is appropriate that the probability of disclosure refers to true disclosure. On the
other hand, we consider that when an agency wishes to assess and manage the potential harm from threats to confidentiality
it is appropriate to adopt the broader definition. In summary, we propose to break down the notion of disclosure risk into
two separate notions: a measure summarising the uncertainty about whether true disclosure occurs, which we shall refer to
as disclosure potential (or identifiability when disclosure is defined as identification) and a measure of the adverse consequences of potential attempts at disclosure (whether successful or not), which we shall refer to as disclosure harm. We shall
argue that statistical science is primarily suited to assessing disclosure potential but not disclosure harm.
A further source of complexity in developing methodology for decision making about disclosure is the existence of the
multiple parties in Figure 1, each of which may make decisions. We shall take the agency’s decisions as the central ones
which statistical science is needed to support, but we shall seek to clarify the link between these decisions and the intruder’s
perspective. Fienberg et al. (1997, p.75) argue that “a data collection agency must consider disclosure from the perspective
of an intruder in order to efficiently evaluate data disclosure limitation procedures”. We shall review this idea.
The aim of the paper is then to review, explore and clarify the conceptual basis and logic of decisions by a statistical
agency relating to disclosure risk. In particular, we seek to clarify the role of statistical science in the agency’s decision
making and, by separation of tasks, to limit the amount of non-statistical assumptions (about matters such as intruder
judgements and behaviour) required for disclosure risk assessment.
The paper is intended to support the work of statistical agencies in two ways: first, by offering alternative ways of
providing transparent and defensible bases of SDL decisions and second, by facilitating the division of labour between statistical methodologists and others, such as policy analysts, within a statistical agency regarding the preparation of evidence
to support SDL decisions.
Given the breadth of the field, we first set some limits to the scope of the paper:
• we recognize, as discussed by National Research Council (2005, pp. 55-59), that confidentiality breaches may occur
for reasons of carelessness, illegal intrusions, law enforcement and national security as well as statistical disclosure,
but we restrict attention to the latter;
• we focus on broader conceptual issues and not the detail of specific SDL techniques;
• we shall not discuss utility in any detail;
• while decision theoretic ideas underlie parts of the development, we shall not attempt any detailed formalization;
• we do not seek to provide a comprehensive review of the literature; rather we aim to identify and discuss what are
judged to be the key ideas;
• we refer to an agency, with government statistical agencies in mind, since these have been the main drivers of developments in SDL methodology, but our discussion should be relevant to other kinds of organisation with responsibilities
both to disseminate statistical data products to multiple users and to protect the confidentiality of the sources of the
data;
• we restrict attention to the prototypical set-up for a single agency, outlined in this section, which provides a basis for
discussing key conceptual issues in SDL methodology, and do not consider e.g. secure multi-party computation (Karr
et al., 2005; Fienberg, 2006) for distributed databases across several agencies;
• we use the term survey to denote the source of the data, but most of what we say will also be relevant to other sources
such as a census or an administrative source.
The paper is organised as follows. We contrast the agency’s and the intruder’s perspectives in Sections 2 and 3, starting
by considering the agency as a decision maker, which is our principal interest. We then proceed to consider the notion of
disclosure risk in more detail in Section 4. We reflect further on the role of harm in Section 5 and provide some conclusions
in Sections 6 and 7.
2 The Agency as Decision Maker
In deciding what actions to take to protect against statistical disclosure, the agency needs to consider:
• the nature of the alternative actions;
• the loss to the agency arising from the consequences of its actions.
Zaslavsky and Horton (1998) provide an example, where the agency needs to decide between two actions, whether to
publish an output or not, and the loss to the agency as a result of these actions is defined in terms of the threat of disclosure,
if the agency decides to publish, or the loss of information, if the agency takes the alternative action.
We discuss the nature of the actions and the loss in Sections 2.1 and 2.2, respectively. We shall return to outlining the
decision framework in general terms in Section 2.3.
2.1 The Agency’s Actions
We distinguish two kinds of actions which an agency may take when confronted by potential threats to confidentiality:
• the use of SDL methods as part of the process of determining the statistical outputs: examples include techniques
which transform or perturb outputs, such as recoding, sampling or creation of missing values, adding noise, swapping
values, replacing microdata by synthetic data;
• additional disclosure management actions to protect against and discourage disclosure attempts and the misuse of
outputs: examples include access controls, where users must sign up to certain conditions in a licence agreement,
which may include penalties for misuse; or training of users in why confidentiality needs to be protected.
See e.g. Willenborg and De Waal (2001) and Doyle et al. (2001) for references to the statistical literature on SDL
methods and National Research Council (2005, Ch. 2, 5) for broader issues of disclosure management.
2.2 The Loss to the Agency
In order to decide which action to take in the light of possible threats, the agency needs to be able to evaluate the potential
consequences of its decisions. In this section, we discuss the nature of the principal criteria relevant to this evaluation and
refer to these as loss criteria. The need for multiple criteria makes the decision a complex value problem in the terminology
of Keeney and Raiffa (1976, Sect. 1.4.1). We address the question of how to accommodate these multiple criteria when
making decisions in Section 2.3.
It is natural for the agency’s obligations to protect confidentiality to provide the basis of key loss criteria. These obligations might be legal, professional or ethical. For example, it is stated in the United States Code Title 13 that the Census
Bureau is prohibited from producing outputs “whereby the data furnished by any particular establishment or individual
under this title can be identified”. Such obligations are frequently expressed in terms of the ability of anyone with access to
the outputs to use them to achieve disclosure or breach confidentiality. We conceive of this ability in terms of the potential
to infer the value of a target y, embracing the possibilities of:
• predictive disclosure: where y is the value of a survey variable for a particular known individual or other unit;
• identity disclosure (also called identification): where y is the value of a binary variable indicating whether a particular
element of the output (e.g. a record in a microdata file) belongs to a particular known individual or other unit.
The Title 13 example above refers to identity disclosure, and the ability to infer this may be called identifiability. This
term is used by Government Statistical Service (2009, p.12) to explain the requirements for confidentiality protection in the
UK Code of Practice for Official Statistics: an output “will not usually directly reveal the identity of an individual, but more
usually the risk is that the statistic may make an individual identifiable - the individual has not yet been identified, but it
may be possible to do so”.
We shall represent inference about y by a predictive probability distribution p(y|O, D, attack method), which depends
on O, the statistical output, and D the auxiliary data available to an intruder who may seek to learn about y using a particular
attack method. There will generally be a set of such distributions for different targets y (corresponding to different data
subjects and variables) and different possible sources of auxiliary data, Dk , and attack methods k (k = 1, . . . , K).
We take the agency’s first loss criterion to be a summary of these predictive distributions in what we call the disclosure
potential, (or identifiability in the case of identity disclosure) expressed as the function: H(O; Dk , attack method k; k =
1, . . . , K). The agency is assumed here to have integrated out the uncertainty in the predictive distributions for the various
possible targets y. The form of H will depend on what the agency judges to be its obligations. If, for example, these
relate to identification and the agency focuses on the worst case of identification, H could be the maximum value of
p(y|O, D, attack method) across all identity indicators y for different output elements and data subjects. Alternatively, if
the agency focuses on how often this probability exceeds 0.5 or some other threshold, H might count the number of data
subjects for which this is the case for some element of the output.
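To make this concrete, the following sketch (in Python, with purely illustrative names and probability values that are not taken from the paper) shows how these two summaries of disclosure potential might be computed from a set of predictive identification probabilities.

```python
# Hypothetical sketch: summarising predictive identification probabilities
# p(y | O, D, attack method) into a disclosure-potential measure H, under the
# two options described in the text. All values are illustrative placeholders.

def worst_case_potential(probs):
    """H as the maximum identification probability over all targets y."""
    return max(probs)

def threshold_count_potential(probs, threshold=0.5):
    """H as the number of data subjects whose identification probability
    exceeds the chosen threshold for some element of the output."""
    return sum(1 for p in probs if p > threshold)

# probs[i]: identification probability for data subject i against the output
# element most closely matching that subject (illustrative values only)
probs = [0.02, 0.10, 0.65, 0.30, 0.98]

print(worst_case_potential(probs))       # 0.98
print(threshold_count_potential(probs))  # 2
```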
We shall discuss the notion of predictive probability distribution further in Section 4. We draw on the use of this distribution by Duncan and Lambert (1986) as a unifying representation of disclosure risk, but do not follow their interpretation of
this distribution in terms of the intruder’s perspective, as will be discussed in Section 3. Note also that there is a significant
part of the disclosure limitation literature where disclosure relates to deductive (mathematical) inference, e.g. the notions
implicit in the p% rule (Cox, 2001) and not probabilistic inference as supposed here. In principle, it would seem that such
measures could also be combined into a disclosure potential function, since this is not required to take a probabilistic form,
but we shall not explore this possibility further here.
Although agencies’ legal and professional obligations to protect confidentiality appear most often to be expressed in
terms of inferential criteria with disclosure viewed as the potential outcome of concern, there are also ethical and other
reasons why agencies will wish to take account of the possible undesirable consequences of possible disclosure or, more
generally, of any actions by intruders designed to breach confidentiality. We refer to such consequences as disclosure harm
(c.f. Lambert, 1993). Such consequences may refer to:
• respondents: for example the impact of whatever actions an intruder might take following an attempt at disclosure,
whether the resulting disclosure is correct or false; such harm might be ’legal, reputational, financial or psychological’
(National Research Council, 2007, p.14) and
• the agency: for example, the impact of publicity about reported disclosures on the agency’s reputation and on response
rates to surveys run by the agency (Couper et al., 2010).
Disclosure harm is our second loss criterion. We discuss its definition further in Section 5. We note that the assessment
of harm may need to take account of considerable uncertainty about potential intruder actions, but we treat this uncertainty
as quite separate from the inferential uncertainty arising in the assessment of disclosure potential.
Both disclosure potential and disclosure harm are criteria which capture the possible consequences of the agency’s
actions for respondents and for the agency. It is also, of course, essential to consider the consequences for the genuine
users of the outputs. This impact is represented by the utility of the output, which refers to how far it meets the analytical
needs of users. Many SDL methods can impact negatively on these needs either through loss of information, e.g. loss of
geographical detail, or through reductions of data quality, e.g. by affecting the biases and variances of users’ estimators.
These impacts are of crucial importance when evaluating SDL methods but, as noted earlier, are not the main focus of this
paper. For further discussion see Karr et al. (2006) and Woo et al. (2009).
In summary, we have argued that the loss to the agency be based upon three principal criteria:
• disclosure potential;
• disclosure harm;
• utility of output to users.
The distinction between disclosure potential and harm is fundamental in this paper, in contrast to some of the literature
where the notion of harm is subsumed within that of risk. For example, Duncan et al. (2001b) view disclosure risk, in its
most general form, as including the consequences of intrusion both for the agency and for the intruder, and this enables
risk-utility analyses to have more general applicability. Dobra et al. (2003) also subsume harm within risk by referring to
disclosure risk as “the extent to which [the output’s] release can harm the agency or the data providers”. To reiterate, our
distinction between disclosure potential and harm is that the former refers to the ability of the intruder to disclose (i.e. infer)
information, whereas the latter refers to consequences of intruder actions (following a deliberate or inadvertent attempt at
disclosure). A related distinction between disclosure risk and harm is presented by Lambert (1993). She restricts disclosure
risk to considerations of identity disclosure, whereas it is assumed here that disclosure potential may also refer to predictive
disclosure. She states that she considers the latter form of disclosure only to the extent that it may lead to disclosure harm.
In this paper, we suppose that identity disclosure could also lead to harm. Our distinction is similar to that in National
Research Council (2007, pp. 13-15) and to that between disclosure harm and ’discredit harm’ in Trottini (2001).
2.3 The General Decision Problem
The elements of the decision framework outlined in the two previous sections are brought together in Figure 2. The two
components of the agency’s actions (SDL methods and disclosure management) each impact directly on what analyses the
users can conduct and on the environment in which these analyses are undertaken and hence each component has direct
consequences for utility. Our primary focus here, however, is on the other loss criteria.
Disclosure potential is defined in terms of the capacity of an intruder to make inferences. The nature of these inferences
depends upon the nature of the statistical outputs (and hence the SDL method) and the information additionally available to
the intruder. The extent to which disclosure potential may depend additionally upon the way in which an intruder might
attempt disclosure and, indirectly, upon the agency’s management approach to discourage such an attempt is discussed in
Section 4.
The actions of the intruder, as represented in Figure 2, are taken to refer to any actions resulting from an attempt at
disclosure. Examples include a journalist publishing a claim that they have identified a known individual in an output and
possibly disclosing information about that person or a commercial intruder linking information from the output into a credit
referencing database (Paass, 1988). It is supposed that the nature of the actions will depend upon the intruder’s capacity
to make inferences (but not on the outputs themselves other than via these inferences) as well as upon other motivations
of the intruder. It is assumed that the actions will also be influenced by the agency’s disclosure management approach, for
example by any penalties for misusing the outputs. The disclosure harm is taken to be purely a function of these actions.

Figure 2: Framework for Agency’s Decision Making. [Diagram: the Agency’s choices of Disclosure Limitation Methods and Disclosure Management affect Utility; the Intruder’s Inference determines Disclosure Potential and the Intruder’s Actions determine Disclosure Harm.]
Given the three distinct loss criteria, the general decision problem faced by the agency is a multiple objective one
(Keeney and Raiffa, 1976). In fact, the multiplicity of objectives is even greater since each of the loss criteria may be multidimensional. The measurement of utility will usually require a trade-off between the needs of different users, for example
reducing the geographic detail in an output may reduce the utility of the output far more to a user in local government (in a
small municipality) than to a user in national government. Measures of the potential for predictive disclosure are variable-specific and thus typically multiple. Measuring harm also requires consideration of multiple dimensions, for example the
harm to respondents vs. the harm to the agency.
There is an extensive literature on multiple objective decision making (e.g. Keeney and Raiffa, 1976; French, 1988). A
classical approach would build on an assumption that the agency has preferences between any pair of different consequences
to demonstrate the existence of a real-valued loss function of the different loss criteria and actions (DeGroot, 1970, Ch. 7).
The agency’s optimal decision would then be to choose that action which minimized the expected loss (DeGroot, 1970, Ch.
8). We do not seek to develop the application of such an approach in this paper, however. We note that any such approach
would need to address the trade-off between the three loss criteria (Doyle et al., 2001, p.1). See e.g. Keeney and Raiffa
(1976) for some discussion of trade-offs under uncertainty including, for example, the notion of efficient frontiers. Gomatam
et al. (2005b) provide some discussion of such notions in the context of the two criteria of disclosure risk and utility. Such
ideas could be extended to the trade-off between utility, harm and disclosure potential. Thus, for example, an agency might
seek to maximize utility subject to upper bound constraints on each of disclosure potential and harm. A partial order on
release options might be defined with respect to each of decreasing potential, decreasing harm and increasing utility and an
associated potential-harm-utility frontier defined.
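As a rough illustration of this constrained trade-off, the sketch below (with hypothetical option names, scores and tolerance levels, none of which appear in the paper) selects the release option maximising utility subject to upper bounds on disclosure potential and harm.

```python
# Hypothetical sketch: choosing among release options by maximising utility
# subject to upper-bound constraints on disclosure potential and harm.
# Option names, scores and tolerances are illustrative placeholders.

options = {
    # option: (disclosure potential, disclosure harm, utility)
    "full detail":      (0.90, 0.70, 1.00),
    "recode geography": (0.40, 0.30, 0.80),
    "recode + noise":   (0.15, 0.10, 0.60),
    "synthetic data":   (0.05, 0.05, 0.45),
}

MAX_POTENTIAL = 0.50  # tolerance for disclosure potential (a policy judgement)
MAX_HARM = 0.40       # tolerance for disclosure harm (a policy judgement)

feasible = {name: (p, h, u) for name, (p, h, u) in options.items()
            if p <= MAX_POTENTIAL and h <= MAX_HARM}

best = max(feasible, key=lambda name: feasible[name][2])
print(best)  # 'recode geography' under these illustrative scores
```

The tolerance levels themselves are policy judgements of the kind discussed below.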
Specifying upper bound constraints for disclosure potential and harm will typically represent a key challenge for an
agency. The separate assessment of these two criteria proposed in this paper should enable an agency to place greater
reliance on methods of statistical science to set an upper bound for disclosure potential. Nevertheless, difficulties in establishing upper bounds for harm may in practice imply that judgements about tolerance levels for the two criteria will still
require joint consideration. Thus, a higher level of disclosure potential might be tolerable if it is judged inconceivable that
any serious harm could result from a proposed release, whereas a lower tolerance level might be specified if this is not the
case. Such judgements about potential harm might be based upon the sensitivity of the data product to be released. Reports
of illegal drug use or detailed information about financial assets are examples of highly sensitive data given by National
Research Council (2005, p.71), where a lower tolerance level for disclosure potential might be judged prudent.
3 The Intruder as Decision Maker
3.1 Decision Theory
We now turn to considering the intruder’s perspective and explore his/her potential role as a decision maker. The steps which
an intruder might take are set out in Figure 3. The first step is to attempt disclosure. This involves the intruder gaining access
to the outputs, which may require overcoming obstacles, such as the completion of a licensing agreement. The intruder may
attempt disclosure in a number of ways. For example, if the output consists of a microdata file then the intruder may attempt
to use record linkage software to link a microdata record to a record on an external database of known individuals. We
may even include the possibility that a user discovers the opportunity for disclosure inadvertently by observing an unusual
feature of the output and hence only becomes interested in disclosing information after gaining access to the output, despite
having no mischievous intentions originally. ‘Methods of attack’ are discussed further in Section 4.1.
Figure 3: Potential Behaviour of an Intruder. [Diagram: Attempt at Disclosure → Inference → Actions.]
The second and third steps in Figure 3 have already been distinguished in the previous section. If the intruder takes
the inference step then he/she is able to compute predictive distributions. If the action step is taken then these distributions
are used in some way. We now focus on these two steps since they have received particular attention in the literature. In
particular, Duncan and Lambert (1986, 1989) introduced the use of decision theory (from the intruder’s perspective) to
represent these two steps. Their framework is now outlined.
Let y denote a value which the intruder wishes to disclose. As discussed in Section 2.2, y is the (target) value of a
variable (which is not publicly known) of a data subject in the case of predictive disclosure or the identity of an element of
the output in the case of identity disclosure. At the inference step, suppose that a standard Bayesian approach is adopted,
whereby the intruder’s prior distribution for y is updated using the output, to obtain a posterior predictive distribution pI (y).
The subscript I is to emphasize the dependence on the intruder and to contrast it to p(y) in Section 2.2. Now consider
the action step and let a denote a consequent action which the intruder might take, for example to claim publicly that the
target value is a specified value or to claim that the identity of an element of the output is a specified person (or other
unit). Suppose that the intruder specifies a loss function LI (y, a), representing his/her loss incurred from action a when the
true value is y. Then, the Bayesian optimal choice of a will be to minimise the expected value of this loss, i.e. minimise $\sum_y L_I(y, a)\, p_I(y)$ with respect to a. Duncan and Lambert (1989) argue that it is natural for the agency to suppose that the
intruder adopts this optimal strategy since it represents a conservative assumption.
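A minimal sketch of this intruder calculation, with an illustrative posterior, action set and loss table that are not taken from Duncan and Lambert, might look as follows.

```python
# Hypothetical sketch of the intruder's Bayesian decision: choose the action a
# minimising sum_y L_I(y, a) * p_I(y). The posterior p_I and the loss values
# are illustrative placeholders only.

# Intruder's posterior over two candidate identities for a target output record
p_I = {"person A": 0.7, "person B": 0.3}

actions = ["claim A", "claim B", "no claim"]

def loss(y, a):
    """Illustrative intruder loss: a correct public claim is rewarded
    (negative loss), a false claim is costly, no claim costs a little."""
    if a == "no claim":
        return 0.2
    claimed = "person A" if a == "claim A" else "person B"
    return -1.0 if claimed == y else 2.0

expected_loss = {a: sum(loss(y, a) * p for y, p in p_I.items()) for a in actions}
print(expected_loss)                              # 'claim A' has the lowest expected loss, about -0.1
print(min(expected_loss, key=expected_loss.get))  # 'claim A'
```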
3.2 Relation to Disclosure Potential and Harm
How does this decision theoretic formulation for the intruder relate to disclosure potential and harm as conceived in Section
2?
We contend first that the introduction of the intruder’s loss function and the consideration of expected loss is not relevant
to the consideration of disclosure potential, but only to disclosure harm. The intruder’s loss function refers to actions which,
as represented in Figure 2, only affect disclosure harm.
We distinguish the distribution p(y), which may be formulated in a publicly defensible way by the agency, and the
intruder’s distribution pI (y). The latter might be referred to as the risk of perceived disclosure (c.f. Lambert, 1993),
emphasizing its dependence on the intruder’s perspective. Lambert (1993) states that “the disclosure limitation model of
Duncan and Lambert . . . does not separate true and false disclosures, since what matters is what the intruder believes has
been disclosed” (p.315) and that “disclosure is limited only to the extent that the intruder is discouraged from making
any inferences, correct or incorrect, about a particular target respondent” (p.316). We conclude that the decision theoretic
formulation for the intruder may be relevant to those elements of the agency’s disclosure management approach (or indeed
choice of SDL methods) designed to discourage disclosure attempts and it may be relevant to the agency’s consideration
of disclosure harm, in particular since false disclosures may be harmful, but it is not relevant to the assessment of ‘true’
disclosure potential.
The decision theoretic framework of Duncan and Lambert (1986, 1989) is extended by Dobra et al. (2003). They set out
a very general framework where each of the agency, the intruder and the genuine user face decisions in relation to their own
actions and loss functions. They conceive of the intruder as operating in a similar way to Duncan and Lambert (1986, 1989),
i.e. adopting an optimal strategy based upon expected loss calculations. They then set up a framework in which the agency
takes decisions in the light of the potential intruder actions. They define an agency loss function which “quantifies, from the
agency’s perspective, the harm that the intruder’s action produces to the agency and the data providers” for a given ‘state of
the world’. They then define the ‘disclosure risk’ as the expected value of this agency loss function, where the expectation
is with respect to the agency’s posterior probability distribution about the actions of the intruder and the state of the world.
They assume that the agency knows the intruder’s target, prior and loss function. Their definition refers, however, to what
we have called disclosure harm not disclosure potential. Some further extensions of the decision theoretic framework of
Duncan and Lambert (1986, 1989) are given by Trottini (2001, 2003) and Trottini and Fienberg (2002).
A related application of decision theory to the actions of each of the agency, the intruder and the genuine user is
presented by Keller-McNulty et al. (2005). They view these actions, as in a game theory perspective, as ones where the
intruder is an adversary of the agency and the genuine user. By focussing on actions rather than inference, their approach
may also be viewed as of relevance to disclosure harm but not potential. Their implicit definition of disclosure harm is
the intruder’s expected utility which is analogous (after multiplying by -1) to the expected loss in the Duncan and Lambert
(1986) framework, although they adopt a particular approach to defining the loss function in terms of Shannon’s information
entropy. This definition of disclosure harm is analogous to a special case of the definition of Dobra et al. (2003), where
the agency assesses harm purely in terms of the intruder’s perspective, with anything that the intruder views as a gain being
viewed by the agency as a loss.
4 Measuring Disclosure Potential in terms of Predictive Distributions
We now return to the agency’s perspective and consider the conceptual basis of disclosure potential. We build on the notion
of a predictive distribution p(y) for a target y introduced in Section 2.2. Recall that this embraces not only the notion of
predictive disclosure but also that of identity disclosure. Although, as discussed in Section 3, we assume that disclosure
potential does not depend on any actions of an intruder which might follow an attempt at disclosure, we still need to consider
the possible dependence of p(y) on the way in which the intruder attempts disclosure and this is discussed in Section 4.1.
The move from the intruder’s to the agency’s perspective also raises the question of how to take account of this change of
perspective and this is discussed in Section 4.2. Some more issues in implementing definitions of disclosure potential are
presented in Section 4.3.
4.1 Dependence on Nature of Attack
The predictive distribution p(y) may depend on the agency’s statistical outputs, denoted O, the auxiliary data which the
intruder has available, denoted D, as well as on the way in which the intruder obtained these data, which we refer to as
the method of attack. Suppose that the agency is able to enumerate possible scenarios k = 1, . . . , K, each of which might
correspond to a different attack method k, and/or a different set of auxiliary data Dk , and/or a different intruder. The
predictive distribution for scenario k is then denoted p(y|O, Dk , attack method k). If the agency could attach a probability
p(Dk , attack method k|O) to each scenario k and if these scenarios were mutually exclusive, the agency could, in principle,
compute an unconditional predictive distribution
$$p(y \mid O) = \sum_k p(y \mid O, D_k, \text{attack method } k)\, P(D_k, \text{attack method } k \mid O).$$
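For illustration only, the following sketch evaluates this mixture for two hypothetical scenarios; the numbers echo the example discussed below, where a feasible attack would almost surely succeed but is judged very unlikely to be attempted.

```python
# Hypothetical sketch of the unconditional probability p(y | O) as a mixture
# over mutually exclusive scenarios, as in the displayed equation. The scenario
# probabilities and conditional values are illustrative placeholders.

scenarios = [
    # (p(y | O, D_k, attack method k), P(D_k, attack method k | O))
    (0.99, 0.001),  # a feasible attack that would almost surely succeed
    (0.02, 0.999),  # the 'no serious attack' scenario
]

p_y_given_O = sum(p_cond * p_scen for p_cond, p_scen in scenarios)
print(round(p_y_given_O, 3))  # 0.021 -- low, despite the near-certain feasible attack
```

This illustrates why the paper prefers conditioning on the scenario: the unconditional figure can look reassuringly small even when disclosure can be achieved with near certainty under a feasible attack.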
We argue, however, that it is more natural to define disclosure risk conditional on the scenario, since this seems to
correspond better to the kinds of obligations, like that for Title 13 in Section 2.2, which refer to whether disclosure can
be achieved. Suppose, for example, that only one attack method is feasible, that identity disclosure can be achieved with
probability 0.99 if this method is used, but that there is reason to suppose that the probability that this attack method will
be used is 0.001. Then we suggest that this form of output does not meet the confidentiality requirements of Title 13 (see
Section 2.2) since the probability that identity disclosure can be achieved (if an attempt is made) is high, even though the
probability that a disclosure will take place might be judged low.
An advantage of conditioning on the scenario is that it avoids the need to make probability judgments about whether
the scenario would occur. Such judgements are very difficult, given the hypothetical nature of the intruder, and seem rather
different from the kinds of probability judgements required for assessing the potential for inference. The former judgements
thus seem better considered alongside the actions related to disclosure harm. See Marsh et al. (1991) and National Research
Council (2005, p.70) for further discussion of the probability of an attack.
A related advantage of conditioning is that it removes the direct connection between disclosure potential and what was
called disclosure management in Section 2.1, enabling the tasks of the agency to be separated into assessments of:
(i) disclosure potential, under different scenarios, and how this depends on the choice of SDL method;
(ii) attempts at disclosure and disclosure harm, and how these depend upon disclosure management and (indirectly) on
SDL methods;
(iii) utility and its dependence on SDL methods and disclosure management.
Of course, specification of the scenarios of attack in (i) requires some speculation about possible intruder behaviour, but
it suffices to focus such speculation on the identification of potential auxiliary data sources (for which disclosure potential
needs to be assessed).
A consequence of conditioning on the scenario is that there may be a multitude of measures of disclosure potential,
given the large number of possible scenarios. The assessment exercise may thus be viewed as a sensitivity analysis. Frank
(1986, p.22) considers that this represents a “fundamental difficulty” and that the need to specify “predictive distributions
for all conceivable users” could be “intractable”. Some research has been undertaken on possible scenarios (e.g. Duncan et
al., 2011, Sect. 2.1.3) and such research provides a basis for specifying different scenarios in assessing disclosure potential
(e.g. Paass, 1988, Willenborg and De Waal, 2001, Sect. 2.3). In practice, it is common for an agency to specify a set of such
scenarios against which it wishes to protect and to update these in the light of new research on external data sources. Such
research is important, given the strong dependence of disclosure potential on the nature of the auxiliary data available.
Some reduction in the task of investigating all possible scenarios may, in principle, be achieved by restricting attention
to the worst case(s) (e.g. Duncan et al., 2001b). Alternatively, there may be grounds for considering only ‘reasonable’
scenarios and not the most extreme ones. In its guidance on the interpretation of the UK Code of Practice for Official
Statistics, Government Statistical Service (2009) states that “account should be taken of the ‘means likely reasonably to
be used’ to identify an individual. Thus the hypothetical but remote possibility of identification is not something that
automatically makes a statistic disclosive. The design and selection of intruder scenarios should be informed by the means
likely reasonably to be used to identify an individual in the statistic” (p.11). A related example is the use of ‘de facto
anonymisation’ of business microdata in Germany, for which scenarios are excluded from consideration if the intruder’s
“costs of trying to reidentify records in the dataset” are deemed to be “higher than the benefit gained by the disclosed
information” (Brandt et al., 2008).
In our formulation above, we have expressed the predictive distribution, p(y|O, Dk , attack method k), as dependent
not only on the outputs and the data available to the intruder but also on the attack method. It is more conventional to specify
dependence upon only the output O and the data D. See e.g. the definition of identification risk in Reiter (2005, equation
1). Although this may be reasonable in practice in many situations, Skinner (2007) shows that, in general, there may be an
additional dependence on the attack method, i.e. the attack method may be ‘non-ignorable’ in the sense that:
$$p(y \mid O, D_1 = D, \text{attack method } 1) \neq p(y \mid O, D_2 = D, \text{attack method } 2)$$
for two different attack methods, even though the auxiliary data D observed by the intruder as a result of each method of
attack may be the same. An example presented by Skinner (2007) is a comparison between:
• a directed search, where the intruder selects a known individual from the population and seeks a match in the output,
vs.
• a fishing expedition, where the intruder selects an unusual element of the output and then seeks a match in the
population.
Skinner (2007) suggests that in such cases it may be reasonable to identify and assume a realistic worst case for the attack
method, given D.
4.2 Dependence on Intruder Perspective via a Subjective Bayesian Approach
As noted in the previous section, our definition of p(y|O, Dk , attack method k) already requires consideration of the
intruder’s perspective via the auxiliary data Dk , available to the intruder, and possibly via the intruder’s attack method.
There remains the question of whether the agency should specify any further dependence of p(y|O, Dk , attack method k)
on the intruder’s perspective. Many forms of prior information, available to the intruder from their personal knowledge and
experience, may be incorporated into the auxiliary data Dk within our framework. We focus instead in this section on the
possible use of a Bayesian approach to represent the intruder’s pre-existing information or beliefs as a prior distribution
for some parameters in the model upon which p(y|O, Dk , attack method k) is based. Or, more generally, the model
itself may be interpreted in a subjective Bayesian way as reflecting the intruder’s uncertainty about y. See Fienberg et al.
(1997) and Reiter (2005) for illustration in the case of identity disclosure. Such Bayesian approaches may be contrasted
with comparable model-based frequentist approaches, as in Skinner and Shlomo (2008), which also base estimates of
p(y|O, Dk , attack method k) on a model, but do not seek to employ prior distributions to reflect the intruder perspective,
nor do they view the model as representing the intruder’s subjective perspective, but rather as an ‘objective’ model which
may be specified by the agency using data-based techniques of statistical modelling. For example, Skinner and Shlomo
(2008) propose a data-based technique for selecting a log-linear model which ‘optimises’ certain prediction properties of
the model and does not attempt to incorporate prior information.
A basic question for a subjective Bayesian approach is: what criteria should the prior distribution be required to meet
for the resulting predictive distribution of y to reflect an appropriate notion of disclosure potential? If the prior distribution
is allowed to represent any subjective beliefs of an intruder then, as Lambert (1993) discusses, it seems more appropriate
to view the predictive distribution as reflecting perceived risk, which may embrace incorrect disclosure as well as correct
disclosure. As discussed in Section 3.2, such perceived risk may be relevant to assessments of disclosure harm. However,
in our view, the kinds of obligations discussed at the beginning of Section 2.2 require disclosure potential to be defined in
terms of correct disclosure and for the predictive distribution to have an inferential basis which is defensible under public
scrutiny (in the same way that any outputs of official statistics need to be publicly defensible). We do not see that these
requirements can be guaranteed if priors are allowed to be any plausible subjective distribution for any intruder. This could
include, for example, the case of an intruder who is over-confident that an observed match is correct on the grounds that a
combination of matching variables is unique in the population, failing to appreciate the potential for non-uniqueness or for
the match to have arisen because of measurement errors or other reasons. It thus seems inappropriate for the definition of
disclosure potential to be dependent upon intruders’ unjustified prior perceptions. To answer the question at the beginning
of this paragraph, we consider that it should be possible to defend any prior distribution used in a Bayesian approach by
justifying how it leads to a predictive distribution which reflects a valid probability of correct disclosure. More specifically,
we consider that the construction of the prior distribution should be defensible from the agency’s perspective (and thus in
an inter-subjective way) on the basis of plausible assumptions about the prior information available to the intruder.
In principle, one could imagine that an agency might itself seek to elicit these priors. In practice, however, the task
of identifying plausible sources of auxiliary information, Dk , is already so challenging that it seems understandable that
agencies might place such elicitation outside the bounds of their standard procedures. This appears to be the usual case in
practice to date. In Reiter (2005), perhaps the most substantial practical application of Bayesian methods to identification
risk assessment to date, intruders’ prior distributions have very little prominence. The main value of Bayesian methodology
in current disclosure risk assessment practice seems more of a technical one, that it provides a clear conceptual way of
integrating out uncertainty about parameters in the predictive distribution p(y|O, Dk , attack method k). Empirical evidence is needed to assess whether there are non-negligible practical differences between the resulting Bayesian measures
and comparable frequentist model-based measures, as developed by Skinner and Shlomo (2008).
4.3 Implementation of Measures of Disclosure Potential
Having addressed conceptual aspects of the predictive distributions in the previous two sections, we now turn to consider
some ways in which an agency may implement measures of disclosure potential based on these distributions. This might be
viewed as the problem of ‘estimating’ these distributions. We focus on the case of identity disclosure, where y is binary, in
the context of the release of microdata. For some discussion of attribute disclosure (where y is the value of a variable for a
target unit) see Duncan and Lambert (1986).
We consider a method of attack where the intruder seeks to link a record in the microdata file to some external data source
of known units using values of some variables, which are included in both the microdata and the external source. These
variables are often called key variables and their values in the external data source define the auxiliary data D (Bethlehem
et al., 1990). A basic problem with estimating p(y|O, D, attack method) in this context is that D is hypothetical and thus
unknown. There are two established approaches to specifying D and estimating the predictive distribution:
• an empirical matching experiment - construct a surrogate file D, for which the true correspondence between the
records in D and O is known by the agency, mimic the behaviour of the intruder by using a record linkage algorithm
to match D and O and record the proportion of correct matches;
• a modelling approach - make assumptions about the nature of D (and the attack method) within a modelling framework which enables p(y|O, D, attack method) to be specified and then make inference about this distribution, given
the data available to the agency.
We consider these two approaches in turn.
The empirical matching experiment cannot be used to estimate the probability of a correct match for a specific pair of
records since all that is observed at this level is binary, either match or non-match. Hence, such an experiment will not
provide a ‘record-level’ measure of disclosure potential. Instead, the proportion of correct matches across a set of records,
possibly the whole file, provides an estimated probability, which may be treated as a ‘file-level’ or ‘subfile-level’ measure.
For the estimate to be reliable, the number of records in the set will need to be sufficiently large. However, as a result
of ‘smoothing’ across all these records, this approach may fail to identify the most ‘risky’ records. An advantage of the
empirical matching approach is that it can accommodate any matching algorithm which an intruder might use, for example
a deterministic record linkage approach (Spruill, 1982), and any SDL method and, in this sense, can avoid modelling
assumptions. In particular, the approach does not depend upon assumptions about intruder perceptions and Lambert (1993)
thus terms the empirical proportion the risk of true identification. A key practical challenge in an empirical matching
experiment is how to construct a realistic surrogate intruder dataset, which allows for the disclosure protection arising from
sampling and measurement differences between sources and for which there is some overlap of units with the microdata and
the nature of this overlap is known. Sometimes there may be a suitable alternative data source (e.g. Blien et al., 1992) or
a different survey undertaken by the agency, although agencies often control sampling to avoid such overlap. Even if there
is overlap, determining which units are in common may be resource intensive, discouraging routine use of this approach.
In the absence of another dataset, the agency may consider a ’re-identification’ experiment, in which the microdata file is
matched against itself, normally after the application of some SDL method (Paass and Wauschkuhn, 1985, Paass, 1988 and
Winkler, 2004).
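The following sketch illustrates the logic of such an experiment in miniature, with made-up key variables, records and a naive unique-exact-match rule standing in for real record linkage software; the quantity reported is the file-level proportion of correct matches described above.

```python
# Hypothetical sketch of an empirical matching experiment: match a surrogate
# intruder file against the released microdata on the key variables and record
# the proportion of claimed matches that are correct. Field values and the
# simple exact-matching rule are illustrative; a real experiment would normally
# use probabilistic record linkage software.

# Released microdata: (record id, key-variable values), already SDL-treated
microdata = [
    ("r1", ("35-39", "F", "NW")),
    ("r2", ("35-39", "M", "NW")),
    ("r3", ("60-64", "F", "SE")),
]

# Surrogate intruder file: (known unit, key values, true microdata record),
# the last item being known to the agency because it built the surrogate file
surrogate = [
    ("Alice", ("35-39", "F", "NW"), "r1"),
    ("Bob",   ("35-39", "M", "NW"), None),   # not actually in the microdata
    ("Carol", ("60-64", "F", "SE"), "r3"),
]

claimed = correct = 0
for unit, keys, true_record in surrogate:
    candidates = [rid for rid, mkeys in microdata if mkeys == keys]
    if len(candidates) == 1:              # intruder claims the unique exact match
        claimed += 1
        correct += candidates[0] == true_record

print(correct / claimed)  # file-level proportion of correct matches: 2/3 here
```

A real experiment would replace the exact-match rule with the linkage algorithms cited above and would face the surrogate-file construction problems just described.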
The modelling approach may be formulated in the same conceptual framework as the empirical matching experiment,
but seeks to obtain expressions for the predictive distributions via theoretical arguments, under assumptions about the nature
of D and the attack method. Measures of disclosure potential associated with these expressions may then be estimated from
the microdata. A practical disadvantage of this approach, compared to the empirical matching approach, is that it may
not be theoretically straightforward to accommodate any specific matching algorithm which an intruder might use. Instead,
approximating assumptions might be made. These may have the benefit of providing simpler or more interpretable measures
of disclosure potential.
An advantage of the modelling approach is that it permits the estimation of record-level measures of identifiability. A
concern with file-level measures is that the principles governing confidentiality protection often seek to avoid the identification of any individual, that is, require the probability to be below a threshold for each record, and such aims may not
adequately be addressed by file-level measures. In contrast, record level measures, which take different values for each
microdata record, may help identify those parts of the sample where disclosure potential is high and more protection is
needed and may be aggregated to a file level measure in different ways if desired (Lambert, 1993).
Model-based expressions for predictive distributions have tended to be studied separately for continuous and categorical
key variables. We shall illustrate some points about predictive distributions in the categorical case. For continuous key
variables, where random noise is added to the values of the key variables appearing in O, see e.g. Fuller (1983) who
derives expressions for the relevant predictive distributions and discusses their estimation. Paass and Wauschkuhn (1985)
and Paass (1988) discuss a general approach where identification is viewed as a classification problem and discriminant
analysis techniques are used.
Turning to the case when the key variables are categorical, suppose initially that they are recorded in the same way in D
and O and that no SDL method is applied. In this case, a simple expression for the probability that an observed exact match
between records in the two sources is correct is 1/F , where F is the number of units in the population which share the same
key variable values as the two matching records (Duncan and Lambert, 1989; Skinner and Holmes, 1998). Skinner (2008)
derives this expression under the assumption that the intruder employs a probabilistic record linkage method, of the type
considered by Fellegi and Sunter (1969). The expression 1/F assumes that there is only a unique matching record in the
microdata and that certain exchangeability assumptions hold about the F matching units. More importantly, it assumes that
F is part of D and thus included in the conditioning set in p(y|O, D, attack method). We contend that this conditioning set
should consist of the information assumed to be available to the intruder (not the information available to the agency) and,
thus, whether F should be part of the conditioning set depends on whether it is reasonable to suppose that the intruder knows
F . In many realistic settings, it may be assumed that F is unknown to the intruder. In this case, p(y|O, D, attack method)
may be expressed as E(1/F |O, D, attack method), where F is ’integrated out’ using its conditional distribution given what
the intruder observes. Skinner and Holmes (1998) and Elamir and Skinner (2006) provide expressions for this probability
assuming the key variables obey certain log-linear models and discuss how this probability may be estimated given only
the sample microdata. Skinner and Shlomo (2008) discuss the specification of these models. The estimated probabilities
may be viewed as record-level measures of disclosure potential. A simpler approach is obtained by assuming that the
match observed by the intruder is obtained randomly (with equal probability) from all units in the population which match
a microdata record which has a unique combination of key variable values in the sample. The resulting probability can
be expressed as 1/F̄, where F̄ is the mean value of F across all combinations of key variable values which are
unique in the sample. This measure may be estimated simply from sample microdata (Skinner and Elliot, 2002; Skinner
and Carter, 2003) and may be treated as a file level measure. Some related file-level measures, such as the proportion of
‘sample uniques’ that are ‘population unique’, are discussed by Bethlehem et al. (1990) and Fienberg and Makov (1998).
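As a simple illustration of the record-level 1/F and file-level 1/F̄ measures, the sketch below assumes, hypothetically, that the population frequencies are known (e.g. from a census); in practice they would be estimated from the sample as in the references above.

```python
# Hypothetical sketch of the measures discussed above for categorical key
# variables: 1/F at the record level for sample-unique records, and the
# file-level measure 1/F-bar, where F-bar is the mean population frequency F
# over sample-unique key combinations. All counts are illustrative, and the
# population frequencies are assumed known rather than estimated.

from collections import Counter

# Key-variable combinations observed in the sample microdata
sample_keys = [("35-39", "F", "NW"), ("35-39", "F", "NW"),
               ("60-64", "F", "SE"), ("80-84", "M", "NE")]

# Assumed population frequencies F for each observed combination
population_F = {("35-39", "F", "NW"): 120,
                ("60-64", "F", "SE"): 4,
                ("80-84", "M", "NE"): 1}

sample_counts = Counter(sample_keys)
sample_unique = [k for k, n in sample_counts.items() if n == 1]

record_level = {k: 1 / population_F[k] for k in sample_unique}  # 1/F per record
F_bar = sum(population_F[k] for k in sample_unique) / len(sample_unique)

print(record_level)  # {('60-64', 'F', 'SE'): 0.25, ('80-84', 'M', 'NE'): 1.0}
print(1 / F_bar)     # 0.4 -- the file-level measure 1/F-bar
```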
Model-based assessment can become more complex when the output O has been subject to SDL methods. Reiter
(2005) demonstrates how measures of identifiability can be obtained for a variety of SDL methods, including recoding, data
swapping and adding random noise. Shlomo and Skinner (2010) consider the use of misclassification-based SDL methods.
5 Further Consideration of Disclosure Harm and Intruder Behaviour
We have argued for separating the assessment of disclosure potential from the assessment of disclosure harm with a view
to focussing the role of statistical science in the former task. Now, we make some remarks about the latter task. From the
agency’s decision taking perspective, the key question is how to reduce disclosure harm through disclosure management
approaches together with the SDL methods.
We conceive of disclosure harm as the expected value of the loss incurred from the consequences of intruder behaviour.
Assessing disclosure harm thus requires the assessment of three components:
• the nature of potential intruder behaviour and its consequences;
• the uncertainty about intruder behaviour and the consequences;
• the loss incurred from these consequences.
Although it is clearly feasible for Bayesian decision theory to have a place in modelling intruder behaviour, as discussed
in Section 3.2, we suggest that the scientific assessment of potential intruder behaviour in the face of alternative disclosure
management (and SDL) approaches is more a matter for social rather than statistical science. Assessing which kinds of
people might attempt disclosure and the effects of approaches such as user training on the probability of a user attempting
disclosure seem social science questions and may be addressed by empirical experiments. For example, O’Hara et al. (2011)
describe an experiment where intruder behaviour was mimicked by recruiting postgraduate students who, like hackers in the
real world, lacked knowledge of large-scale data handling and the SDL literature but had good computing skills and were
driven by the aim of identifying subjects or disclosing further information about them in an anonymized microdata source.
The experiment provided the agency with a better understanding of the kinds of attacks which intruders might employ and
the kinds of threats arising from such behaviour.
The assessment of uncertainty about potential intruder attacks and behaviour seems a very different task to the use of
statistical inference to assess disclosure potential. Systematic modelling or assessment of such uncertainty seems more
related to social science and risk management.
Valuing the loss which would be incurred from specified consequences of intruder behaviour seems more a matter for
policy judgement. Social science may inform this judgement, for example through research into respondents’ perspectives and
concerns about confidentiality. Different respondents and associated stakeholder groups, such as privacy organisations, do
not all share a common perspective, and dealing with these differences is a policy issue. Handling uncertainty about potentially harmful outcomes is unlikely to be simply a matter of considering expected outcomes. Most agencies will also wish to take
account of public perceptions of potential harm, in particular since these may adversely affect participation in surveys run
by the agency (Singer, 2004; Couper et al., 2008, 2010). There are difficult challenges too in taking account of potential
changes in public perceptions over time, in particular since the intruder behaviour and its consequences may occur well
after the agency makes its release decision. The potential for sudden changes in public concerns about confidentiality was
illustrated by the intense media coverage of losses of data discs by government in the United Kingdom (Hand, 2008). Issues
in the management of public perceptions of the agency may also arise, but these are not ones of technical statistical science.
6 Implications for Agency Practice
The conceptual framework in this paper may be used by an agency to structure its evaluation of SDL and disclosure management options. This framework takes account of the different kinds of expertise which staff undertaking the evaluation need or which may be obtained through consultation with individuals and bodies outside the agency. The restriction of a broader notion of disclosure risk to the narrower one of disclosure potential is designed to enable it to be assessed by agency statisticians alone, using methods of statistical science, by excluding consideration of matters such as intruder behaviour, public perceptions of disclosure risk and false disclosure, which are relevant to the separate criterion of disclosure harm.
The assessment of this latter criterion needs more multidisciplinary input, from social scientists, stakeholders representing
respondents and policy makers (see Section 5). Assessment of utility, our third criterion, requires input from the users of the
outputs. Overall decision making requires further policy input, such as through a microdata review panel, to take account
of trade-offs between the three criteria.
The evaluation of disclosure potential might be divided into three kinds of tasks. First, there is a need for ongoing
assessment and updating of plausible scenarios and associated sources of auxiliary information which intruders might use.
This is necessary background information for any assessment of disclosure potential.
Second, evaluation will be required for routine decisions about release. In this case, it is appealing in practice for rules
to be simple, transparent and objective. Some discussion of two types of approach to assessing disclosure potential was
given in Section 4.3 for the case of identity disclosure in microdata: (i) empirical matching experiments have the advantage that they can handle a wide range of SDL methods, although they may be somewhat elaborate for routine use; (ii) modelling approaches can provide the rationale for simpler measures, provided the SDL method can be handled tractably. Thresholds
for levels of disclosure potential will need to be established from broader policy considerations. Harm considerations may
lead to different thresholds as well as different disclosure management procedures for different kinds of survey variable.
For example, health variables might be subject to more stringent rules.
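To make approach (i) concrete, the following hedged sketch mimics an empirical matching experiment: a released file, perturbed here by a simple misclassification of one key variable, is linked back to an identification file on the key variables, and the number and accuracy of unique links are recorded. The file structure, variables and perturbation rate are all hypothetical, and real experiments would be tailored to the intruder scenario and to the SDL method actually used.

# Hypothetical empirical matching experiment (approach (i)): link a perturbed
# released file back to an identification file on key variables and score the
# links. person_id is retained in the released file only to score the
# experiment; an intruder would not observe it.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
N = 10_000
keys = ["age_band", "region", "occupation"]

# Identification file: key variables plus an identifier known to the intruder.
ident = pd.DataFrame({
    "person_id":  np.arange(N),
    "age_band":   rng.integers(0, 15, N),
    "region":     rng.integers(0, 12, N),
    "occupation": rng.integers(0, 30, N),
})

# Released file: a sample of the same population after a simple perturbation
# (age_band misclassified for 10% of records).
released = ident.sample(n=1_000, random_state=3).copy()
flip = rng.random(len(released)) < 0.10
released.loc[flip, "age_band"] = rng.integers(0, 15, int(flip.sum()))

# Exact matching on the key variables, keeping only released records with a
# single candidate match in the identification file.
links = released.merge(ident, on=keys, suffixes=("_rel", "_id"))
unique_links = links.groupby("person_id_rel").filter(lambda g: len(g) == 1)
prop_correct = (unique_links["person_id_rel"] == unique_links["person_id_id"]).mean()
print(len(unique_links), prop_correct)   # number of unique links, share correct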
Third, detailed assessments may be undertaken as part of occasional in-depth exercises designed to choose between
alternative major types of SDL approach or to validate simple disclosure protection rules. We proposed in Section 4 that
disclosure potential be measured in terms of probabilistic prediction, whether of identity or unknown attributes, given the
observable statistical outputs and hypothetical auxiliary data. Simple measures of disclosure potential might therefore be
validated through simulation studies calibrating the probabilistic measures against their empirical performance.
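As a hedged illustration of such a validation exercise (a sketch only, not a reproduction of any published estimator), the fragment below draws repeated samples from a synthetic population, estimates the file-level measure from each sample alone using a simple approximation, and compares the estimates with the true value of 1/F̄ computed from the known population. The population model, key variables and sampling fraction are all hypothetical.

# Hypothetical validation simulation: calibrate a sample-based estimate of the
# file-level measure 1/F-bar against its true value in a synthetic population.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
N, n, n_reps = 50_000, 2_000, 50
pi = n / N                                 # sampling fraction
keys = ["age_band", "region", "occupation"]

population = pd.DataFrame({
    "age_band":   rng.integers(0, 15, N),
    "region":     rng.integers(0, 12, N),
    "occupation": rng.integers(0, 30, N),
})
F = population.groupby(keys).size()        # true population frequencies

rows = []
for rep in range(n_reps):
    sample = population.sample(n=n, random_state=rep)
    f = sample.groupby(keys).size()
    n1, n2 = int((f == 1).sum()), int((f == 2).sum())

    # True 1/F-bar, available only because the population is synthetic.
    true_value = 1.0 / F.loc[f[f == 1].index].mean()

    # Sample-only estimate: approximate the sum of F over sample uniques by
    # n1 + 2 * ((1 - pi) / pi) * n2, so that no population counts are needed.
    estimate = n1 / (n1 + 2 * ((1 - pi) / pi) * n2)

    rows.append((true_value, estimate))

summary = pd.DataFrame(rows, columns=["true", "estimated"])
print(summary.describe())                  # how closely the estimate tracks the truth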
7 Conclusions
Statistical disclosure limitation is well established in statistical science as a body of theory and methods and remains the
subject of active research. Although there is no shortage of SDL methods which have found application by government
statistical agencies, a common scientific methodology for assessing disclosure risk and making decisions based upon these
assessments has found less ready adoption in agency practice. This paper has addressed the foundations of this topic with
the aim of clarifying what statistical science can contribute to such decision making and what it cannot.
We have argued for the assessment of disclosure risk to be separated into the assessments of disclosure potential and
disclosure harm, enabling the use of methods of statistical science to be focussed on the former task. Whilst we have
recognized a role for statistical decision theory, we have sought to remove intruder behaviour from its ambit, viewing that
as of more relevance to what we have called disclosure management. Nevertheless, in our more detailed consideration of
disclosure potential, we have discussed how it may depend on the nature of potential intruder attacks.
Our discussion of how to assess disclosure potential in practice has been limited to a prototypical set-up and there is
certainly much scope for further research, as recommended by National Research Council (2005, p.72), taking account of
the different kinds of SDL methods finding application under evolving modes of access.
Acknowledgements
I am grateful to Natalie Shlomo and two reviewers for comments. Research was supported by the Economic and Social
Research Council.
References
Abowd, J.M., Nissim, K. & Skinner, C. (2009). First issue editorial. J. Privacy Confident., 1, 1-6.
Bethlehem, J.G., Keller, W.J. & Pannekoek, J. (1990). Disclosure control for microdata. J. Amer. Statist. Assoc., 85, 38-45.
Blien, U., Wirth, H. & Müller, M. (1992). Disclosure risk for microdata stemming from official statistics. Statist. Neerland.,
46, 69-82.
Brandt, M., Lenz, R. & Rosemann, M. (2008). Anonymisation of panel enterprise microdata - survey of a German project.
In Privacy In Statistical Databases, Lecture Notes In Computer Science 5262, Eds. J. Domingo-Ferrer & Y. Saygin, pp.
139-151. Berlin: Springer.
Couper, M.P., Singer, E., Conrad, F.G. & Groves, R.M. (2008). Risk of disclosure, perceptions of risk and concerns about
privacy and confidentiality as factors in survey participation. J. Official Statist., 24, 255-275.
Couper, M.P., Singer, E., Conrad, F.G. & Groves, R.M. (2010). Experimental studies of disclosure risk, disclosure harm,
topic sensitivity, and survey participation. J. Official Statist., 26, 287-300.
Cox, L.H. (2001). Disclosure risk for tabular economic data. In Confidentiality, Disclosure And Data Access: Theory And
Practical Applications For Statistical Agencies, Eds. P. Doyle, J.I. Lane, J.J.M. Theeuwes & L.V. Zayatz, pp. 167-183.
Amsterdam: Elsevier.
Cox, L.H., Karr, A.F. & Kinney, S.K. (2011). Risk-utility paradigms for statistical disclosure limitation: how to think, but
not how to act (with discussion). Int. Statist. Rev., 79, 160-183.
Dalenius, T. (1977). Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, 429-444.
DeGroot, M.H. (1970). Optimal Statistical Decisions. New York: Wiley.
Dobra, A., Fienberg, S.E. & Trottini, M. (2003). Assessing the risk of disclosure of confidential categorical data. In Bayesian
Statistics 7, Proceedings Of The Seventh Valencia International Meeting On Bayesian Statistics, Eds. J.M. Bernardo, M.J.
Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith & M. West, pp. 125-144. Oxford University Press.
Doyle, P., Lane, J.I., Theeuwes, J.J.M. & Zayatz, L.V. (2001). Confidentiality, Disclosure And Data Access: Theory And
Practical Applications For Statistical Agencies. Amsterdam: Elsevier.
Duncan, G.T., Elliot, M. & Salazar-González, J.-J. (2011). Statistical Confidentiality. New York: Springer.
Duncan, G.T., Keller-McNulty, S.A. & Stokes, S.L. (2001b). Disclosure risk vs. data utility: the R-U confidentiality map.
Technical Report No. 121, National Institute Of Statistical Sciences, North Carolina.
Duncan, G. & Lambert, D. (1986). Disclosure-limited data dissemination (with discussion). J. Amer. Statist. Assoc., 81,
10-28.
Duncan, G. & Lambert, D. (1989). The risk of disclosure for microdata. J. Bus. Econ. Statist., 7, 207-217.
Dwork, C., McSherry, F., Nissim, K. & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory Of Cryptography, Lecture Notes In Computer Science 3876, pp. 265-284. Berlin: Springer.
Elamir, E.A.H. & Skinner, C.J. (2006). Record level measures of disclosure risk for survey microdata. J. Official Statist.,
22, 525-539.
Elliot, M.J. & Dale, A. (1999). Scenarios of attack: the data intruder’s perspective on statistical disclosure risk. Netherlands
Official Statist., 14, 6-10.
Fellegi, I.P. (1972). On the question of statistical confidentiality. J. Amer. Statist. Assoc., 67, 7-18.
Fellegi, I.P. & Sunter, A.B. (1969). A theory for record linkage. J. Amer. Statist. Assoc., 64, 1183-1210.
Fienberg, S.E. (2006). Privacy and confidentiality in an e-commerce world: data mining, data warehousing, matching and
disclosure limitation. Statist. Sci., 21, 143-154.
Fienberg, S.E. & Makov, U.E. (1998). Confidentiality, uniqueness and disclosure limitation for categorical data. J. Official
Statist., 14, 385-397.
Fienberg, S.E., Makov, U.E. & Sanil, A.P. (1997). A Bayesian approach to data disclosure: optimal intruder behavior for
continuous data. J. Official Statist., 13, 75-89.
Frank, O. (1986). Comment on “Disclosure-limited data dissemination” by G. Duncan & D. Lambert. J. Amer. Statist.
Assoc., 81, 21-22.
French, S. (1988). Decision Theory: an Introduction to the Mathematics of Rationality. Chichester: Ellis Horwood.
Fuller, W. (1993). Masking procedures for microdata disclosure limitation. J. Official Statist., 9, 383-406.
Gomatam, S., Karr, A.F., Reiter, J.P. & Sanil, A.P. (2005a). Data dissemination and disclosure limitation in a world without
microdata: a risk-utility framework for remote access analysis servers. Statist. Sci., 20, 163-177.
Gomatam, S., Karr, A.F. & Sanil, A.P. (2005b). Data swapping as a decision problem. J. Official Statist., 21, 635-655.
Government Statistical Service (2009). National Statistician’s Guidance: Confidentiality Of Official Statistics. Office For
National Statistics, UK.
Hand, D. J. (2008). Privacy, data discs and realistic risk. Significance, 5, 11-14.
Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P. & Sanil, A.P. (2006). A framework for evaluating the utility of data
altered to protect confidentiality. Amer. Statistician, 60, 224-232.
Karr, A.F., Lin, X., Sanil, A.P. & Reiter, J.P. (2005). Secure regressions on distributed databases. J. Comput. Graphical
Statist., 14, 263-279.
Keeney, R.L. & Raiffa, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley.
Keller-McNulty, S., Nakhleh, C.W. & Singpurwalla, N.D. (2005). A paradigm for masking (camouflaging) information. Int.
Statist. Rev., 73, 331-349.
Lambert, D. (1993). Measures of disclosure risk and harm. J. Official Statist., 9, 313-331.
Marsh, C., Skinner, C., Arber, S., Penhale, B., Openshaw, S., Hobcraft, J., Lievesley, D. & Walford, N. (1991). The case for
samples of anonymized records from the 1991 census. J. Roy. Statist. Soc. Ser. A, 154, 305-340.
National Research Council (2005). Expanding Access To Research Data: Reconciling Risks And Opportunities. Panel On
Data Access For Research Purposes, Committee On National Statistics. Washington DC: The National Academies Press.
National Research Council (2007). Putting People On The Map: Protecting Confidentiality With Linked Social-Spatial
Data. Panel On Confidentiality Issues Arising From The Integration Of Remotely Sensed And Self-Identifying Data,
Eds. M.P. Guttmann & P.C. Stern. Washington DC: The National Academies Press.
O’Hara, K., Whitley, E. & Whittall, P. (2011). Avoiding the jigsaw effect: experiences with Ministry of Justice reoffending data. Technical Report, Electronics and Computer Science, University of Southampton.
Paass, G. (1988). Disclosure risk and disclosure avoidance for microdata. J. Bus. Econ. Statist., 6, 487-500.
Paass, G. & Wauschkuhn, U. (1985). Datenzugang, Datenschutz und Anonymisierung-Analysepotential und Identifizierbarkeit von Anonymisierten Individualdaten. Munich: Oldenbourg Verlag.
Reiter, J. (2005). Estimating risks of identification disclosure in microdata. J. Amer. Statist. Assoc., 100, 1103-1112.
Reiter, J. (2009). Using multiple imputation to integrate and disseminate confidential microdata. Int. Statist. Rev., 77, 179-195.
Shlomo, N. & Skinner, C. (2010). Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. Ann. Applied Statist., 4, 1291-1310.
Singer, E. (2004). Principles and practices related to scientific integrity. In Survey Methodology, R.M. Groves, F.J. Fowler, M.P. Couper, J.M. Lepkowski & E. Singer. New York: Wiley.
Skinner, C.J. (2007). The probability of identification: applying ideas from forensic science to disclosure risk assessment.
J. Roy. Statist. Soc. Ser. A, 170, 195-212.
Skinner, C.J. (2008). Assessing disclosure risk for record linkage. In Privacy In Statistical Databases, Lecture Notes In
Computer Science 5262, Eds. J. Domingo-Ferrer & Y. Saygin, pp. 166-176. Berlin: Springer.
Skinner, C.J. & Carter, R.G. (2003). Estimation of a measure of disclosure risk for survey microdata under unequal probability sampling. Survey Methodology, 29, 177-180.
Skinner, C.J. & Elliot, M.J. (2002). A measure of disclosure risk for microdata. J. Roy. Statist. Soc., Ser. B, 64, 855-867.
Skinner, C.J. & Holmes, D.J. (1998). Estimating the re-identification risk per record in microdata. J. Official Statist., 14,
361-372.
Skinner, C.J. & Shlomo, N. (2008). Assessing disclosure risk in survey microdata using log-linear models. J. Amer. Statist.
Assoc., 103, 989-1001.
Spruill, N.L. (1982). Measures of confidentiality. Proc. Surv. Res. Sect. Amer. Statist. Ass., 260-265.
Trottini, M. (2001). A decision-theoretic approach to data disclosure problems. Res. Official Statist., 4, 7-22.
Trottini, M. (2003). Decision Models For Data Disclosure Limitation. PhD Thesis, Department Of Statistics, Carnegie Mellon University.
Trottini, M. & Fienberg, S.E. (2002). Modelling user uncertainty for disclosure risk and data utility. Intern. J. Uncertainty,
Fuzziness and Knowledge-Based Systems, 10, 511-527.
Vaughan, E.J. (1997). Risk Management. New York: Wiley.
Wasserman, L. & Zhou, S. (2010). A statistical framework for differential privacy. J. Amer. Statist. Assoc., 105, 375-389.
Willenborg, L. & De Waal, T. (2001). Elements Of Statistical Disclosure Control. New York: Springer.
Winkler, W.E. (2004). Masking and re-identification methods for public use microdata: overview and research problems.
In Privacy In Statistical Databases, Lecture Notes In Computer Science 3050, Eds. J. Domingo-Ferrer & V. Torra, pp.
231-246. Berlin: Springer.
Woo, M., Reiter, J.P., Oganian, A. & Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure
limitation. J. Privacy Confident., 1, 111-124.
Zaslavsky, A. M. & Horton, N. J. (1998). Balancing disclosure risk against the loss of nonpublication. J. Official Statist.,
14, 411-419.