Statistical Disclosure Risk: Separating Potential and Harm

Chris Skinner
Department of Statistics, London School of Economics and Political Science, Houghton Street, London WC2A 2AE, UK. Email: [email protected]

Summary

Statistical agencies are keen to devise ways to provide research access to data while protecting confidentiality. Although methods of statistical disclosure risk assessment are now well established in the statistical science literature, the integration of these methods by agencies into a general scientific basis for their practice still proves difficult. This paper seeks to review and clarify the role of statistical science in the conceptual foundations of disclosure risk assessment in an agency’s decision making. Disclosure risk is broken down into disclosure potential, a measure of the ability to achieve true disclosure, and disclosure harm. It is argued that statistical science is most suited to assessing the former. A framework for this assessment is presented. The paper argues that the intruder’s decision making and behaviour may be separated from this framework, provided appropriate account is taken of the nature of potential intruder attacks in the definition of disclosure potential.

Key words: Confidentiality; disclosure limitation; identification; intruder.

1 Introduction

The challenge of devising ways to provide researchers with access to microdata and other statistical outputs while protecting confidentiality continues to be the subject of intense interest and development around the world. Given the wide variety of contexts in which this challenge arises, it seems unlikely that these developments will converge on any single solution.
National Research Council (2005, p.68) recommends that “data produced or funded by government agencies should continue to be made available for research through a variety of modes, including various modes of restricted access to confidential data and unrestricted access to public-use data altered in a variety of ways to maintain confidentiality”.

The prototypical set-up is depicted in Figure 1. An agency undertakes a survey (or similar exercise) in which data are collected from data subjects. The agency has a remit to use these data to produce outputs which will serve the statistical needs of the agency’s users. The agency faces the potential problem of statistical disclosure, however: one of the users may be able to use the outputs to disclose information about one of the data subjects. A (hypothetical) user who ‘misuses’ the outputs in this way is called an intruder.

To address this problem, the agency needs to decide what mode of access to employ and, given this mode of access, what statistical disclosure limitation (SDL) methods to employ in the specification of the statistical outputs. We include in the scope of SDL not only methods which modify data by restriction of detail, perturbation (Willenborg and De Waal, 2001, Ch. 4, 5) or conversion into synthetic data (Reiter, 2009), but also methods employed to restrict the outputs which a researcher can obtain from a remote access analysis server (Gomatam et al., 2005a) or can take out from a research data centre (National Research Council, 2005, pp. 29-31).

Experience with different modes of access can lead to consensus on ‘best practice’ for some features of these approaches. Nevertheless, it is widely recognized that questions about when confidentiality is adequately protected by an SDL method and how to come to that judgement remain hard ones to answer across most modes of access. The decision problem is often posed in a ‘risk-utility’ framework (Duncan et al., 2011, Ch. 6).
The extent to which confidentiality is protected is assessed by measure(s) of what is called statistical disclosure risk. The extent to which the outputs satisfy the statistical purposes for which they are produced is assessed by measure(s) of utility. An agency has to trade off these two criteria. The fact that attempts to reduce disclosure risk often also lead to a decrease in utility is described by Doyle et al. (2001, p.1) as “a fundamental tension at the heart of every statistical agency’s mission”.

Research into the assessment of disclosure risk has a long history in statistical science, going back at least to the 1970s, e.g. Fellegi (1972) and Dalenius (1977). It also remains an active current area of research, e.g. at the interface with computer science (Dwork et al., 2006; Abowd et al., 2009; Wasserman and Zhou, 2010).

[Figure 1: The Different Parties in the Prototypical Set-up where Statistical Disclosure may arise. Elements shown: Data Subjects, Data, Agency, Outputs, Users, Intruder.]

Why then does it remain difficult to use this body of work to support decisions about statistical disclosure in practice? Cox et al. (2011) consider that the risk-utility framework does provide a common language for thinking about confidentiality but they argue, through examples, that it is much less useful in practice and conclude that “today there is not a science of data confidentiality”.

In this paper it will be argued that there is a place for statistical science in disclosure risk assessment, but that it is important to separate out what can be achieved by statistical science and what aspects of decision making require other inputs, such as policy judgements. To help obtain this separation, we propose to break down the notion of disclosure risk. The term ‘risk’ has many meanings across scientific disciplines and we take as our starting point its use in risk management (Vaughan, 1997) as an uncertain event with potentially adverse consequences.
The risk of statistical disclosure then has two dimensions: the probability of statistical disclosure and a measure of the harmful consequences which may ensue if statistical disclosure occurs. Consideration of both dimensions may be found in the statistical literature. Much attention is devoted to the probability of disclosure (e.g. Reiter, 2005). A number of papers (see end of Section 2.2) suggest, however, that disclosure risk should also embrace the notion of harm. A difficulty in formulating a definition for this two-dimensional representation of disclosure risk, however, is to decide whether statistical disclosure should refer to true disclosure or, more broadly, to any kind of attack and claimed disclosure, whether true or not. We argue in Section 2.2 that, in the context of many official confidentiality statements, it is appropriate that the probability of disclosure refers to true disclosure. On the other hand, we consider that when an agency wishes to assess and manage the potential harm from threats to confidentiality it is appropriate to adopt the broader definition.

In summary, we propose to break down the notion of disclosure risk into two separate notions: a measure summarising the uncertainty about whether true disclosure occurs, which we shall refer to as disclosure potential (or identifiability when disclosure is defined as identification) and a measure of the adverse consequences of potential attempts at disclosure (whether successful or not), which we shall refer to as disclosure harm. We shall argue that statistical science is primarily suited to assessing disclosure potential but not disclosure harm.

A further source of complexity in developing methodology for decision making about disclosure is the existence of the multiple parties in Figure 1, each of which may make decisions.
We shall take the agency’s decisions as the central ones which statistical science is needed to support, but we shall seek to clarify the link between these decisions and the intruder’s perspective. Fienberg et al. (1997, p.75) argue that “a data collection agency must consider disclosure from the perspective of an intruder in order to efficiently evaluate data disclosure limitation procedures”. We shall review this idea.

The aim of the paper is then to review, explore and clarify the conceptual basis and logic of decisions by a statistical agency relating to disclosure risk. In particular, we seek to clarify the role of statistical science in the agency’s decision making and, by separation of tasks, to limit the number of non-statistical assumptions (about matters such as intruder judgements and behaviour) required for disclosure risk assessment. The paper is intended to support the work of statistical agencies in two ways: first, by offering alternative ways of providing transparent and defensible bases of SDL decisions and second, by facilitating the division of labour between statistical methodologists and others, such as policy analysts, within a statistical agency regarding the preparation of evidence to support SDL decisions.

Given the breadth of the field, we first set some limits to the scope of the paper:
• we recognize, as discussed by National Research Council (2005, pp.
55-59), that confidentiality breaches may occur for reasons of carelessness, illegal intrusions, law enforcement and national security as well as statistical disclosure, but we restrict attention to the latter;
• we focus on broader conceptual issues and not the detail of specific SDL techniques;
• we shall not discuss utility in any detail;
• while decision theoretic ideas underlie parts of the development, we shall not attempt any detailed formalization;
• we do not seek to provide a comprehensive review of the literature; rather we aim to identify and discuss what are judged to be the key ideas;
• we refer to an agency, with government statistical agencies in mind, since these have been the main drivers of developments in SDL methodology, but our discussion should be relevant to other kinds of organisation with responsibilities both to disseminate statistical data products to multiple users and to protect the confidentiality of the sources of the data;
• we restrict attention to the prototypical set-up for a single agency, outlined in this section, which provides a basis for discussing key conceptual issues in SDL methodology, and do not consider e.g. secure multi-party computation (Karr et al., 2005; Fienberg, 2006) for distributed databases across several agencies;
• we use the term survey to denote the source of the data, but most of what we say will also be relevant to other sources such as a census or an administrative source.

The paper is organised as follows. We contrast the agency’s and the intruder’s perspectives in Sections 2 and 3, starting by considering the agency as a decision maker, which is our principal interest. We then proceed to consider the notion of disclosure risk in more detail in Section 4. We reflect further on the role of harm in Section 5 and provide some conclusions in Sections 6 and 7.

2 The Agency as Decision Maker

In deciding what actions to take to protect against statistical disclosure, the agency needs to consider:
• the nature of the alternative actions;
• the loss to the agency arising from the consequences of its actions.

Zaslavsky and Horton (1998) provide an example, where the agency needs to decide between two actions, whether to publish an output or not, and the loss to the agency as a result of these actions is defined in terms of the threat of disclosure, if the agency decides to publish, or the loss of information, if the agency takes the alternative action. We discuss the nature of the actions and the loss in Sections 2.1 and 2.2, respectively. We shall return to outlining the decision framework in general terms in Section 2.3.

2.1 The Agency’s Actions

We distinguish two kinds of actions which an agency may take when confronted by potential threats to confidentiality:
• the use of SDL methods as part of the process of determining the statistical outputs: examples include techniques which transform or perturb outputs, such as recoding, sampling or creation of missing values, adding noise, swapping values, replacing microdata by synthetic data;
• additional disclosure management actions to protect against and discourage disclosure attempts and the misuse of outputs: examples include access controls, where users must sign up to certain conditions in a licence agreement, which may include penalties for misuse; or training of users in why confidentiality needs to be protected.

See e.g. Willenborg and De Waal (2001) and Doyle et al. (2001) for references to the statistical literature on SDL methods and National Research Council (2005, Ch. 2, 5) for broader issues of disclosure management.

2.2 The Loss to the Agency

In order to decide which action to take in the light of possible threats, the agency needs to be able to evaluate the potential consequences of its decisions.
In this section, we discuss the nature of the principal criteria relevant to this evaluation and refer to these as loss criteria. The need for multiple criteria makes the decision a complex value problem in the terminology of Keeney and Raiffa (1976, Sect. 1.4.1). We address the question of how to accommodate these multiple criteria when making decisions in Section 2.3.

It is natural for the agency’s obligations to protect confidentiality to provide the basis of key loss criteria. These obligations might be legal, professional or ethical. For example, it is stated in the United States Code Title 13 that the Census Bureau is prohibited from producing outputs “whereby the data furnished by any particular establishment or individual under this title can be identified”. Such obligations are frequently expressed in terms of the ability of anyone with access to the outputs to use them to achieve disclosure or breach confidentiality. We conceive of this ability in terms of the potential to infer the value of a target y, embracing the possibilities of:
• predictive disclosure: where y is the value of a survey variable for a particular known individual or other unit;
• identity disclosure (also called identification): where y is the value of a binary variable indicating whether a particular element of the output (e.g. a record in a microdata file) belongs to a particular known individual or other unit.

The Title 13 example above refers to identity disclosure, and the ability to infer this may be called identifiability. This term is used by Government Statistical Service (2009, p.12) to explain the requirements for confidentiality protection in the UK Code of Practice for Official Statistics: an output “will not usually directly reveal the identity of an individual, but more usually the risk is that the statistic may make an individual identifiable - the individual has not yet been identified, but it may be possible to do so”.

We shall represent inference about y by a predictive probability distribution $p(y \mid O, D, \text{attack method})$, which depends on $O$, the statistical output, and $D$, the auxiliary data available to an intruder who may seek to learn about y using a particular attack method. There will generally be a set of such distributions for different targets y (corresponding to different data subjects and variables) and different possible sources of auxiliary data, $D_k$, and attack methods $k$ ($k = 1, \ldots, K$). We take the agency’s first loss criterion to be a summary of these predictive distributions in what we call the disclosure potential (or identifiability in the case of identity disclosure), expressed as the function: $H(O; D_k, \text{attack method } k;\ k = 1, \ldots, K)$. The agency is assumed here to have integrated out the uncertainty in the predictive distributions for the various possible targets y. The form of $H$ will depend on what the agency judges to be its obligations. If, for example, these relate to identification and the agency focuses on the worst case of identification, $H$ could be the maximum value of $p(y \mid O, D, \text{attack method})$ across all identity indicators y for different output elements and data subjects. Alternatively, if the agency focuses on how often this probability exceeds 0.5 or some other threshold, $H$ might count the number of data subjects for which this is the case for some element of the output.

We shall discuss the notion of predictive probability distribution further in Section 4. We draw on the use of this distribution by Duncan and Lambert (1986) as a unifying representation of disclosure risk, but do not follow their interpretation of this distribution in terms of the intruder’s perspective, as will be discussed in Section 3. Note also that there is a significant part of the disclosure limitation literature where disclosure relates to deductive (mathematical) inference, e.g.
the notions implicit in the p% rule (Cox, 2001) and not probabilistic inference as supposed here. In principle, it would seem that such measures could also be combined into a disclosure potential function, since this is not required to take a probabilistic form, but we shall not explore this possibility further here.

Although agencies’ legal and professional obligations to protect confidentiality appear most often to be expressed in terms of inferential criteria with disclosure viewed as the potential outcome of concern, there are also ethical and other reasons why agencies will wish to take account of the possible undesirable consequences of possible disclosure or, more generally, of any actions by intruders designed to breach confidentiality. We refer to such consequences as disclosure harm (cf. Lambert, 1993). Such consequences may refer to:
• respondents: for example the impact of whatever actions an intruder might take following an attempt at disclosure, whether the resulting disclosure is correct or false; such harm might be ‘legal, reputational, financial or psychological’ (National Research Council, 2007, p.14) and
• the agency: for example, the impact of publicity about reported disclosures on the agency’s reputation and on response rates to surveys run by the agency (Couper et al., 2010).

Disclosure harm is our second loss criterion. We discuss its definition further in Section 5. We note that the assessment of harm may need to take account of considerable uncertainty about potential intruder actions, but we treat this uncertainty as quite separate from the inferential uncertainty arising in the assessment of disclosure potential.

Both disclosure potential and disclosure harm are criteria which capture the possible consequences of the agency’s actions for respondents and for the agency. It is also, of course, essential to consider the consequences for the genuine users of the outputs.
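
As a minimal sketch of the two example forms of H mentioned earlier in this section (the worst case and the threshold count), the Python fragment below summarises a small table of identification probabilities. The function names, the table layout (one row per scenario k, one column per data subject, each entry assumed to be already maximised over output elements) and the numbers are illustrative assumptions, as is the choice to take the worst case across scenarios; the form of H is left to the agency's judgement.

```python
from typing import Sequence


def worst_case_potential(probs: Sequence[Sequence[float]]) -> float:
    """Largest identification probability over all scenarios and data subjects."""
    return max(p for scenario in probs for p in scenario)


def threshold_count_potential(
    probs: Sequence[Sequence[float]], threshold: float = 0.5
) -> int:
    """Number of data subjects whose probability exceeds the threshold
    under at least one scenario."""
    n_subjects = len(probs[0])
    return sum(
        any(scenario[j] > threshold for scenario in probs)
        for j in range(n_subjects)
    )


# Hypothetical probabilities: rows are scenarios k, columns are data subjects.
example = [
    [0.10, 0.62, 0.05],
    [0.30, 0.55, 0.45],
]

worst = worst_case_potential(example)
count = threshold_count_potential(example, threshold=0.5)
print(worst, count)
```

Either summary could then be compared with whatever tolerance level the agency judges appropriate; the sketch says nothing about how such a tolerance should be set.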
This impact on genuine users is represented by the utility of the output, which refers to how far it meets the analytical needs of users. Many SDL methods can impact negatively on these needs either through loss of information, e.g. loss of geographical detail, or through reductions of data quality, e.g. by affecting the biases and variances of users’ estimators. These impacts are of crucial importance when evaluating SDL methods but, as noted earlier, are not the main focus of this paper. For further discussion see Karr et al. (2006) and Woo et al. (2009).

In summary, we have argued that the loss to the agency should be based upon three principal criteria:
• disclosure potential;
• disclosure harm;
• utility of output to users.

The distinction between disclosure potential and harm is fundamental in this paper, in contrast to some of the literature where the notion of harm is subsumed within that of risk. For example, Duncan et al. (2001b) view disclosure risk, in its most general form, as including the consequences of intrusion both for the agency and for the intruder, and this enables risk-utility analyses to have more general applicability. Dobra et al. (2003) also subsume harm within risk by referring to disclosure risk as “the extent to which [the output’s] release can harm the agency or the data providers”. To reiterate, our distinction between disclosure potential and harm is that the former refers to the ability of the intruder to disclose (i.e. infer) information, whereas the latter refers to consequences of intruder actions (following a deliberate or inadvertent attempt at disclosure).

A related distinction between disclosure risk and harm is presented by Lambert (1993). She restricts disclosure risk to considerations of identity disclosure, whereas it is assumed here that disclosure potential may also refer to predictive disclosure. She states that she considers the latter form of disclosure only to the extent that it may lead to disclosure harm.
In this paper, we suppose that identity disclosure could also lead to harm. Our distinction is similar to that in National Research Council (2007, pp. 13-15) and to that between disclosure harm and ‘discredit harm’ in Trottini (2001).

2.3 The General Decision Problem

The elements of the decision framework outlined in the two previous sections are brought together in Figure 2. The two components of the agency’s actions (SDL methods and disclosure management) each impact directly on what analyses the users can conduct and on the environment in which these analyses are undertaken and hence each component has direct consequences for utility. Our primary focus here, however, is on the other loss criteria.

Disclosure potential is defined in terms of the capacity of an intruder to make inferences. The nature of these inferences depends upon the nature of the statistical outputs (and hence the SDL method) and the information additionally available to the intruder. The extent to which disclosure potential may depend additionally upon the way in which an intruder might attempt disclosure and, indirectly, upon the agency’s management approach to discourage such an attempt is discussed in Section 4.

The actions of the intruder, as represented in Figure 2, are taken to refer to any actions resulting from an attempt at disclosure. Examples include a journalist publishing a claim that they have identified a known individual in an output and possibly disclosing information about that person or a commercial intruder linking information from the output into a credit referencing database (Paass, 1988). It is supposed that the nature of the actions will depend upon the intruder’s capacity to make inferences (but not on the outputs themselves other than via these inferences) as well as upon other motivations of the intruder.
It is assumed that the actions will also be influenced by the agency’s disclosure management approach, for example by any penalties for misusing the outputs. The disclosure harm is taken to be purely a function of these actions.

[Figure 2: Framework for Agency’s Decision Making. Elements shown: Agency, Disclosure Limitation Methods, Disclosure Management, Utility, Intruder, Inference, Actions, Disclosure Potential, Disclosure Harm.]

Given the three distinct loss criteria, the general decision problem faced by the agency is a multiple objective one (Keeney and Raiffa, 1976). In fact, the multiplicity of objectives is even greater since each of the loss criteria may be multidimensional. The measurement of utility will usually require a trade-off between the needs of different users, for example reducing the geographic detail in an output may reduce the utility of the output far more to a user in local government (in a small municipality) than to a user in national government. Measures of the potential for predictive disclosure are variable-specific and thus typically multiple. Measuring harm also requires consideration of multiple dimensions, for example the harm to respondents vs. the harm to the agency.

There is an extensive literature on multiple objective decision making (e.g. Keeney and Raiffa, 1976; French, 1988). A classical approach would build on an assumption that the agency has preferences between any pair of different consequences to demonstrate the existence of a real-valued loss function of the different loss criteria and actions (DeGroot, 1970, Ch. 7). The agency’s optimal decision would then be to choose that action which minimized the expected loss (DeGroot, 1970, Ch. 8). We do not seek to develop the application of such an approach in this paper, however. We note that any such approach would need to address the trade-off between the three loss criteria (Doyle et al., 2001, p.1). See e.g.
Keeney and Raiffa (1976) for some discussion of trade-offs under uncertainty including, for example, the notion of efficient frontiers. Gomatam et al. (2005b) provide some discussion of such notions in the context of the two criteria of disclosure risk and utility. Such ideas could be extended to the trade-off between utility, harm and disclosure potential. Thus, for example, an agency might seek to maximize utility subject to upper bound constraints on each of disclosure potential and harm. A partial order on release options might be defined with respect to each of decreasing potential, decreasing harm and increasing utility and an associated potential-harm-utility frontier defined.

Specifying upper bound constraints for disclosure potential and harm will typically represent a key challenge for an agency. The separate assessment of these two criteria proposed in this paper should enable an agency to place greater reliance on methods of statistical science to set an upper bound for disclosure potential. Nevertheless, difficulties in establishing upper bounds for harm may in practice imply that judgements about tolerance levels for the two criteria will still require joint consideration. Thus, a higher level of disclosure potential might be tolerable if it is judged inconceivable that any serious harm could result from a proposed release, whereas a lower tolerance level might be specified if this is not the case. Such judgements about potential harm might be based upon the sensitivity of the data product to be released. Reports of illegal drug use or detailed information about financial assets are examples of highly sensitive data given by National Research Council (2005, p.71), where a lower tolerance level for disclosure potential might be judged prudent.

3 The Intruder as Decision Maker

3.1 Decision Theory

We now turn to considering the intruder’s perspective and explore his/her potential role as a decision maker.
The steps which an intruder might take are set out in Figure 3. The first step is to attempt disclosure. This involves the intruder gaining access to the outputs, which may require overcoming obstacles, such as the completion of a licensing agreement. The intruder may attempt disclosure in a number of ways. For example, if the output consists of a microdata file then the intruder may attempt to use record linkage software to link a microdata record to a record on an external database of known individuals. We may even include the possibility that a user discovers the opportunity for disclosure inadvertently by observing an unusual feature of the output and hence only becomes interested in disclosing information after gaining access to the output, despite having no mischievous intentions originally. ‘Methods of attack’ are discussed further in Section 4.1.

[Figure 3: Potential Behaviour of an Intruder. Steps shown in order: Attempt at Disclosure, Inference, Actions.]

The second and third steps in Figure 3 have already been distinguished in the previous section. If the intruder takes the inference step then he/she is able to compute predictive distributions. If the action step is taken then these distributions are used in some way. We now focus on these two steps since they have received particular attention in the literature. In particular, Duncan and Lambert (1986, 1989) introduced the use of decision theory (from the intruder’s perspective) to represent these two steps. Their framework is now outlined.

Let y denote a value which the intruder wishes to disclose. As discussed in Section 2.2, y is the (target) value of a variable (which is not publicly known) of a data subject in the case of predictive disclosure or the identity of an element of the output in the case of identity disclosure.
At the inference step, suppose that a standard Bayesian approach is adopted, whereby the intruder’s prior distribution for y is updated using the output, to obtain a posterior predictive distribution $p_I(y)$. The subscript $I$ is to emphasize the dependence on the intruder and to contrast it to $p(y)$ in Section 2.2. Now consider the action step and let $a$ denote a consequent action which the intruder might take, for example to claim publicly that the target value is a specified value or to claim that the identity of an element of the output is a specified person (or other unit). Suppose that the intruder specifies a loss function $L_I(y, a)$, representing his/her loss incurred from action $a$ when the true value is y. Then, the Bayesian optimal choice of $a$ will be to minimise the expected value of this loss, i.e. minimise $\sum_y L_I(y, a)\, p_I(y)$ with respect to $a$. Duncan and Lambert (1989) argue that it is natural for the agency to suppose that the intruder adopts this optimal strategy since it represents a conservative assumption.

3.2 Relation to Disclosure Potential and Harm

How does this decision theoretic formulation for the intruder relate to disclosure potential and harm as conceived in Section 2? We contend first that the introduction of the intruder’s loss function and the consideration of expected loss is not relevant to the consideration of disclosure potential, but only to disclosure harm. The intruder’s loss function refers to actions which, as represented in Figure 2, only affect disclosure harm. We distinguish the distribution $p(y)$, which may be formulated in a publicly defensible way by the agency, and the intruder’s distribution $p_I(y)$. The latter might be referred to as the risk of perceived disclosure (cf. Lambert, 1993), emphasizing its dependence on the intruder’s perspective. Lambert (1993) states that “the disclosure limitation model of Duncan and Lambert . . .
does not separate true and false disclosures, since what matters is what the intruder believes has been disclosed” (p.315) and that “disclosure is limited only to the extent that the intruder is discouraged from making any inferences, correct or incorrect, about a particular target respondent” (p.316). We conclude that the decision theoretic formulation for the intruder may be relevant to those elements of the agency’s disclosure management approach (or indeed choice of SDL methods) designed to discourage disclosure attempts and it may be relevant to the agency’s consideration of disclosure harm, in particular since false disclosures may be harmful, but it is not relevant to the assessment of ‘true’ disclosure potential. The decision theoretic framework of Duncan and Lambert (1986, 1989) is extended by Dobra et al. (2003). They set out a very general framework where each of the agency, the intruder and the genuine user face decisions in relation to their own actions and loss functions. They conceive of the intruder as operating in a similar way to Duncan and Lambert (1986, 1989), i.e. adopting an optimal strategy based upon expected loss calculations. They then set up a framework in which the agency takes decisions in the light of the potential intruder actions. They define an agency loss function which “quantifies, from the agency’s perspective, the harm that the intruder’s action produces to the agency and the data providers” for a given ‘state of the world’. They then define the ‘disclosure risk’ as the expected value of this agency loss function, where the expectation is with respect to the agency’s posterior probability distribution about the actions of the intruder and the state of the world. They assume that the agency knows the intruder’s target, prior and loss function. Their definition refers, however, to what we have called disclosure harm not disclosure potential. 
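
The intruder's expected-loss calculation outlined in Section 3.1 can be illustrated with a short Python sketch. Everything specific here is a hypothetical assumption made for illustration: the two-valued target (`"match"` or `"no_match"`), the two actions (`"claim"` or `"abstain"`), the posterior probabilities and the loss values. The sketch only shows the mechanics of choosing the action that minimises the sum over y of the loss times the intruder's posterior; it is not a reconstruction of any particular model in the cited papers.

```python
from typing import Callable, Dict, Iterable, Tuple


def bayes_action(
    posterior: Dict[str, float],
    actions: Iterable[str],
    loss: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the action with the smallest expected loss, and that loss.

    posterior maps each possible target value y to the intruder's
    probability for y; loss(y, a) is the intruder's loss from action a
    when the true value is y.
    """
    def expected_loss(a: str) -> float:
        return sum(loss(y, a) * p for y, p in posterior.items())

    best = min(actions, key=expected_loss)
    return best, expected_loss(best)


# Hypothetical loss: a correct public claim is a gain (negative loss),
# a false claim is costly, abstaining costs nothing.
def intruder_loss(y: str, a: str) -> float:
    if a == "abstain":
        return 0.0
    return -1.0 if y == "match" else 2.0


posterior = {"match": 0.75, "no_match": 0.25}  # assumed posterior for y
action, value = bayes_action(posterior, ["claim", "abstain"], intruder_loss)
print(action, value)
```

With these assumed numbers the expected loss of claiming is below that of abstaining, so the sketch selects the claim. Lowering the posterior probability of a match, or raising the penalty for a false claim, would reverse the choice, which is one way to picture how disclosure management might discourage attempts.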
Some further extensions of the decision theoretic framework of Duncan and Lambert (1986, 1989) are given by Trottini (2001, 2003) and Trottini and Fienberg (2002). A related application of decision theory to the actions of each of the agency, the intruder and the genuine user is presented by Keller-McNulty et al. (2005). They view these actions, as in a game theory perspective, as ones where the intruder is an adversary of the agency and the genuine user. By focussing on actions rather than inference, their approach may also be viewed as of relevance to disclosure harm but not potential. Their implicit definition of disclosure harm is the intruder’s expected utility, which is analogous (after multiplying by -1) to the expected loss in the Duncan and Lambert (1986) framework, although they adopt a particular approach to defining the loss function in terms of Shannon’s information entropy. This definition of disclosure harm is analogous to a special case of the definition of Dobra et al. (2003), where the agency assesses harm purely in terms of the intruder’s perspective, with anything that the intruder views as a gain being viewed by the agency as a loss.

4 Measuring Disclosure Potential in terms of Predictive Distributions

We now return to the agency’s perspective and consider the conceptual basis of disclosure potential. We build on the notion of a predictive distribution p(y) for a target y introduced in Section 2.2. Recall that this embraces not only the notion of predictive disclosure but also that of identity disclosure. Although, as discussed in Section 3, we assume that disclosure potential does not depend on any actions of an intruder which might follow an attempt at disclosure, we still need to consider the possible dependence of p(y) on the way in which the intruder attempts disclosure and this is discussed in Section 4.1.
The move from the intruder’s to the agency’s perspective also raises the question of how to take account of this change, which is discussed in Section 4.2. Some more issues in implementing definitions of disclosure potential are presented in Section 4.3.

4.1 Dependence on Nature of Attack

The predictive distribution p(y) may depend on the agency’s statistical outputs, denoted O, the auxiliary data which the intruder has available, denoted D, as well as on the way in which the intruder obtained these data, which we refer to as the method of attack. Suppose that the agency is able to enumerate possible scenarios k = 1, . . . , K, each of which might correspond to a different attack method k, and/or a different set of auxiliary data $D_k$, and/or a different intruder. The predictive distribution for scenario k is then denoted $p(y \mid O, D_k, \text{attack method } k)$. If the agency could attach a probability $p(D_k, \text{attack method } k \mid O)$ to each scenario k and if these scenarios were mutually exclusive, the agency could, in principle, compute an unconditional predictive distribution

$$p(y \mid O) = \sum_{k=1}^{K} p(y \mid O, D_k, \text{attack method } k)\, p(D_k, \text{attack method } k \mid O).$$

We argue, however, that it is more natural to define disclosure risk conditional on the scenario, since this seems to correspond better to the kinds of obligations, like that for Title 13 in Section 2.2, which refer to whether disclosure can be achieved. Suppose, for example, that only one attack method is feasible, that identity disclosure can be achieved with probability 0.99 if this method is used, but that there is reason to suppose that the probability that this attack method will be used is 0.001. Then we suggest that this form of output does not meet the confidentiality requirements of Title 13 (see Section 2.2) since the probability that identity disclosure can be achieved (if an attempt is made) is high, even though the probability that a disclosure will take place might be judged low.
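The contrast in this example can be checked numerically. The sketch below assumes, purely for illustration, that the scenarios are the single feasible attack and a complementary "no attack" scenario under which disclosure cannot be achieved; the function name `unconditional_probability` and the variable names are hypothetical.

```python
import math
from typing import Sequence


def unconditional_probability(conditional: Sequence[float],
                              scenario_probs: Sequence[float]) -> float:
    """Mix scenario-conditional disclosure probabilities over scenarios.

    Assumes the scenarios are mutually exclusive and exhaustive, so the
    scenario probabilities must sum to one.
    """
    if len(conditional) != len(scenario_probs):
        raise ValueError("one scenario probability is needed per scenario")
    if not math.isclose(sum(scenario_probs), 1.0):
        raise ValueError("scenario probabilities must sum to one")
    return sum(p * w for p, w in zip(conditional, scenario_probs))


# Scenario 1: the feasible attack is used. Scenario 2 (an assumption made
# here for illustration): no attack, so disclosure has probability zero.
conditional = [0.99, 0.0]
scenario_probs = [0.001, 0.999]

mixture = unconditional_probability(conditional, scenario_probs)
worst_case = max(conditional)
```

The mixture is roughly 0.001, whereas the scenario-conditional value is 0.99. A rule based on the mixture would treat the output as safe, while a rule based on whether disclosure can be achieved would not.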
An advantage of conditioning on the scenario is that it avoids the need to make probability judgments about whether the scenario would occur. Such judgements are very difficult, given the hypothetical nature of the intruder, and seem rather different than the kinds of probability judgements required for assessing the potential for inference. The former judgements thus seem better considered alongside the actions related to disclosure harm. See Marsh et al. (1991) and National Research Council (2005, p.70) for further discussion of the probability of an attack. A related advantage of conditioning is that it removes the direct connection between disclosure potential and what was called disclosure management in Section 2.1, enabling the tasks of the agency to be separated into assessments of: (i) disclosure potential, under different scenarios, and how this depends on the choice of SDL method; (ii) attempts at disclosure and disclosure harm, and how these depend upon disclosure management and (indirectly) on SDL methods; (iii) utility and its dependence on SDL methods and disclosure management. Of course, specification of the scenarios of attack in (i) requires some speculation about possible intruder behaviour, but it suffices to focus such speculation on the identification of potential auxiliary data sources (for which disclosure potential needs to be assessed). A consequence of conditioning on the scenario is that there may be a multitude of measures of disclosure potential, given the large number of possible scenarios. The assessment exercise may thus be viewed as a sensitivity analysis. Frank (1986, p.22) considers that this represents a “fundamental difficulty” and that the need to specify “predictive distributions for all conceivable users” could be “intractable”. Some research has been undertaken on possible scenarios (e.g. Duncan et al., 2011, Sect. 
2.1.3) and such research provides a basis for specifying different scenarios in assessing disclosure potential (e.g. Paass, 1988; Willenborg and De Waal, 2001, Sect. 2.3). In practice, it is common for an agency to specify a set of such scenarios against which it wishes to protect and to update these in the light of new research on external data sources. Such research is important, given the strong dependence of disclosure potential on the nature of the auxiliary data available. Some reduction in the task of investigating all possible scenarios may, in principle, be achieved by restricting attention to the worst case(s) (e.g. Duncan et al., 2001b). Alternatively, there may be grounds for considering only ‘reasonable’ scenarios and not the most extreme ones. In its guidance on the interpretation of the UK Code of Practice for Official Statistics, Government Statistical Service (2009) states that “account should be taken of the ‘means likely reasonably to be used’ to identify an individual. Thus the hypothetical but remote possibility of identification is not something that automatically makes a statistic disclosive. The design and selection of intruder scenarios should be informed by the means likely reasonably to be used to identify an individual in the statistic” (p.11). A related example is the use of ‘de facto anonymisation’ of business microdata in Germany, for which scenarios are excluded from consideration if the intruder’s “costs of trying to reidentify records in the dataset” are deemed to be “higher than the benefit gained by the disclosed information” (Brandt et al., 2008).

In our formulation above, we have expressed the predictive distribution, $p(y \mid O, D_k, \text{attack method } k)$, as dependent not only on the outputs and the data available to the intruder but also on the attack method. It is more conventional to specify dependence upon only the output O and the data D. See e.g. the definition of identification risk in Reiter (2005, equation 1).
Although this may be reasonable in practice in many situations, Skinner (2007) shows that, in general, there may be an additional dependence on the attack method, i.e. the attack method may be ‘non-ignorable’ in the sense that

$$p(y \mid O, D_1 = D, \text{attack method } 1) \neq p(y \mid O, D_2 = D, \text{attack method } 2)$$

for two different attack methods, even though the auxiliary data D observed by the intruder as a result of each method of attack may be the same. An example presented by Skinner (2007) is a comparison between:

• a directed search, where the intruder selects a known individual from the population and seeks a match in the output, vs.
• a fishing expedition, where the intruder selects an unusual element of the output and then seeks a match in the population.

Skinner (2007) suggests that in such cases it may be reasonable to identify and assume a realistic worst case for the attack method, given D.

4.2 Dependence on Intruder Perspective via a Subjective Bayesian Approach

As noted in the previous section, our definition of $p(y \mid O, D_k, \text{attack method } k)$ already requires consideration of the intruder’s perspective via the auxiliary data $D_k$ available to the intruder, and possibly via the intruder’s attack method. There remains the question of whether the agency should specify any further dependence of $p(y \mid O, D_k, \text{attack method } k)$ on the intruder’s perspective. Many forms of prior information, available to the intruder from their personal knowledge and experience, may be incorporated into the auxiliary data $D_k$ within our framework. We focus instead in this section on the possible use of a Bayesian approach to represent the intruder’s pre-existing information or beliefs as a prior distribution for some parameters in the model upon which $p(y \mid O, D_k, \text{attack method } k)$ is based. Or, more generally, the model itself may be interpreted in a subjective Bayesian way as reflecting the intruder’s uncertainty about y. See Fienberg et al.
(1997) and Reiter (2005) for illustration in the case of identity disclosure. Such Bayesian approaches may be contrasted with comparable model-based frequentist approaches, as in Skinner and Shlomo (2008), which also base estimates of p(y|O, Dk , attack method k) on a model, but do not seek to employ prior distributions to reflect the intruder perspective, nor do they view the model as representing the intruder’s subjective perspective, but rather as an ‘objective’ model which may be specified by the agency using data-based techniques of statistical modelling. For example, Skinner and Shlomo (2008) propose a data-based technique for selecting a log-linear model which ‘optimises’ certain prediction properties of the model and does not attempt to incorporate prior information. A basic question for a subjective Bayesian approach is: what criteria should the prior distribution be required to meet for the resulting predictive distribution of y to reflect an appropriate notion of disclosure potential? If the prior distribution is allowed to represent any subjective beliefs of an intruder then, as Lambert (1993) discusses, it seems more appropriate to view the predictive distribution as reflecting perceived risk, which may embrace incorrect disclosure as well as correct disclosure. As discussed in Section 3.2, such perceived risk may be relevant to assessments of disclosure harm. However, in our view, the kinds of obligations discussed at the beginning of Section 2.2 require disclosure potential to be defined in terms of correct disclosure and for the predictive distribution to have an inferential basis which is defensible under public scrutiny (in the same way that any outputs of official statistics need to be publicly defensible). We do not see that these requirements can be guaranteed if priors are allowed to be any plausible subjective distribution for any intruder. 
This could include, for example, the case of an intruder who is over-confident that an observed match is correct on the grounds that a combination of matching variables is unique in the population, failing to appreciate the potential for non-uniqueness or for the match to have arisen because of measurement errors or other reasons. It thus seems inappropriate for the definition of disclosure potential to be dependent upon intruders’ unjustified prior perceptions. To answer the question at the beginning of this paragraph, we consider that it should be possible to defend any prior distribution used in a Bayesian approach by justifying how it leads to a predictive distribution which reflects a valid probability of correct disclosure. More specifically, we consider that the construction of the prior distribution should be defensible from the agency’s perspective (and thus in an inter-subjective way) on the basis of plausible assumptions about the prior information available to the intruder. In principle, one could imagine that an agency might itself seek to elicit these priors. In practice, however, the task of identifying plausible sources of auxiliary information, $D_k$, is already so challenging that it seems understandable that agencies might place such elicitation outside the bounds of their standard procedures. This appears to have been the usual case in practice until now. In Reiter (2005), perhaps the most substantial practical application of Bayesian methods to identification risk assessment to date, intruders’ prior distributions have very little prominence. The main value of Bayesian methodology in current disclosure risk assessment practice seems to be a technical one: it provides a clear conceptual way of integrating out uncertainty about parameters in the predictive distribution $p(y \mid O, D_k, \text{attack method } k)$.
Empirical evidence is needed to assess whether there are non-negligible practical differences between the resulting Bayesian measures and comparable frequentist model-based measures, as developed by Skinner and Shlomo (2008).

4.3 Implementation of Measures of Disclosure Potential

Having addressed conceptual aspects of the predictive distributions in the previous two sections, we now turn to consider some ways in which an agency may implement measures of disclosure potential based on these distributions. This might be viewed as the problem of ‘estimating’ these distributions. We focus on the case of identity disclosure, where y is binary, in the context of the release of microdata. For some discussion of attribute disclosure (where y is the value of a variable for a target unit) see Duncan and Lambert (1986). We consider a method of attack where the intruder seeks to link a record in the microdata file to some external data source of known units using values of some variables, which are included in both the microdata and the external source. These variables are often called key variables and their values in the external data source define the auxiliary data D (Bethlehem et al., 1990). A basic problem with estimating $p(y \mid O, D, \text{attack method})$ in this context is that D is hypothetical and thus unknown. There are two established approaches to specifying D and estimating the predictive distribution:

• an empirical matching experiment: construct a surrogate file D, for which the true correspondence between the records in D and O is known by the agency, mimic the behaviour of the intruder by using a record linkage algorithm to match D and O, and record the proportion of correct matches;
• a modelling approach: make assumptions about the nature of D (and the attack method) within a modelling framework which enables $p(y \mid O, D, \text{attack method})$ to be specified and then make inference about this distribution, given the data available to the agency.
We consider these two approaches in turn. The empirical matching experiment cannot be used to estimate the probability of a correct match for a specific pair of records since all that is observed at this level is binary, either match or non-match. Hence, such an experiment will not provide a ‘record-level’ measure of disclosure potential. Instead, the proportion of correct matches across a set of records, possibly the whole file, provides an estimated probability, which may be treated as a ‘file-level’ or ‘subfile-level’ measure. For the estimate to be reliable, the number of records in the set will need to be sufficiently large. However, as a result of ‘smoothing’ across all these records, this approach may fail to identify the most ‘risky’ records. An advantage of the empirical matching approach is that it can accommodate any matching algorithm which an intruder might use, for example a deterministic record linkage approach (Spruill, 1982), and any SDL method and, in this sense, can avoid modelling assumptions. In particular, the approach does not depend upon assumptions about intruder perceptions and Lambert (1993) thus terms the empirical proportion the risk of true identification. A key practical challenge in an empirical matching experiment is how to construct a realistic surrogate intruder dataset, which allows for the disclosure protection arising from sampling and measurement differences between sources and for which there is some overlap of units with the microdata and the nature of this overlap is known. Sometimes there may be a suitable alternative data source (e.g. Blien et al., 1992) or a different survey undertaken by the agency, although agencies often control sampling to avoid such overlap. Even if there is overlap, determining which units are in common may be resource intensive, discouraging routine use of this approach. 
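The file-level proportion described above can be sketched as follows. The toy records, the function name `matching_experiment` and the linkage rule (link a surrogate record only when exactly one microdata record shares its key values) are assumptions made for illustration; an actual experiment would use whichever linkage algorithm the agency judges an intruder might employ.

```python
from collections import defaultdict
from typing import Hashable, List, Tuple

Record = Tuple[Hashable, Tuple[str, ...]]  # (true unit id, key variable values)


def matching_experiment(microdata: List[Record],
                        surrogate: List[Record]) -> Tuple[int, float]:
    """Return (number of links made, proportion of links that are correct).

    Assumed deterministic rule: a surrogate record is linked only if
    exactly one microdata record has the same key variable values.
    """
    by_key = defaultdict(list)
    for unit_id, key in microdata:
        by_key[key].append(unit_id)

    links = correct = 0
    for unit_id, key in surrogate:
        candidates = by_key.get(key, [])
        if len(candidates) == 1:
            links += 1
            correct += int(candidates[0] == unit_id)
    rate = correct / links if links else float("nan")
    return links, rate


# Toy released sample O (unit ids are known to the agency only).
microdata = [
    (1, ("F", "30-39", "teacher")),
    (2, ("M", "40-49", "nurse")),
    (3, ("M", "40-49", "nurse")),
    (4, ("F", "50-59", "farmer")),
]
# Toy surrogate intruder file D; units 5 and 6 are not in the sample.
surrogate = [
    (1, ("F", "30-39", "teacher")),
    (5, ("F", "50-59", "farmer")),
    (2, ("M", "40-49", "nurse")),
    (6, ("M", "20-29", "clerk")),
]

links, rate = matching_experiment(microdata, surrogate)
```

Here two links are made and one is correct. The false link for unit 5 arises because a unit outside the sample shares key values with a sample-unique record, which is the kind of protection from sampling that a realistic surrogate file needs to reflect.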
In the absence of another dataset, the agency may consider a ‘re-identification’ experiment, in which the microdata file is matched against itself, normally after the application of some SDL method (Paass and Wauschkuhn, 1985; Paass, 1988; Winkler, 2004).

The modelling approach may be formulated in the same conceptual framework as the empirical matching experiment, but seeks to obtain expressions for the predictive distributions via theoretical arguments, under assumptions about the nature of D and the attack method. Measures of disclosure potential associated with these expressions may then be estimated from the microdata. A practical disadvantage of this approach, compared to the empirical matching approach, is that it may not be theoretically straightforward to accommodate any specific matching algorithm which an intruder might use. Instead, approximating assumptions might be made. These may have the benefit of providing simpler or more interpretable measures of disclosure potential. An advantage of the modelling approach is that it permits the estimation of record-level measures of identifiability. A concern with file-level measures is that the principles governing confidentiality protection often seek to avoid the identification of any individual, that is, they require the probability to be below a threshold for each record, and such aims may not adequately be addressed by file-level measures. In contrast, record-level measures, which take different values for each microdata record, may help identify those parts of the sample where disclosure potential is high and more protection is needed, and may be aggregated to a file-level measure in different ways if desired (Lambert, 1993). Model-based expressions for predictive distributions have tended to be studied separately for continuous and categorical key variables. We shall illustrate some points about predictive distributions in the categorical case.
For continuous key variables, where random noise is added to the values of the key variables appearing in O, see e.g. Fuller (1993), who derives expressions for the relevant predictive distributions and discusses their estimation. Paass and Wauschkuhn (1985) and Paass (1988) discuss a general approach where identification is viewed as a classification problem and discriminant analysis techniques are used.

Turning to the case when the key variables are categorical, suppose initially that they are recorded in the same way in D and O and that no SDL method is applied. In this case, a simple expression for the probability that an observed exact match between records in the two sources is correct is 1/F, where F is the number of units in the population which share the same key variable values as the two matching records (Duncan and Lambert, 1989; Skinner and Holmes, 1998). Skinner (2008) derives this expression under the assumption that the intruder employs a probabilistic record linkage method, of the type considered by Fellegi and Sunter (1969). The expression 1/F assumes that there is exactly one matching record in the microdata and that certain exchangeability assumptions hold about the F matching units. More importantly, it assumes that F is part of D and thus included in the conditioning set in $p(y \mid O, D, \text{attack method})$. We contend that this conditioning set should consist of the information assumed to be available to the intruder (not the information available to the agency) and, thus, whether F should be part of the conditioning set depends on whether it is reasonable to suppose that the intruder knows F. In many realistic settings, it may be assumed that F is unknown to the intruder. In this case, $p(y \mid O, D, \text{attack method})$ may be expressed as $E(1/F \mid O, D, \text{attack method})$, where F is ‘integrated out’ using its conditional distribution given what the intruder observes.
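The difference between conditioning on F and integrating it out can be illustrated numerically. The sketch below assumes, purely for illustration, that the matched record is unique in the sample and that the number of further population units sharing its key values follows a Poisson distribution with mean `mu`. This Poisson form and the function names are hypothetical choices, not a model stated in the text above.

```python
import math


def match_prob_known_F(F: int) -> float:
    """Probability that an exact match is correct when F is known: 1/F."""
    if F < 1:
        raise ValueError("F must be at least 1")
    return 1.0 / F


def match_prob_unknown_F(mu: float, terms: int = 200) -> float:
    """E(1/F) with F = 1 + X and X ~ Poisson(mu) (assumed model).

    Computed by summing the series sum_x P(X = x) / (1 + x) directly.
    """
    total = 0.0
    pmf = math.exp(-mu)  # P(X = 0)
    for x in range(terms):
        total += pmf / (1 + x)
        pmf *= mu / (x + 1)  # move from P(X = x) to P(X = x + 1)
    return total


def match_prob_closed_form(mu: float) -> float:
    """Closed form of the same expectation: (1 - exp(-mu)) / mu."""
    return 1.0 if mu == 0 else -math.expm1(-mu) / mu


known = match_prob_known_F(4)           # intruder knows F = 4
integrated = match_prob_unknown_F(2.0)  # intruder does not know F
```

Under this assumed model the series and the closed form agree, and the value depends only on what the intruder is assumed to observe (through `mu`) rather than on the true F, which is the point of placing only intruder-available information in the conditioning set.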
Skinner and Holmes (1998) and Elamir and Skinner (2006) provide expressions for this probability assuming the key variables obey certain log-linear models and discuss how this probability may be estimated given only the sample microdata. Skinner and Shlomo (2008) discuss the specification of these models. The estimated probabilities may be viewed as record-level measures of disclosure potential. A simpler approach is obtained by assuming that the match observed by the intruder is obtained randomly (with equal probability) from all units in the population which match a microdata record which has a unique combination of key variable values in the sample. The resulting probability can be expressed as 1/F̄, where F̄ is the mean value of F across all combinations of key variable values which are unique in the sample. This measure may be estimated simply from sample microdata (Skinner and Elliot, 2002; Skinner and Carter, 2003) and may be treated as a file-level measure. Some related file-level measures, such as the proportion of ‘sample uniques’ that are ‘population unique’, are discussed by Bethlehem et al. (1990) and Fienberg and Makov (1998).

Model-based assessment can become more complex when the output O has been subject to SDL methods. Reiter (2005) demonstrates how measures of identifiability can be obtained for a variety of SDL methods, including recoding, data swapping and adding random noise. Shlomo and Skinner (2010) consider the use of misclassification-based SDL methods.

5 Further Consideration of Disclosure Harm and Intruder Behaviour

We have argued for separating the assessment of disclosure potential from the assessment of disclosure harm with a view to focussing the role of statistical science in the former task. Now, we make some remarks about the latter task. From the agency’s decision taking perspective, the key question is how to reduce disclosure harm through disclosure management approaches together with the SDL methods.
We conceive of disclosure harm as the expected value of the loss incurred from the consequences of intruder behaviour. Assessing disclosure harm thus requires the assessment of three components:

• the nature of potential intruder behaviour and its consequences;
• the uncertainty about intruder behaviour and the consequences;
• the loss incurred from these consequences.

Although it is clearly feasible for Bayesian decision theory to have a place in modelling intruder behaviour, as discussed in Section 3.2, we suggest that the scientific assessment of potential intruder behaviour in the face of alternative disclosure management (and SDL) approaches is more a matter for social science than for statistical science. Assessing which kinds of people might attempt disclosure and the effects of approaches such as user training on the probability of a user attempting disclosure seem social science questions and may be addressed by empirical experiments. For example, O’Hara et al. (2011) describe an experiment where intruder behaviour was mimicked by recruiting postgraduate students who, like hackers in the real world, lacked knowledge of large-scale data handling and the SDL literature but had good computing skills and were driven by the aim of identifying subjects or disclosing further information about them in an anonymized microdata source. The experiment provided the agency with a better understanding of the kinds of attacks which intruders might employ and the kinds of threats arising from such behaviour. The assessment of uncertainty about potential intruder attacks and behaviour seems a very different task to the use of statistical inference to assess disclosure potential. Systematic modelling or assessment of such uncertainty seems more related to social science and risk management. Valuing the loss which would be incurred from specified consequences of intruder behaviour seems more a matter for policy judgement.
Social science may inform this judgement, for example research into respondents’ perspectives and concerns about confidentiality. Different respondents and associated stakeholder groups, such as privacy organisations, do not all share a common perspective and dealing with these differences is a policy issue. Handling uncertainty about potential harmful outcomes is unlikely to be simply a matter of considering expected outcomes. Most agencies will also wish to take account of public perceptions of potential harm, in particular since these may adversely affect participation in surveys run by the agency (Singer, 2004; Couper et al., 2008, 2010). There are difficult challenges too in taking account of potential changes in public perceptions over time, in particular since the intruder behaviour and its consequences may occur well after the agency makes its release decision. The potential for sudden changes in public concerns about confidentiality was illustrated by the intense media coverage of losses of data discs by government in the United Kingdom (Hand, 2008). Issues in the management of public perceptions of the agency may also arise, but these are not ones of technical statistical science.

6 Implications for Agency Practice

The conceptual framework in this paper may be used by an agency to structure its evaluation of SDL and disclosure management options. This framework takes account of the different kinds of expertise which are needed by staff undertaking the evaluation or are obtained through consultation with individuals and bodies outside the agency. The restriction of a broader notion of disclosure risk to the narrower one of disclosure potential is designed to enable it to be assessed by agency statisticians alone with methods of statistical science, by excluding consideration of matters like intruder behaviour, public perceptions of disclosure risk and false disclosure, which are relevant to the separate criterion of disclosure harm.
The assessment of this latter criterion needs more multidisciplinary input, from social scientists, stakeholders representing respondents and policy makers (see Section 5). Assessment of utility, our third criterion, requires input from the users of the outputs. Overall decision making requires further policy input, such as through a microdata review panel, to take account of trade-offs between the three criteria. The evaluation of disclosure potential might be divided into three kinds of tasks. First, there is a need for ongoing assessment and updating of plausible scenarios and associated sources of auxiliary information which intruders might use. This is necessary background information for any assessment of disclosure potential. Second, evaluation will be required for routine decisions about release. In this case, it is appealing in practice for rules to be simple, transparent and objective. Some discussion of two types of approach to assessing disclosure potential was given in Section 4.3 for the case of identity disclosure in microdata: (i) empirical matching experiments have the advantage that they can handle a wide range of SDL methods although they may be somewhat elaborate for routine use, (ii) modelling approaches can provide the rationale for simpler measures, provided the SDL method can be handled tractably. Thresholds for levels of disclosure potential will need to be established from broader policy considerations. Harm considerations may lead to different thresholds as well as different disclosure management procedures for different kinds of survey variable. For example, health variables might be subject to more stringent rules. Third, detailed assessments may be undertaken as part of occasional in-depth exercises designed to choose between alternative major types of SDL approach or to validate simple disclosure protection rules. 
We proposed in Section 4 that disclosure potential be measured in terms of probabilistic prediction, whether of identity or unknown attributes, given the observable statistical outputs and hypothetical auxiliary data. Simple measures of disclosure potential might therefore be validated through simulation studies calibrating the probabilistic measures against their empirical performance.

7 Conclusions

Statistical disclosure limitation is well established in statistical science as a body of theory and methods and remains the subject of active research. Although there is no shortage of SDL methods which have found application by government statistical agencies, a common scientific methodology for assessing disclosure risk and making decisions based upon these assessments has found less ready adoption in agency practice. This paper has addressed the foundations of this topic with the aim of clarifying what statistical science can contribute to such decision making and what it cannot. We have argued for the assessment of disclosure risk to be separated into the assessments of disclosure potential and disclosure harm, enabling the use of methods of statistical science to be focussed on the former task. Whilst we have recognized a role for statistical decision theory, we have sought to remove intruder behaviour from its ambit, viewing that as of more relevance to what we have called disclosure management. Nevertheless, in our more detailed consideration of disclosure potential, we have discussed how it may depend on the nature of potential intruder attacks. Our discussion of how to assess disclosure potential in practice has been limited to a prototypical set-up and there is certainly much scope for further research, as recommended by National Research Council (2005, p.72), taking account of the different kinds of SDL methods finding application under evolving modes of access.

Acknowledgements

I am grateful to Natalie Shlomo and two reviewers for comments.
Research was supported by the Economic and Social Research Council.

References

Abowd, J.M., Nissim, K. & Skinner, C. (2009). First issue editorial. J. Privacy Confident., 1, 1-6.
Bethlehem, J.G., Keller, W.J. & Pannekoek, J. (1990). Disclosure control for microdata. J. Amer. Statist. Assoc., 85, 38-45.
Blien, U., Wirth, H. & Müller, M. (1992). Disclosure risk for microdata stemming from official statistics. Statist. Neerland., 46, 69-82.
Brandt, M., Lenz, R. & Rosemann, M. (2008). Anonymisation of panel enterprise microdata - survey of a German project. In Privacy In Statistical Databases, Lecture Notes In Computer Science 5262, Eds. J. Domingo-Ferrer & Y. Saygin, pp. 139-151. Berlin: Springer.
Couper, M.P., Singer, E., Conrad, F.G. & Groves, R.M. (2008). Risk of disclosure, perceptions of risk and concerns about privacy and confidentiality as factors in survey participation. J. Official Statist., 24, 255-275.
Couper, M.P., Singer, E., Conrad, F.G. & Groves, R.M. (2010). Experimental studies of disclosure risk, disclosure harm, topic sensitivity, and survey participation. J. Official Statist., 26, 287-300.
Cox, L.H. (2001). Disclosure risk for tabular economic data. In Confidentiality, Disclosure And Data Access: Theory And Practical Applications For Statistical Agencies, Eds. P. Doyle, J.I. Lane, J.J.M. Theeuwes & L.V. Zayatz, pp. 167-183. Amsterdam: Elsevier.
Cox, L.H., Karr, A.F. & Kinney, S.K. (2011). Risk-utility paradigms for statistical disclosure limitation: how to think, but not how to act (with discussion). Int. Statist. Rev., 79, 160-183.
Dalenius, T. (1977). Towards a methodology for statistical disclosure control. Statistisk Tidskrift, 5, 429-444.
DeGroot, M.H. (1970). Optimal Statistical Decisions. New York: Wiley.
Dobra, A., Fienberg, S.E. & Trottini, M. (2003). Assessing the risk of disclosure of confidential categorical data. In Bayesian Statistics 7, Proceedings Of The Seventh Valencia International Meeting On Bayesian Statistics, Eds. J.M.
Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith & M. West , pp. 125-144. Oxford University Press. Doyle, P., Lane, J.I., Theeuwes, J.J.M. & Zayatz, L.V. (2001). Confidentiality, Disclosure And Data Access: Theory And Practical Applications For Statistical Agencies. Amsterdam: Elsevier. Duncan, G.T., Elliot, M. & Salazar-González, J.-J. (2011). Statistical Confidentiality. New York: Springer. Duncan, G.T., Kelly-McNulty, S.A. & Stokes, S.L. (2001b). Disclosure risk vs. data utility: the R-U confidentiality map. Technical Report No. 121, National Institute Of Statistical Sciences, North Carolina. Duncan, G. & Lambert, D. (1986). Disclosure-limited data dissemination (with discussion). J. Amer. Statist. Assoc., 81, 10-28. 25 Duncan, G. & Lambert, D. (1989). The risk of disclosure for microdata. J. Bus. Econ. Statist., 7, 207-217. Dwork, C., McSherry, F., Nissim, K. & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Proceedings Of The 24th Annual International Cryptography Conference - CRYPTO, 528-544. New York: Springer. Elamir, E.A.H. & Skinner, C.J. (2006). Record level measures of disclosure risk for survey microdata. J. Official Statist., 22, 525-539. Elliot, M.J. & Dale, A. (1999). Scenarios of attach: the data intruder’s perspective on statistical disclosure risk. Netherlands Official Statist., 14, 6-10. Fellegi, I.P. (1972). On the question of statistical confidentiality. J. Amer. Statist. Assoc., 67, 7-18. Fellegi, I.P. & Sunter, A.B. (1969). A theory for record linkage. J. Amer. Statist. Assoc., 64, 1183-1210. Fienberg, S.E. (2006). Privacy and confidentiality in an e-commerce world: data mining, data warehousing, matching and disclosure limitation. Statist. Sci., 21, 143-154. Fienberg, S.E. & Makov, U.E. (1998). Confidentiality, uniqueness and disclosure limitation for categorical data. J. Official Statist., 14, 385-397. Fienberg, S.E., Makov, U.E. & Sanil, A.P. (1997). 
A Bayesian approach to data disclosure: optimal intruder behavior for continuous data. J. Official Statist., 13, 75-89. Frank, O. (1986). Comment on “Disclosure-limited data dissemination” by G. Duncan & D. Lambert. J. Amer. Statist. Assoc., 81, 21-22. French, S. (1988). Decision Theory: an Introduction to the Mathematics of Rationality. Chichester: Ellis Horwood. Fuller, W. (1983). Masking procedures for microdata disclosure limitation. J. Official Statist., 9, 383-406. Gotamam, S., Karr, A.F., Reiter, J.P. & Sanil, A.P. (2005a). Data dissemination and disclosure limitation in a world without microdata: a risk-utility framework for remote access analysis servers. Statist. Sci., 20, 163-177. Gotamam, S., Karr, A.F. & Sanil, A.P. (2005b). Data swapping as a decision problem. J. Official Statist., 21, 635-655. 26 Government Statistical Service (2009) National Statistician’s Guidance: Confidentiality Of Official Statistics. Office For National Statistics, UK. Hand, D. J. (2008). Privacy, data discs and realistic risk. Significance, 5, 11-14. Karr, A.F., Kohnen, C.N., Organian, A., Reiter, J.P. & Sanil, A.P. (2006). A framework for evaluating the utility of data altered to protect confidentiality. Amer. Statistician, 60, 224-232. Karr, A.F., Lin, X., Sanil, A.P. & Reiter, J.P. (2005). Secure regressions on distributed databases. J. Comput. Graphical Statist., 14, 263-279. Keeney, R.L. & Raiffa, J. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: Wiley. Keller-McNulty, S., Nakhleh, C.W. & Singpurwalla, N.D. (2005). A paradigm for masking (camouflaging) information. Int. Statist. Rev., 73, 331-349. Lambert, D. (1993). Measures of disclosure risk and harm. J. Official Statist., 9 313-331. Marsh, C., Skinner, C., Arber, S., Penhale, B., Openshaw, S., Hobcraft, J., Lievesley, D. & Walford, N. (1993). The case for samples of anonymized records from the 1991 census. J. Roy. Statist. Soc. Ser. A, 154, 305-340. 
National Research Council (2005). Expanding Access To Research Data: Reconciling Risks And Opportunities. Panel On Data Access For Research Purposes, Committee On National Statistics. Washington DC: The National Academies Press. National Research Council (2007). Putting People On The Map: Protecting Confidentiality With Linked Social-Spatial Data. Panel On Confidentiality Issues Arising From The Integration Of Remotely Sensed And Self-Identifying Data, Eds. M.P. Guttmann & P.C. Stern. Washington DC: The National Academies Press. O’Hara, K., Whitley, E. & Whittall, P. (2011). Avoiding the jigsaw effect: experiences with Ministry of Justice reoffending Data. Technical Report , Electronics and Computer Science, University of Southampton. Paass, G. (1988). Disclosure risk and disclosure avoidance for microdata. J. Bus. Econ. Statist., 6, 487-500. Paass, G. & Wauschkuhn, U. (1985). Datenzugang, Datenschutz und Anonymisierung-Analysepotential und Identifizierbarkeit von Anonymisierten Individualdaten. Munich: Oldenbourg Verlag. 27 Reiter, J. (2005). Estimating risks of identification disclosure in microdata. J. Amer. Statist. Assoc., 100, 1103-1112. Reiter, J. (2009). Using multiple imputation to integrate and disseminate confidential microdata. Int. Statist. Rev., 77, 179195. Shlomo, N. & Skinner, C. (2010). Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. Ann. Applied Statist., 4, 1291-1310. Singer, E. (2004). Principles and practices related to scientific integrity. In R.M. Groves, F.J. Fowler, M.P. Couper, J.M. Lepkowski and Eleanor Singer. Survey Methodology. New York: Wiley. Skinner, C.J. (2007). The probability of identification: applying ideas from forensic science to disclosure risk assessment. J. Roy. Statist. Soc. Ser. A, 170, 195-212. Skinner, C.J. (2008). Assessing disclosure risk for record linkage. In Privacy In Statistical Databases, Lecture Notes In Computer Science 5262, Eds. J. 
Domingo-Ferrer & Y. Saygin, pp. 166-176. Berlin: Springer. Skinner, C.J. & Carter, R.G. (2003). Estimation of a measure of disclosure risk for survey microdata under unequal probability sampling. Survey Methodology 29, 177-180. Skinner, C.J. & Elliot, M.J. (2002). A measure of disclosure risk for microdata. J. Roy. Statist. Soc., Ser. B, 64, 855-867. Skinner, C.J. & Holmes, D.J. (1998). Estimating the re-identification risk per record in microdata. J. Official Statist., 14, 361-372. Skinner, C.J. & Shlomo, N. (2008). Assessing disclosure risk in survey microdata using log-linear models. J. Amer. Statist. Assoc., 103, 989-1001. Spruill, N.L. (1982). Measures of confidentiality. Proc. Surv. Res. Sect. Amer. Statist. Ass., 260-265. Trottini, M. (2001). A decision-theoretic approach to data disclosure problems. Res. Official Statist., 4, 7-22. Trottini, M. (2003). Decision Models For Data Disclosure Limitation, PhD Thesis, Department Of Statistics, CarnegieMellon University. 28 Trottini, M. & Fienberg, S.C. (2002). Modelling user uncertainty for disclosure risk and data utility. Intern. J. Uncertainty, Fuzziness and Knowledge-Based Systems, 10, 511-527. Vaughan, E.J. (1997). Risk Management. New York: Wiley. Wasserman, L. & Zhou, S. (2010). A statistical framework for differential privacy. J. Amer. Statist. Assoc., 105, 375-389. Willenborg, L. & De Waal, T. (2001). Elements Of Statistical Disclosure Control. New York: Springer. Winkler, W.E. (2004). Masking and re-identification methods for public use microdata: overview and research problems. In Privacy In Statistical Databases, Lecture Notes In Computer Science 3050, Eds. J. Domingo-Ferrer & V. Torra, pp. 231-246. Berlin: Springer. Woo, M., Reiter, J.P., Organian, A. & Karr, A.F. (2009). Global measures of data utility for microdata masked for disclosure limitation. J. Privacy Confident., 1, 111-124. Zaslavsky, A. M. & Horton, N. J. (1998). Balancing disclosure risk against the loss of nonpublication. J. 
Official Statist., 14, 411-419. 29