Jerry's presentation on risk measures
Estimating Identification Risks for Microdata
Jerome P. Reiter
Institute of Statistics and Decision Sciences
Duke University, Durham NC, USA

Measures of identification disclosure risk

Number of population uniques:
- Does not incorporate intruders' knowledge.
- May not be useful for continuous data.
- Hard to gauge effects of SDL procedures.
- Hard to estimate accurately.

Probability-based methods (direct matching using external databases; indirect matching using existing data set):
- Require assumptions about intruder behavior.
- May be costly to obtain external databases.

Notation for methods

- Actual record j: $y_j = (y_j^{A}, y_j^{U})$
- Released record j: $z_j = (z_j^{A}, z_j^{U})$
- Available data: $z_j^{A} = (z_j^{Ap}, z_j^{Ad})$
- Unavailable + perturbed data combined: $z_j^{C} = (z_j^{U}, z_j^{Ap})$

Probability of identification

Let J = j when record j in Z matches the target record, t. Let J = r + 1 when the target is not in Z.

$$\Pr(J = j \mid t, Z) = \frac{\Pr(Z^{C} \mid J = j, t, Z^{Ad}) \, \Pr(J = j \mid t, Z^{Ad})}{\sum_{j=1}^{r+1} \Pr(Z^{C} \mid J = j, t, Z^{Ad}) \, \Pr(J = j \mid t, Z^{Ad})}$$

Calculating $\Pr(J = j \mid t, Z^{Ad})$

CASE 1: Target assumed to be in Z:
- Units whose $z_j^{Ad}$ do not match target's values have zero probability.
- For matches, probability equals $1/n_t$, where $n_t$ is the number of matches in Z.
- Probability equals zero for j = r + 1.

Calculating $\Pr(J = j \mid t, Z^{Ad})$

CASE 2: Target not assumed to be in Z:
- Units whose $z_j^{Ad}$ do not match target's values have zero probability.
- For matches, probability is $1/N_t$, where $N_t$ is the number of matches in the population.
- For j = r + 1, probability is $(N_t - n_t)/N_t$.

Splitting $\Pr(Z^{C} \mid J = j, t, Z^{Ad})$

$$\Pr(Z^{C} \mid J = j, t, Z^{Ad}) = \Pr(z_j^{Ap} \mid J = j, t, Z^{Ad}) \times \Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad}) \times \Pr(z_1^{C}, \ldots, z_{j-1}^{C}, z_{j+1}^{C}, \ldots, z_r^{C} \mid z_j^{C}, J = j, t, Z^{Ad})$$

Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$

Data swapping:
- Repeatedly simulate swapping mechanism using Z.
- Estimate probabilities for combinations of original + swapped values.
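The data swapping step above can be illustrated with a small Monte Carlo sketch. It assumes a simple mechanism in which a fixed fraction of records is drawn at random, paired up, and each pair exchanges its value of one categorical variable; the slides do not spell out the pairing details, so that mechanism, the function names `swap_values` and `estimate_swap_probs`, and all parameters are hypothetical. The column passed in stands for one variable taken from the released data Z.

```python
import random
from collections import Counter, defaultdict


def swap_values(values, swap_rate, rng):
    """Return a copy of `values` in which randomly chosen pairs are exchanged.

    Assumed mechanism (not from the slides): pick about swap_rate * n
    records at random, pair them up, and swap values within each pair.
    """
    released = list(values)
    n_swap = int(round(swap_rate * len(values)))
    n_swap -= n_swap % 2  # an even number of records is needed to form pairs
    chosen = rng.sample(range(len(values)), n_swap)
    for a, b in zip(chosen[0::2], chosen[1::2]):
        released[a], released[b] = released[b], released[a]
    return released


def estimate_swap_probs(values, swap_rate, n_sims=200, seed=0):
    """Monte Carlo estimate of Pr(value after swapping | value before swapping)."""
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    for _ in range(n_sims):
        released = swap_values(values, swap_rate, rng)
        for before, after in zip(values, released):
            counts[before][after] += 1
    # Normalise the counts within each starting value.
    return {
        before: {after: c / sum(ctr.values()) for after, c in ctr.items()}
        for before, ctr in counts.items()
    }


# Usage: a toy marital-status column standing in for a variable in Z.
column = ["married"] * 60 + ["single"] * 30 + ["widowed"] * 10
probs = estimate_swap_probs(column, swap_rate=0.3)
print(probs["widowed"])
```

The estimated table plays the role of the swapping term in the factorization: for a candidate record, look up the probability of its released value given the target's known value.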
Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$

Noise addition:
- Assume variable k is perturbed using Gaussian noise with mean zero and known variance $\sigma^2$.

$$\Pr(z_{jk}^{Ap} \mid J = j, t, Z^{Ad}) = N(z_{jk}^{Ap} \mid t_{jk}, \sigma^2)$$

Calculating $\Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})$

$$\Pr(z_j^{U} \mid z_j^{A}, t, Z^{Ad}) = \int \Pr(z_j^{U} \mid y_j^{U}, z_j^{A}, t, Z^{Ad}) \, \Pr(y_j^{U} \mid z_j^{A}, t, Z^{Ad}) \, dy_j^{U}$$

- First distribution is for SDL methods.
- Second distribution is best model for predicting unavailable variables given what is known.

Calculating $\Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})$

$$\Pr(z_j^{U} \mid z_j^{A}, t, Z^{Ad}) = 1$$

when values in U are not perturbed.
- Intruders may act this way to avoid computations.
- It is prudent to evaluate risk assuming they do.

Calculating $\Pr(z_1^{C}, \ldots, z_{j-1}^{C}, z_{j+1}^{C}, \ldots, z_r^{C} \mid z_j^{C}, J = j, t, Z^{Ad})$

Assume independence to obtain:

$$\prod_{i \neq j} \Pr(z_i^{C} \mid z_i^{Ad})$$

where

$$\Pr(z_i^{C} \mid z_i^{Ad}) = \int \Pr(z_i^{C} \mid y_i^{C}, z_i^{Ad}) \, \Pr(y_i^{C} \mid z_i^{Ad}) \, dy_i^{C}$$

Simulations

- 51,016 heads of household from 2000 CPS.
- Potentially available variables: Age, Sex, Race, Marital Status, Property Tax
- Unavailable variables: Education, Income, Social Security, Child Support Payments

Simulations: SDL Procedures

- Age: Group in five-year intervals.
- Race and Marital Status: Swap randomly 30% of values for each variable.
- Property taxes: For positive taxes, add noise from $N(0, 290^2)$. Constrain values to be positive. Do not alter 0s.
- Other variables: Leave at original values.

Simulations: Targets

- Everyman: has values near median for all variables.
- Unique: Sample unique on combination of age, sex, race, marital status.
- Big I: Highest income in data set.
- Big P: Highest property tax in data set.

Simulations: Summary of results

- Swaps needed to protect Unique.
- Age recode plus swaps give good protection.
- Knowing property taxes greatly increases probabilities of identification.
- Adding noise to positive tax values is not sufficient. (Top-coding helps.)
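To make the identification probability concrete, the following sketch combines the Case 1 and Case 2 match probabilities with the Gaussian noise likelihood for a single perturbed variable and normalizes over j = 1, ..., r + 1. It is an illustration under stated assumptions, not the procedure used in the simulations: unavailable variables are ignored (their term is set to 1, as intruders may do), the other records are treated as independent with a user-supplied marginal density `marginal_pdf`, and the weight for j = r + 1 is taken to be the product of those marginals. All function names, the toy records, and the marginal density are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def prior_match_probs(released_ad, target_ad, pop_matches=None):
    """Pr(J = j | t, Z^Ad) for j = 1..r, followed by the entry for j = r + 1.

    pop_matches=None: Case 1 (target assumed in Z), 1/n_t for each match.
    pop_matches=N_t:  Case 2, 1/N_t for each match, (N_t - n_t)/N_t for r + 1.
    """
    matches = [z == target_ad for z in released_ad]
    n_t = sum(matches)
    if pop_matches is None:
        denom, outside = n_t, 0.0
    else:
        denom, outside = pop_matches, (pop_matches - n_t) / pop_matches
    return [1.0 / denom if m else 0.0 for m in matches] + [outside]


def identification_probs(released_ad, released_noisy, target_ad, target_value,
                         noise_sd, marginal_pdf, pop_matches=None):
    """Pr(J = j | t, Z) for j = 1..r+1 with one noise-perturbed variable."""
    prior = prior_match_probs(released_ad, target_ad, pop_matches)
    # Likelihood ratio: "record j is the target" versus "record j is another
    # unit". The product of marginal_pdf over all records is common to every
    # j and cancels in the normalisation (assumed independence).
    ratios = [normal_pdf(z, target_value, noise_sd) / marginal_pdf(z)
              for z in released_noisy]
    ratios.append(1.0)  # j = r + 1: no released record belongs to the target
    weights = [p * w for p, w in zip(prior, ratios)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no released record matches the target on Z^Ad")
    return [w / total for w in weights]


# Toy example: (age group, sex) are unperturbed; property tax has added noise.
released_ad = [("40-44", "F"), ("40-44", "F"), ("40-44", "F"), ("25-29", "M")]
released_tax = [1510.0, 2950.0, 800.0, 1490.0]


def marginal(z):
    """Hypothetical marginal density of released tax values."""
    return normal_pdf(z, 1500.0, 1000.0)


post = identification_probs(released_ad, released_tax, ("40-44", "F"), 1500.0,
                            noise_sd=290.0, marginal_pdf=marginal)
print([round(p, 3) for p in post])
```

In this toy case the fourth record gets zero probability because its unperturbed values do not match the target, and among the three matches the record whose released tax is closest to the target's value receives most of the probability, which mirrors the finding that knowing property taxes sharply raises identification risk.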