Download Jerry's presentation on risk measures

Estimating Identification Risks for Microdata
Jerome P. Reiter
Institute of Statistics and Decision Sciences, Duke University, Durham NC, USA

Measures of identification disclosure risk
- Number of population uniques:
  - Does not incorporate intruders' knowledge.
  - May not be useful for continuous data.
  - Hard to gauge effects of SDL procedures.
  - Hard to estimate accurately.
- Probability-based methods (direct matching using external databases; indirect matching using existing data set):
  - Require assumptions about intruder behavior.
  - May be costly to obtain external databases.

Notation for methods
- Actual record j: $y_j = (y_j^{A}, y_j^{U})$
- Released record j: $z_j = (z_j^{A}, z_j^{U})$
- Available data: $z_j^{A} = (z_j^{Ap}, z_j^{Ad})$
- Unavailable + perturbed data combined: $z_j^{C} = (z_j^{U}, z_j^{Ap})$

Probability of identification
- Let J = j when record j in Z matches the target record, t.
- J = r + 1 when target is not in Z.

\[
\Pr(J = j \mid t, Z) = \frac{\Pr(Z^{C} \mid J = j, t, Z^{Ad}) \, \Pr(J = j \mid t, Z^{Ad})}{\sum_{j'=1}^{r+1} \Pr(Z^{C} \mid J = j', t, Z^{Ad}) \, \Pr(J = j' \mid t, Z^{Ad})}
\]

Calculating $\Pr(J = j \mid t, Z^{Ad})$
CASE 1: Target assumed to be in Z:
- Units whose $z_j^{Ad}$ do not match target's values have zero probability.
- For matches, probability equals $1/n_t$, where $n_t$ is number of matches in Z.
- Probability equals zero for j = r + 1.

Calculating $\Pr(J = j \mid t, Z^{Ad})$
CASE 2: Target not assumed to be in Z:
- Units whose $z_j^{Ad}$ do not match target's values have zero probability.
- For matches, probability is $1/N_t$, where $N_t$ is number of matches in the population.
- For j = r + 1, probability is $(N_t - n_t)/N_t$.

Splitting $\Pr(Z^{C} \mid J = j, t, Z^{Ad})$

\[
\Pr(Z^{C} \mid J = j, t, Z^{Ad}) = \Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})
\times \Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})
\times \Pr(z_1^{C}, \ldots, z_{j-1}^{C}, z_{j+1}^{C}, \ldots, z_r^{C} \mid z_j^{C}, J = j, t, Z^{Ad})
\]

Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$
Data swapping:
- Repeatedly simulate swapping mechanism using Z.
- Estimate probabilities for combinations of original + swapped values.
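
The prior probabilities in Cases 1 and 2, and the normalization in the probability-of-identification formula, can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's implementation: the function names, the tuple representation of the unperturbed available values, and the idea of passing a per-record likelihood list are all assumptions made for the example.

```python
from typing import Hashable, List, Optional, Sequence


def prior_match_probs(
    target_ad: Hashable,
    released_ad: Sequence[Hashable],
    pop_matches: Optional[int] = None,
) -> List[float]:
    """Return Pr(J = j | t, Z^Ad) for j = 1..r, plus a last entry for J = r + 1.

    target_ad   -- the target's unperturbed available values (hypothetical encoding)
    released_ad -- the same values for each of the r released records
    pop_matches -- N_t, matches in the population; None means Case 1
    """
    matches = [rec == target_ad for rec in released_ad]
    n_t = sum(matches)
    if pop_matches is None:
        # Case 1: target assumed to be in Z, so mass is split over the n_t matches.
        if n_t == 0:
            raise ValueError("Case 1 requires at least one matching record")
        return [1.0 / n_t if m else 0.0 for m in matches] + [0.0]
    # Case 2: target not assumed to be in Z.
    if pop_matches <= 0 or pop_matches < n_t:
        raise ValueError("N_t must be positive and at least n_t")
    probs = [1.0 / pop_matches if m else 0.0 for m in matches]
    return probs + [(pop_matches - n_t) / pop_matches]


def posterior_match_probs(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """Normalize prior * likelihood over j = 1..r+1 (the Bayes formula above)."""
    weights = [p * l for p, l in zip(prior, likelihood)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero; nothing to normalize")
    return [w / total for w in weights]
```

For example, with three released records of which two match the target, `prior_match_probs(t, Z)` puts probability 1/2 on each match under Case 1, while passing `pop_matches=4` puts 1/4 on each match and 1/2 on the target not being in the file.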

Calculating $\Pr(z_j^{Ap} \mid J = j, t, Z^{Ad})$
Noise addition:
- Assume variable k perturbed using Gaussian noise with mean zero and known variance $\sigma^2$.

\[
\Pr(z_{jk}^{Ap} \mid J = j, t, Z^{Ad}) = N(z_{jk} \mid t_{jk}, \sigma^2)
\]

Calculating $\Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})$

\[
\Pr(z_j^{U} \mid z_j^{A}, t, Z^{Ad}) = \int \Pr(z_j^{U} \mid y_j^{U}, z_j^{A}, t, Z^{Ad}) \, \Pr(y_j^{U} \mid z_j^{A}, t, Z^{Ad}) \, dy_j^{U}
\]

- First distribution is for SDL methods.
- Second distribution is best model for predicting unavailable variables given what is known.

Calculating $\Pr(z_j^{U} \mid z_j^{Ap}, J = j, t, Z^{Ad})$

\[
\Pr(z_j^{U} \mid z_j^{A}, t, Z^{Ad}) = 1
\]

when values in U are not perturbed.
- Intruders may act this way to avoid computations.
- It is prudent to evaluate risk assuming they do.

Calculating $\Pr(z_1^{C}, \ldots, z_{j-1}^{C}, z_{j+1}^{C}, \ldots, z_r^{C} \mid z_j^{C}, J = j, t, Z^{Ad})$
Assume independence to obtain:

\[
\prod_{i \neq j} \Pr(z_i^{C} \mid z_i^{Ad})
\quad \text{where} \quad
\Pr(z_i^{C} \mid z_i^{Ad}) = \int \Pr(z_i^{C} \mid y_i^{C}, z_i^{Ad}) \, \Pr(y_i^{C} \mid z_i^{Ad}) \, dy_i^{C}
\]

Simulations
- 51,016 heads of household from 2000 CPS.
- Potentially available variables: Age, Sex, Race, Marital Status, Property Tax
- Unavailable variables: Education, Income, Social Security, Child Support Payments

Simulations: SDL Procedures
- Age: Group in five-year intervals.
- Race and Marital Status: Swap randomly 30% of values for each variable.
- Property taxes: For positive taxes, add noise from $N(0, 290^2)$. Constrain values to be positive. Do not alter 0s.
- Other variables: Leave at original values.

Simulations: Targets
- Everyman: has values near median for all variables.
- Unique: Sample unique on combination of age, sex, race, marital status.
- Big I: Highest income in data set.
- Big P: Highest property tax in data set.

Simulations: Summary of results
- Swaps needed to protect Unique.
- Age recode plus swaps good protection.
- Knowing property taxes greatly increases probabilities of identification.
- Adding noise to positive tax values is not sufficient. (Top-coding helps.)
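
Two of the SDL procedures used in the simulations (the age recode and the noise added to property taxes), together with the Gaussian likelihood an intruder would evaluate for a noised variable, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for the study: the function names are hypothetical, the five-year intervals are assumed to start at multiples of five, and positivity is assumed to be enforced by redrawing the noise, since the slides do not say how the constraint was applied.

```python
import math
import random

SIGMA = 290.0  # noise standard deviation from the slides: N(0, 290^2)


def group_age(age: int, width: int = 5) -> str:
    """Recode age into an interval label such as '35-39'.

    Assumption: intervals start at multiples of `width` (0-4, 5-9, ...).
    """
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def perturb_tax(tax: float, rng: random.Random, sigma: float = SIGMA) -> float:
    """Add N(0, sigma^2) noise to a positive tax value; leave zeros unaltered.

    Assumption: the result is kept positive by redrawing the noise until
    the perturbed value is above zero.
    """
    if tax < 0:
        raise ValueError("tax values are expected to be zero or positive")
    if tax == 0:
        return 0.0
    while True:
        noisy = tax + rng.gauss(0.0, sigma)
        if noisy > 0:
            return noisy


def noise_likelihood(z: float, t: float, sigma: float = SIGMA) -> float:
    """Gaussian density N(z | t, sigma^2) for a released value z and target value t.

    This follows the untruncated normal form on the noise-addition slide and
    ignores the positivity constraint.
    """
    return math.exp(-0.5 * ((z - t) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
```

A seeded generator such as `random.Random(0)` makes the perturbation reproducible. The density falls off quickly once a released tax value is several multiples of 290 away from the target's value, which is consistent with the summary point that noise alone gives little protection to a target with the highest property tax.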

