Using the Latent Class Model to Analyse Misclassified Data

Ardo van den Hout & Peter G.M. van der Heijden
Department of Methodology and Statistics
Utrecht University, The Netherlands

Content:
1. Misclassification as a mixture
2. Examples where misclassification probabilities are known:
   Post Randomisation Method (PRAM)
   Randomized Response (RR)
3. Frequency estimation
4. Latent class model
5. Loglinear analysis of RR data and PRAM data
6. Conclusion

1. Misclassification (MC) as a mixture

· Let A be a binary variable and A* its observed misclassified counterpart. Let conditional MC probabilities be given by $p_{ij} = P(A^* = i \mid A = j)$, $i, j \in \{1, 2\}$. Then
$$P(A^* = 1) = p_{11} P(A = 1) + p_{12} P(A = 2).$$
· Mixture formulation for A with categories 1, ..., K:
$$P(A^* = i) = \sum_{k=1}^{K} P(A^* = i \mid A = k)\, P(A = k),$$
where $i \in \{1, \ldots, K\}$. When the $p_{ij}$'s are known, it is a known component density model (Lindsay, 1995).
· General research question: how to analyse misclassified data when the $p_{ij}$'s are known?

Post Randomisation Method (PRAM)

PRAM is a method for Statistical Disclosure Control (SDC).
Objective of SDC: protecting the privacy of respondents when data matrices are released to outside users.

[Figure: Population (N units) → Original sample (n records, variables $A_1, \ldots, A_p$) → Released sample (n records, variables $A^*_1, \ldots, A^*_p$)]

Identifying variables: variables that can be used to identify a respondent.
Direct identifiers (name, address etc.) are deleted from the microdata.
Indirect identifiers. Example: variables Place of Residence, Gender, Profession. In the microdata an unsafe combination of scores: Volendam × female × minister.

· PRAM concerns the indirect identifiers in unsafe combinations.
· PRAM is misclassification on purpose where misclassification probabilities are fixed.
· The misclassification probabilities are released together with the perturbed data matrix.
· Protection offered is uncertainty at individual level.

References: Warner (1971), Rosenberg (1979), and Kooiman et al. (1997).
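The PRAM step and the mixture formula from Section 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the transition matrix, the latent distribution, the sample size, and the function names `pram` and `observed_distribution` are hypothetical and do not come from the slides. Categories are coded 0, ..., K-1, and `P[i][j]` stands for $P(A^* = i \mid A = j)$, so each column sums to 1.

```python
import random


def pram(values, P, rng):
    """Perturb categorical scores with a known transition matrix.

    P[i][j] = P(A* = i | A = j); each latent score j is replaced by a
    draw from column j of P.
    """
    K = len(P)
    released = []
    for j in values:
        column = [P[i][j] for i in range(K)]
        released.append(rng.choices(range(K), weights=column)[0])
    return released


def observed_distribution(P, pi):
    """Mixture formula: P(A* = i) = sum_k P(A* = i | A = k) P(A = k)."""
    K = len(P)
    return [sum(P[i][k] * pi[k] for k in range(K)) for i in range(K)]


# Hypothetical misclassification probabilities and latent distribution.
P = [[0.9, 0.2],
     [0.1, 0.8]]
pi = [0.3, 0.7]

# Distribution of the released variable implied by the mixture formula.
pi_star = observed_distribution(P, pi)

# Simulated original sample with exactly the latent shares in pi.
rng = random.Random(0)
original = [0] * 3000 + [1] * 7000
released = pram(original, P, rng)
share_released = released.count(0) / len(released)

print(pi_star, share_released)
```

The simulated share of category 0 in the released sample should lie close to the first entry of `pi_star`, while any single released score remains uncertain, which is the protection PRAM is meant to offer.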
Randomized Response (RR)

Objective: protecting the privacy of respondents when sensitive questions are asked.

[Figure: Latent status (n respondents, variables $A_1, \ldots, A_p$) → misclassification performed by respondents → Observed values (n respondents, variables $A^*_1, \ldots, A^*_p$)]

In our research concerning social benefit fraud, RR questions were binary and the misclassification was performed using playing cards.

References: Warner (1965), Fox and Tracy (1986), Chaudhuri and Mukerjee (1988), and Kuk (1990).

RR design by Kuk (1990)

· Two stacks of cards: the yes-stack and the no-stack.
· Each stack has a known distribution of red cards and black cards.
· The respondent takes a card from each stack and keeps the colours hidden.
· If the latent answer to the sensitive question is yes, the respondent reveals the colour of the card taken from the yes-stack; if the answer is no, the colour of the card from the no-stack.

[Figure: yes-stack example. Red (80%) is observed as yes; black (20%) is observed as no, i.e. a misclassification.]

· Protection offered by RR is the same as by PRAM: uncertainty at the individual level.

Other examples with known misclassification probabilities:
· Known specificity and sensitivity in statistics concerning medicine.
· Disclosure control in data mining. PRAM-like methods to protect privacy of web-users.

3. Example: frequency estimation

Consider binary variable A and let the conditional misclassification probabilities $p_{ij} = P(A^* = i \mid A = j)$ be given by the transition matrix
$$P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.$$
Let
$$t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} \#\ \text{latent scores with value 1} \\ \#\ \text{latent scores with value 2} \end{pmatrix}$$
and
$$t^* = \begin{pmatrix} \#\ \text{observed scores with value 1 after MC} \\ \#\ \text{observed scores with value 2 after MC} \end{pmatrix}.$$
Then
$$E[T_1^* \mid t] = p_{11} t_1 + p_{12} t_2, \qquad E[T_2^* \mid t] = p_{21} t_1 + p_{22} t_2.$$
The equality
$$E[T^* \mid t] = P\,t$$
provides an unbiased moment estimator:
$$\hat{t} = P^{-1} t^*$$
(Chaudhuri and Mukerjee, 1988; Kooiman et al., 1997).

· The estimation of a multi-dimensional table goes in the same way. (MC of latent variable A is independent of MC of latent variable B.)
· Assuming a multinomial distribution: when the moment estimator is in the interior of the parameter space, it is the MLE. (Van den Hout and Van der Heijden, 2002)

4. Latent Class Model (LCM)

Standard LCM with 1 latent variable X, and 3 manifest variables S, T, U:
$$\pi^{STU}_{stu} = \sum_{x} \pi^{X}_{x}\, \pi^{S|X}_{s|x}\, \pi^{T|X}_{t|x}\, \pi^{U|X}_{u|x},$$
where, e.g.,
$$\pi^{STU}_{stu} = P[S = s, T = t, U = u], \qquad \pi^{S|X}_{s|x} = P[S = s \mid X = x].$$

· Local independence
· Number of categories of X unknown
· Model might not be identified

LCM for PRAM data and RR data

· Assume that latent variables A, B, and C are misclassified in observed A*, B*, and C*, respectively. The LCM is given by
$$\pi^{A^*B^*C^*}_{a^*b^*c^*} = \sum_{a,b,c} \pi^{ABC}_{abc}\, \pi^{A^*|A}_{a^*|a}\, \pi^{B^*|B}_{b^*|b}\, \pi^{C^*|C}_{c^*|c},$$
where the conditional probabilities are given and the number of categories of each latent variable is known.
· External variables can be included easily:
$$\pi^{A^*B^*C^*P}_{a^*b^*c^*p} = \sum_{a,b,c} \pi^{ABCP}_{abcp}\, \pi^{A^*|A}_{a^*|a}\, \pi^{B^*|B}_{b^*|b}\, \pi^{C^*|C}_{c^*|c}.$$
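The moment estimator of Section 3 and the factorised misclassification structure used in the LCM for PRAM and RR data can be illustrated with a small Python sketch. It is a sketch under assumptions, not the authors' implementation: the transition matrices `P_A` and `P_B`, the counts, and all function names are hypothetical. For two variables that are misclassified independently, the transition matrix of the cross-classified table is taken to be the Kronecker product of the two single-variable matrices, with cells ordered (1,1), (1,2), (2,1), (2,2).

```python
def inverse_2x2(P):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("transition matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def kronecker(A, B):
    """Kronecker product: entry ((i,k),(j,l)) equals A[i][j] * B[k][l]."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def is_interior(t_hat):
    """True when all estimated frequencies are strictly positive."""
    return all(x > 0 for x in t_hat)


# Hypothetical known misclassification probabilities, P[i][j] = P(A* = i | A = j).
P_A = [[0.8, 0.2],
       [0.2, 0.8]]
P_B = [[0.9, 0.1],
       [0.1, 0.9]]

# One variable: observed counts t* and the moment estimate t_hat = P^{-1} t*.
t_star = [38, 62]
t_hat = matvec(inverse_2x2(P_A), t_star)   # approximately [30, 70]

# Two variables: expected observed table from a latent table, and back.
t_AB = [10, 20, 30, 40]
P_AB = kronecker(P_A, P_B)
expected_observed = matvec(P_AB, t_AB)
P_AB_inv = kronecker(inverse_2x2(P_A), inverse_2x2(P_B))
recovered = matvec(P_AB_inv, expected_observed)

# An estimate outside the parameter space: here the moment estimator
# has a negative entry, so it cannot be the MLE.
outside = matvec(inverse_2x2(P_A), [10, 90])

print(t_hat, recovered, is_interior(t_hat), is_interior(outside))
```

The last case shows why the interior condition matters: with few observed scores in a category, inverting the transition matrix can produce a negative frequency, and a constrained likelihood approach is then needed instead of the plain moment estimate.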