
Using the Latent Class Model to Analyse Misclassified Data

Ardo van den Hout & Peter G.M. van der Heijden
Department of Methodology and Statistics, Utrecht University, The Netherlands

Content:
1. Misclassification as a mixture
2. Examples where misclassification probabilities are known: Post Randomisation Method (PRAM), Randomized Response (RR)
3. Frequency estimation
4. Latent class model
5. Loglinear analysis of RR data and PRAM data
6. Conclusion

1. Misclassification (MC) as a mixture

· Let A be a binary variable and A* its observed misclassified counterpart. Let the conditional MC probabilities be given by
  $p_{ij} = P(A^* = i \mid A = j), \quad i, j \in \{1, 2\}.$
  Then
  $P(A^* = 1) = p_{11} P(A = 1) + p_{12} P(A = 2).$
· Mixture formulation for A with categories 1, ..., K:
  $P(A^* = i) = \sum_{k=1}^{K} P(A^* = i \mid A = k) \, P(A = k), \quad i \in \{1, \ldots, K\}.$
  When the $p_{ij}$ are known, this is a known component density model (Lindsay, 1995).
· General research question: how to analyse misclassified data when the $p_{ij}$ are known?

Post Randomisation Method (PRAM)

PRAM is a method for Statistical Disclosure Control (SDC).
Objective of SDC: protecting the privacy of respondents when data matrices are released to outside users.

[Figure: population (N units) → original sample (n units, variables A1 ... Ap) → released sample (n units, variables A*1 ... A*p).]

Identifying variables: variables that can be used to identify a respondent.
Direct identifiers (name, address, etc.) are deleted from the microdata.
Indirect identifiers. Example:
Variables: Place of Residence, Gender, Profession.
In the microdata an unsafe combination of scores: Volendam × female × minister.

· PRAM concerns the indirect identifiers in unsafe combinations.
· PRAM is misclassification on purpose, where the misclassification probabilities are fixed.
· The misclassification probabilities are released together with the perturbed data matrix.
· The protection offered is uncertainty at the individual level.

References: Warner (1971), Rosenberg (1979), and Kooiman et al. (1997).
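The mixture formula and the PRAM perturbation step can be illustrated with a minimal Python sketch. The function names `observed_distribution` and `pram`, the 0-based category codes, and the numbers in the example are assumptions made for illustration; they do not come from the slides. The only convention taken from the text is that entry (i, j) of the transition matrix is P(A* = i | A = j), so each column sums to one.

```python
import random


def observed_distribution(P, latent):
    """Mixture formula: P(A* = i) = sum_k P[i][k] * P(A = k)."""
    K = len(latent)
    return [sum(P[i][k] * latent[k] for k in range(K)) for i in range(K)]


def pram(values, P, seed=0):
    """Perturb category codes 0..K-1 with a known transition matrix.

    Column j of P is the distribution of the released score A*
    given the true score A = j.
    """
    rng = random.Random(seed)
    categories = list(range(len(P)))
    return [
        rng.choices(categories, weights=[P[i][v] for i in categories])[0]
        for v in values
    ]


# Illustrative numbers only (not from the slides).
P = [[0.8, 0.2],
     [0.2, 0.8]]
latent = [0.3, 0.7]  # P(A = 1), P(A = 2) stored at indices 0 and 1

print(observed_distribution(P, latent))  # marginal distribution of A*
print(pram([0, 1, 1, 0, 1], P, seed=42))  # one perturbed copy of the scores
```

Because the released matrix `P` is known, the marginal distribution of A* is a fixed linear function of the distribution of A, which is what makes the later estimation steps possible.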
Randomized Response (RR)

Objective: protecting the privacy of respondents when sensitive questions are asked.

[Figure: latent status (n respondents, variables A1 ... Ap) → misclassification performed by respondents → observed values (n respondents, variables A*1 ... A*p).]

In our research concerning social benefit fraud, the RR questions were binary and the misclassification was performed using playing cards.

References: Warner (1965), Fox and Tracy (1986), Chaudhuri and Mukerjee (1988), and Kuk (1990).

RR design by Kuk (1990)

· Two stacks of cards: the yes-stack and the no-stack.
· Each stack has a known distribution of red cards and black cards.
· The respondent takes a card from each stack and keeps the colours hidden.
· If the latent answer to the sensitive question is yes, the respondent reveals the colour of the card taken from the yes-stack; if the answer is no, the colour of the card from the no-stack.

[Figure: the yes-stack, with labels 80%, red = yes, black = no, 20% (misclassification).]

· The protection offered by RR is the same as that offered by PRAM: uncertainty at the individual level.

Other examples with known misclassification probabilities:
· Known specificity and sensitivity in medical statistics.
· Disclosure control in data mining: PRAM-like methods to protect the privacy of web users.

3. Example: frequency estimation

Consider a binary variable A and let the conditional misclassification probabilities $p_{ij} = P(A^* = i \mid A = j)$ be given by the transition matrix
$P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.$
Let
$t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} \#\ \text{latent scores with value 1} \\ \#\ \text{latent scores with value 2} \end{pmatrix}$
and
$t^* = \begin{pmatrix} \#\ \text{observed scores with value 1 after MC} \\ \#\ \text{observed scores with value 2 after MC} \end{pmatrix}.$
Then
$E[T_1^* \mid t] = p_{11} t_1 + p_{12} t_2,$
$E[T_2^* \mid t] = p_{21} t_1 + p_{22} t_2.$
The equality
$E[T^* \mid t] = P t$
provides an unbiased moment estimator:
$\hat{t} = P^{-1} t^*$
(Chaudhuri and Mukerjee, 1988; Kooiman et al., 1997).

· The estimation of a multi-dimensional table goes in the same way. (MC of latent variable A is independent of MC of latent variable B.)
· Assuming a multinomial distribution: when the moment estimator is in the interior of the parameter space, it is the MLE (Van den Hout and Van der Heijden, 2002).

4. Latent Class Model (LCM)

Standard LCM with 1 latent variable X and 3 manifest variables S, T, U:
$\pi^{STU}_{stu} = \sum_{x} \pi^{X}_{x} \, \pi^{S|X}_{s|x} \, \pi^{T|X}_{t|x} \, \pi^{U|X}_{u|x}$
where, e.g.,
$\pi^{STU}_{stu} = P[S = s, T = t, U = u],$
$\pi^{S|X}_{s|x} = P[S = s \mid X = x].$

· Local independence
· Number of categories of X unknown
· The model might not be identified

LCM for PRAM data and RR data

· Assume that the latent variables A, B, and C are misclassified into the observed A*, B*, and C*, respectively. The LCM is given by
  $\pi^{A^*B^*C^*}_{a^*b^*c^*} = \sum_{a,b,c} \pi^{ABC}_{abc} \, \pi^{A^*|A}_{a^*|a} \, \pi^{B^*|B}_{b^*|b} \, \pi^{C^*|C}_{c^*|c},$
  where the conditional probabilities are given and the number of categories of each latent variable is known.
· External variables can be included easily:
  $\pi^{A^*B^*C^*P}_{a^*b^*c^*p} = \sum_{a,b,c} \pi^{ABCP}_{abcp} \, \pi^{A^*|A}_{a^*|a} \, \pi^{B^*|B}_{b^*|b} \, \pi^{C^*|C}_{c^*|c}.$
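The moment estimator of section 3 and the LCM with known conditional probabilities can be tied together in a small sketch, restricted here to two binary variables A and B for brevity. The helper names (`inv2`, `matmul`, `transpose`, `moment_estimate_1d`, `observed_table`, `moment_estimate_2d`) and all numbers are hypothetical and chosen only for illustration. With independent misclassification, the observed table equals P_A · T · P_Bᵀ, so the latent table can be recovered by applying each inverse separately. The sketch does not check whether the estimate lies in the interior of the parameter space, which is the condition under which the slides identify it with the MLE.

```python
def inv2(P):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("transition matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def transpose(X):
    return [list(row) for row in zip(*X)]


def moment_estimate_1d(P, t_star):
    """Univariate moment estimator: t_hat = P^{-1} t*."""
    Pinv = inv2(P)
    return [sum(Pinv[i][j] * t_star[j] for j in range(2)) for i in range(2)]


def observed_table(PA, PB, latent):
    """Expected observed table: sum over a, b of latent[a][b] * PA[a*][a] * PB[b*][b]."""
    return matmul(matmul(PA, latent), transpose(PB))


def moment_estimate_2d(PA, PB, observed):
    """Two-way moment estimator: PA^{-1} T* (PB^{-1})^T."""
    return matmul(matmul(inv2(PA), observed), transpose(inv2(PB)))


# Illustrative numbers only (not from the slides).
PA = [[0.8, 0.2], [0.2, 0.8]]
PB = [[0.9, 0.1], [0.1, 0.9]]
latent = [[30.0, 20.0], [10.0, 40.0]]

expected_obs = observed_table(PA, PB, latent)
print(moment_estimate_1d(PA, [38.0, 62.0]))       # estimated latent frequencies of A
print(moment_estimate_2d(PA, PB, expected_obs))   # recovers the latent table up to rounding
```

In practice the observed table is a random draw rather than its expectation, so the estimate can fall outside the parameter space (negative cells); handling that case is outside the scope of this sketch.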
