Using the Latent Class Model to Analyse
Misclassified Data
Ardo van den Hout
&
Peter G.M. van der Heijden
Department of Methodology and Statistics
Utrecht University, The Netherlands
Content:
1. Misclassification as a mixture
2. Examples where misclassification probabilities
are known:
Post Randomisation Method (PRAM)
Randomized Response (RR)
3. Frequency estimation
4. Latent class model
5. Loglinear analysis of RR data and PRAM data
6. Conclusion
1. Misclassification (MC) as a mixture
· Let $A$ be a binary variable and $A^*$ its observed
misclassified counterpart. Let conditional MC
probabilities be given by $p_{ij} = P(A^* = i \mid A = j)$,
$i, j \in \{1, 2\}$. Then
$$P(A^* = 1) = p_{11} P(A = 1) + p_{12} P(A = 2).$$
· Mixture formulation for $A$ with categories $1, \dots, K$:
$$P(A^* = i) = \sum_{k=1}^{K} P(A^* = i \mid A = k)\, P(A = k),$$
where $i \in \{1, \dots, K\}$.
When the $p_{ij}$ are known, this is a known component
density model (Lindsay, 1995).
· General research question: how to analyse
misclassified data when the $p_{ij}$ are known?
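The mixture formulation can be sketched in a few lines of Python. This is only an illustration: the function name `observed_distribution` and all numerical values are assumptions, not taken from the slides. The matrix is stored so that `P[i][j]` is $P(A^* = i+1 \mid A = j+1)$, so each column sums to 1.

```python
def observed_distribution(P, latent):
    """Return P(A* = i) for every category i.

    P[i][j] is the known probability P(A* = i | A = j);
    latent[k] is P(A = k).
    """
    K = len(latent)
    return [sum(P[i][k] * latent[k] for k in range(K)) for i in range(K)]


# Hypothetical values, chosen only for illustration.
P = [[0.8, 0.2],
     [0.2, 0.8]]      # each column sums to 1
latent = [0.1, 0.9]   # P(A = 1), P(A = 2)

observed = observed_distribution(P, latent)
print(observed)       # distribution of the misclassified variable A*
```

With these values the observed distribution is pulled towards 0.5 relative to the latent one, which is the usual effect of symmetric misclassification.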
Post Randomisation Method (PRAM)
PRAM is a method for Statistical Disclosure Control
(SDC)
Objective of SDC:
Protecting the privacy of respondents when
data matrices are released to outside users.
[Figure: population (N) → original sample (n records, variables A1, ..., Ap) → released sample (n records, variables A*1, ..., A*p)]
Identifying variables: variables that can be used
to identify a respondent.
Direct identifiers (name, address etc.)
are deleted from the microdata.
Indirect identifiers.
Example:
Variables: Place of Residence,
Gender,
Profession.
In the microdata an unsafe combination of scores:
Volendam × female × minister
· PRAM concerns the indirect identifiers in
unsafe combinations.
· PRAM is misclassification on purpose where
misclassification probabilities are fixed.
· The misclassification probabilities are released
together with the perturbed data matrix.
· Protection offered is uncertainty at individual
level.
References:
Warner (1971), Rosenberg (1979),
and Kooiman et al. (1997).
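A minimal sketch of how PRAM could be applied to one indirect identifier, assuming each record is perturbed independently according to a fixed transition matrix. The function name `pram`, the category labels, and the probabilities are hypothetical; the slides do not specify them.

```python
import random


def pram(values, P, categories, rng):
    """Perturb each value independently.

    P[i][j] is the fixed probability that true category j
    is released as category i (columns sum to 1).
    """
    released = []
    for v in values:
        j = categories.index(v)
        weights = [P[i][j] for i in range(len(categories))]
        released.append(rng.choices(categories, weights=weights, k=1)[0])
    return released


rng = random.Random(0)  # seeded only to make the sketch reproducible
categories = ["minister", "teacher", "farmer"]  # hypothetical professions
P = [[0.90, 0.05, 0.05],
     [0.05, 0.90, 0.05],
     [0.05, 0.05, 0.90]]

original = ["minister", "teacher", "farmer", "teacher"] * 250
released = pram(original, P, categories, rng)
```

Under this scheme the released file would be accompanied by `P`, so that users can correct their analyses for the perturbation.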
Randomized Response (RR)
Objective:
Protecting the privacy of respondents
when sensitive questions are asked.
[Figure: latent status (variables A1, ..., Ap; n respondents) → misclassification performed by respondents → observed values (variables A*1, ..., A*p; n respondents)]
In our research concerning social benefit fraud, RR
questions were binary and the misclassification was
performed using playing cards.
References: Warner (1965), Fox and Tracy (1986),
Chaudhuri and Mukerjee (1988), and Kuk (1990)
RR design by Kuk (1990)
· Two stacks of cards: the yes-stack and the no-stack.
· Each stack has a known distribution of red
cards and black cards.
· Respondent takes a card from each stack and
keeps the colours hidden.
· If the latent answer to the sensitive question is yes, the
respondent reveals the colour of the card taken
from the yes-stack; if the answer is no, the colour
of the card from the no-stack.
[Figure: the yes-stack, with 80% red cards (reported as yes) and 20% black cards (reported as no, i.e. a misclassification)]
· Protection offered by RR is the same as by
PRAM: uncertainty at the individual level.
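Kuk's design translates directly into a transition matrix, as the sketch below tries to show. The 80% share of red cards in the yes-stack is taken from the slide; the 20% share of red cards in the no-stack is an assumption made here for illustration, as are the function names. Category 1 stands for yes/red and category 2 for no/black.

```python
import random


def kuk_transition_matrix(red_in_yes_stack, red_in_no_stack):
    """Return P with P[i][j] = P(A* = i | A = j), 1 = yes/red, 2 = no/black."""
    return [[red_in_yes_stack, red_in_no_stack],
            [1 - red_in_yes_stack, 1 - red_in_no_stack]]


def kuk_response(latent_yes, red_in_yes_stack, red_in_no_stack, rng):
    """Simulate one respondent: draw a card from each stack, report one colour."""
    yes_card = "red" if rng.random() < red_in_yes_stack else "black"
    no_card = "red" if rng.random() < red_in_no_stack else "black"
    # Only the card matching the latent answer is revealed.
    return yes_card if latent_yes else no_card


# 0.8 is from the slide; 0.2 for the no-stack is an assumed value.
P_kuk = kuk_transition_matrix(0.8, 0.2)
answer = kuk_response(True, 0.8, 0.2, random.Random(0))
```

The interviewer sees only the reported colour, never which stack it came from, which is where the individual-level uncertainty arises.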
Other examples with known misclassification
probabilities:
· Known sensitivity and specificity in medical
statistics.
· Disclosure control in data mining. PRAM-like
methods to protect privacy of web-users.
3. Example: frequency estimation
Consider a binary variable $A$ and let the conditional
misclassification probabilities $p_{ij} = P(A^* = i \mid A = j)$
be given by the transition matrix
$$P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}.$$
Let
$$t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} \#\ \text{latent scores with value 1} \\ \#\ \text{latent scores with value 2} \end{pmatrix}$$
and
$$t^* = \begin{pmatrix} \#\ \text{observed scores with value 1 after MC} \\ \#\ \text{observed scores with value 2 after MC} \end{pmatrix}.$$
Then
$$E[T_1^* \mid t] = p_{11} t_1 + p_{12} t_2$$
$$E[T_2^* \mid t] = p_{21} t_1 + p_{22} t_2$$
The equality
$$E[T^* \mid t] = P t$$
provides an unbiased moment estimator:
$$\hat{t} = P^{-1} t^*$$
(Chaudhuri and Mukerjee, 1988, Kooiman et al. ,
1997)
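A small numerical sketch of the moment estimator for the binary case, using only the standard library. The transition matrix and the observed counts are hypothetical, and the helper names `invert_2x2` and `moment_estimate` are introduced here for illustration.

```python
def invert_2x2(P):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("P is singular; the moment estimator is undefined")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def moment_estimate(P, t_star):
    """Return t_hat = P^{-1} t* for observed counts t*."""
    P_inv = invert_2x2(P)
    return [sum(P_inv[i][j] * t_star[j] for j in range(2)) for i in range(2)]


# Hypothetical transition matrix and observed counts.
P = [[0.8, 0.2],
     [0.2, 0.8]]
t_star = [260, 740]

t_hat = moment_estimate(P, t_star)
print(t_hat)  # estimated latent counts, approximately [100, 900]
```

Note that the estimate is not restricted to be non-negative: with the same `P` and observed counts `[100, 900]`, the first estimated count would be negative.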
· The estimation of a multi-dimensional table works
in the same way. (MC of latent variable A is
independent of MC of latent variable B.)
· Assuming a multinomial distribution: when the
moment estimator is in the interior of the parameter
space, it is the MLE. (Van den Hout and Van
der Heijden, 2002)
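For a two-way table the same idea can be sketched as follows, assuming binary A and B that are misclassified independently with known matrices `P_A` and `P_B` (both hypothetical here). With the observed table written as $T^*$ (rows $a^*$, columns $b^*$), the expected observed table is $P_A T P_B^\top$, so the moment estimate is $P_A^{-1} T^* (P_B^{-1})^\top$. The helper `in_interior` is an illustrative check of whether all estimated counts are strictly positive.

```python
def invert_2x2(M):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul(X, Y):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(M):
    return [list(row) for row in zip(*M)]


def estimate_table(P_A, P_B, observed):
    """Moment estimate of the latent A x B table from the observed A* x B* table."""
    return matmul(matmul(invert_2x2(P_A), observed), transpose(invert_2x2(P_B)))


def in_interior(table):
    """True when every estimated cell count is strictly positive."""
    return all(cell > 0 for row in table for cell in row)


# Hypothetical transition matrices and observed table.
P_A = [[0.8, 0.2], [0.2, 0.8]]
P_B = [[0.9, 0.1], [0.1, 0.9]]
observed_table = [[74, 186], [146, 594]]

latent_table = estimate_table(P_A, P_B, observed_table)
print(latent_table, in_interior(latent_table))
```

Using the two small inverses separately avoids building the full 4 x 4 matrix for the cross-classified table, although both routes give the same estimate.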
4. Latent Class Model (LCM)
Standard LCM with 1 latent variable $X$ and 3
manifest variables $S$, $T$, $U$:
$$p^{STU}_{stu} = \sum_{x} p^{X}_{x}\, p^{S|X}_{s|x}\, p^{T|X}_{t|x}\, p^{U|X}_{u|x}$$
where, e.g.,
$$p^{STU}_{stu} = P[S = s, T = t, U = u]$$
$$p^{S|X}_{s|x} = P[S = s \mid X = x]$$
· Local independence
· Number of categories of X unknown
· Model might not be identified
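To make the role of local independence concrete, the sketch below computes the manifest probabilities of a standard LCM from its parameters. All parameter values are hypothetical, and the function name `lcm_manifest_probs` is introduced here; categories are indexed from 0 in the code.

```python
from itertools import product


def lcm_manifest_probs(p_X, p_S_X, p_T_X, p_U_X):
    """Return {(s, t, u): P[S = s, T = t, U = u]} under local independence.

    p_X[x] = P[X = x]; p_S_X[s][x] = P[S = s | X = x], likewise for T and U.
    """
    probs = {}
    cells = product(range(len(p_S_X)), range(len(p_T_X)), range(len(p_U_X)))
    for s, t, u in cells:
        probs[(s, t, u)] = sum(
            p_X[x] * p_S_X[s][x] * p_T_X[t][x] * p_U_X[u][x]
            for x in range(len(p_X))
        )
    return probs


# Hypothetical two-class model with three binary manifest variables.
p_X = [0.3, 0.7]
p_S_X = [[0.9, 0.2], [0.1, 0.8]]
p_T_X = [[0.8, 0.1], [0.2, 0.9]]
p_U_X = [[0.7, 0.3], [0.3, 0.7]]

manifest = lcm_manifest_probs(p_X, p_S_X, p_T_X, p_U_X)
print(manifest[(0, 0, 0)])
```

In a standard LCM these parameters are unknown and must be estimated from the manifest table; the sketch only shows the forward direction from parameters to cell probabilities.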
LCM for PRAM data and RR data
· Assume that latent variables $A$, $B$, and $C$ are
misclassified in observed $A^*$, $B^*$, and $C^*$,
respectively. LCM is given by
$$p^{A^*B^*C^*}_{a^*b^*c^*} = \sum_{abc} p^{ABC}_{abc}\, p^{A^*|A}_{a^*|a}\, p^{B^*|B}_{b^*|b}\, p^{C^*|C}_{c^*|c},$$
where the conditional probabilities are given and the
number of categories of each latent variable is
known.
· External variables can be included easily:
$$p^{A^*B^*C^*P}_{a^*b^*c^*p} = \sum_{abc} p^{ABCP}_{abcp}\, p^{A^*|A}_{a^*|a}\, p^{B^*|B}_{b^*|b}\, p^{C^*|C}_{c^*|c}.$$
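As a check on the notation, consider the special case of a single misclassified variable $A$ with no other variables. Under that assumption the LCM reduces to the mixture of Section 1, with the given conditional probabilities playing the role of the $p_{ij}$:

```latex
\[
  p^{A^*}_{a^*} \;=\; \sum_{a} p^{A}_{a}\, p^{A^*|A}_{a^*|a},
  \qquad p^{A^*|A}_{i|j} = p_{ij} = P(A^* = i \mid A = j).
\]
% Stacking the cells a^* = 1, ..., K into vectors gives the matrix form
\[
  p^{*} \;=\; P\, p ,
\]
% which is the analogue, for proportions, of E[T^* | t] = P t in Section 3.
```

Here $p$ and $p^*$ denote the vectors of latent and observed category probabilities; these two vector symbols are introduced only for this illustration.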