Using the Latent Class Model to Analyse
Misclassified Data
Ardo van den Hout
&
Peter G.M. van der Heijden
Department of Methodology and Statistics
Utrecht University, The Netherlands
Content:
1. Misclassification as a mixture
2. Examples where misclassification probabilities
are known:
Post Randomisation Method (PRAM)
Randomized Response (RR)
3. Frequency estimation
4. Latent class model
5. Loglinear analysis of RR data and PRAM data
6. Conclusion
1. Misclassification (MC) as a mixture
· Let $A$ be a binary variable and $A^*$ its observed
misclassified counterpart. Let the conditional MC
probabilities be given by $p_{ij} = P(A^* = i \mid A = j)$,
$i, j \in \{1,2\}$. Then
\[ P(A^* = 1) = p_{11} P(A = 1) + p_{12} P(A = 2). \]
· Mixture formulation for $A$ with categories $1,\dots,K$:
\[ P(A^* = i) = \sum_{k=1}^{K} P(A^* = i \mid A = k)\, P(A = k), \]
where $i \in \{1,\dots,K\}$.
When the $p_{ij}$ are known, this is a known component
density model (Lindsay, 1995).
· General research question: how to analyse
misclassified data when the $p_{ij}$ are known?
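The mixture formulation above can be sketched in a few lines of Python. This is only an illustration: the function name `observed_distribution` and all numbers are assumptions made for the example, not values from the slides. Indices are 0-based in the code, so category 1 is index 0.

```python
def observed_distribution(P, latent):
    """Return [P(A* = i)] given P[i][k] = P(A* = i | A = k) and latent[k] = P(A = k)."""
    K = len(latent)
    return [sum(P[i][k] * latent[k] for k in range(K)) for i in range(K)]


# Hypothetical misclassification probabilities; each column sums to 1.
P = [[0.8, 0.2],
     [0.2, 0.8]]
# Hypothetical latent distribution: P(A = 1) = 0.1, P(A = 2) = 0.9.
latent = [0.1, 0.9]

observed = observed_distribution(P, latent)
print(observed)  # distribution of the misclassified variable A*
```

With these numbers a latent prevalence of 0.1 is observed as roughly 0.26, which shows why the observed frequencies cannot be read as estimates of the latent ones.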
Post Randomisation Method (PRAM)
PRAM is a method for Statistical Disclosure Control
(SDC)
Objective of SDC:
Protecting the privacy of respondents when
data matrices are released to outside users.
[Figure: population (N units) → original sample (n records, variables A1 ... Ap) → released sample (n records, variables A*1 ... A*p).]
Identifying variables: variables that can be used
to identify a respondent.
Direct identifiers (name, address etc.)
are deleted from the microdata.
Indirect identifiers.
Example:
Variables: Place of Residence,
Gender,
Profession.
In the microdata an unsafe combination of scores:
Volendam × female × minister
· PRAM concerns the indirect identifiers in
unsafe combinations.
· PRAM is misclassification on purpose where
misclassification probabilities are fixed.
· The misclassification probabilities are released
together with the perturbed data matrix.
· Protection offered is uncertainty at individual
level.
References:
Warner (1971), Rosenberg (1979),
and Kooiman et al. (1997).
Randomized Response (RR)
Objective:
Protecting the privacy of respondents
when sensitive questions are asked.
[Figure: latent status (n records, variables A1 ... Ap) → misclassification performed by respondents → observed values (n records, variables A*1 ... A*p).]
In our research concerning social benefit fraud, RR
questions were binary and the misclassification was
performed using playing cards.
References: Warner (1965), Fox and Tracy (1986),
Chaudhuri and Mukerjee (1988), and Kuk (1990)
RR design by Kuk (1990)
· Two stacks of cards: the yes-stack and the no-stack.
· Each stack has a known distribution of red
cards and black cards.
· Respondent takes a card from each stack and
keeps the colours hidden.
· If the latent answer to the sensitive question is yes, the
respondent reveals the colour of the card he took
from the yes-stack; if the answer is no, the colour
of the card from the no-stack.
[Figure: the yes-stack, with 80% red cards (read as yes) and 20% black cards (read as no); the 20% is the misclassification.]
· Protection offered by RR is the same as by
PRAM: uncertainty at the individual level.
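Kuk's card design described above can be sketched as a small simulation. The slides give 80% red cards for the yes-stack; the share of red cards in the no-stack is not stated, so the value 0.2 below is an assumption, as are the function names. Red is read as "yes" and black as "no".

```python
import random


def kuk_transition_matrix(red_in_yes, red_in_no):
    """Rows: observed colour (red, black). Columns: latent answer (yes, no)."""
    return [[red_in_yes, red_in_no],
            [1 - red_in_yes, 1 - red_in_no]]


def kuk_response(latent_yes, red_in_yes, red_in_no, rng):
    """Draw one card from each stack and reveal only the one matching the latent answer."""
    yes_card = "red" if rng.random() < red_in_yes else "black"
    no_card = "red" if rng.random() < red_in_no else "black"
    return yes_card if latent_yes else no_card


RED_IN_YES = 0.8  # taken from the slide
RED_IN_NO = 0.2   # ASSUMPTION: not given in the slides

P = kuk_transition_matrix(RED_IN_YES, RED_IN_NO)

rng = random.Random(1)  # seeded only to make the sketch reproducible
answers = [kuk_response(True, RED_IN_YES, RED_IN_NO, rng) for _ in range(1000)]
print(answers.count("red") / len(answers))  # share of red among latent yes respondents
```

Because the interviewer sees only a colour, a single red card does not reveal whether the respondent's latent answer is yes, while the known stack compositions still fix the transition matrix.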
Other examples with known misclassification
probabilities:
· Known specificity and sensitivity in statistics
concerning medicine.
· Disclosure control in data mining. PRAM-like
methods to protect privacy of web-users.
3. Example: frequency estimation
Consider a binary variable $A$ and let the conditional
misclassification probabilities $p_{ij} = P(A^* = i \mid A = j)$
be given by the transition matrix
\[ P = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}. \]
Let
\[ t = \begin{pmatrix} t_1 \\ t_2 \end{pmatrix} = \begin{pmatrix} \#\,\text{latent scores with value 1} \\ \#\,\text{latent scores with value 2} \end{pmatrix} \]
and
\[ t^* = \begin{pmatrix} \#\,\text{observed scores with value 1 after MC} \\ \#\,\text{observed scores with value 2 after MC} \end{pmatrix}. \]
Then
\[ E[T_1^* \mid t] = p_{11} t_1 + p_{12} t_2, \]
\[ E[T_2^* \mid t] = p_{21} t_1 + p_{22} t_2. \]
The equality
\[ E[T^* \mid t] = P t \]
provides an unbiased moment estimator:
\[ \hat{t} = P^{-1} t^* \]
(Chaudhuri and Mukerjee, 1988; Kooiman et al., 1997).
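As a hedged illustration of the moment estimator, the sketch below inverts a 2 × 2 transition matrix and applies it to observed counts. The transition matrix and the counts are made-up numbers, and the helper names `invert_2x2` and `moment_estimate` are hypothetical.

```python
def invert_2x2(P):
    """Invert a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("transition matrix is singular")
    return [[d / det, -b / det],
            [-c / det, a / det]]


def moment_estimate(P, t_star):
    """Return t_hat = P^{-1} t*, the estimated latent frequencies."""
    P_inv = invert_2x2(P)
    return [sum(P_inv[i][j] * t_star[j] for j in range(2)) for i in range(2)]


# Hypothetical transition matrix (columns sum to 1) and observed counts.
P = [[0.8, 0.2],
     [0.2, 0.8]]
t_star = [260, 740]

t_hat = moment_estimate(P, t_star)
print(t_hat)  # approximately [100, 900]
```

With other observed counts, for example `[150, 850]` under the same matrix, the first component of the estimate becomes negative, so the moment estimator can fall outside the parameter space.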
· The estimation of a multi-dimensional table goes
in the same way. (MC of latent variable A is
independent of MC of latent variable B.)
· Assuming a multinomial distribution: when the
moment estimator is in the interior of the parameter
space, it is the MLE (Van den Hout and Van
der Heijden, 2002).
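One way to see why a multi-dimensional table is handled in the same way: if A and B are misclassified independently, each depending only on its own latent value, the transition matrix for the cross-classified table is the Kronecker product of the two separate matrices. The sketch below uses invented matrices and counts, and `kronecker` is a hypothetical helper. Cells are ordered (1,1), (1,2), (2,1), (2,2).

```python
def kronecker(PA, PB):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in PA for row_b in PB]


# Hypothetical transition matrices for A and B (columns sum to 1).
PA = [[0.8, 0.2],
      [0.2, 0.8]]
PB = [[0.9, 0.1],
      [0.1, 0.9]]

P_AB = kronecker(PA, PB)

# Hypothetical latent cell counts for the 2 x 2 table of A by B.
t = [40, 60, 10, 90]

# Expected observed cell counts: E[T* | t] = P_AB t.
expected = [sum(P_AB[r][c] * t[c] for c in range(4)) for r in range(4)]
print(expected)
```

The moment estimator for the table then follows by inverting `P_AB`, just as in the one-variable case.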
4. Latent Class Model (LCM)
Standard LCM with 1 latent variable $X$ and 3
manifest variables $S$, $T$, $U$:
\[ p^{STU}_{stu} = \sum_{x} p^{X}_{x}\, p^{S|X}_{s|x}\, p^{T|X}_{t|x}\, p^{U|X}_{u|x}, \]
where, e.g.,
\[ p^{STU}_{stu} = P[S = s, T = t, U = u], \qquad p^{S|X}_{s|x} = P[S = s \mid X = x]. \]
· Local independence
· Number of categories of X unknown
· Model might not be identified
LCM for PRAM data and RR data
· Assume that latent variables $A$, $B$, and $C$ are
misclassified into observed $A^*$, $B^*$, and $C^*$,
respectively. The LCM is given by
\[ p^{A^*B^*C^*}_{a^*b^*c^*} = \sum_{abc} p^{ABC}_{abc}\, p^{A^*|A}_{a^*|a}\, p^{B^*|B}_{b^*|b}\, p^{C^*|C}_{c^*|c}, \]
where the conditional probabilities are given and the
number of categories of each latent variable is
known.
· External variables can be included easily:
\[ p^{A^*B^*C^*P}_{a^*b^*c^*p} = \sum_{abc} p^{ABCP}_{abcp}\, p^{A^*|A}_{a^*|a}\, p^{B^*|B}_{b^*|b}\, p^{C^*|C}_{c^*|c}. \]
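The LCM for misclassified data can be illustrated by computing the observed joint distribution from a latent one. This is a minimal sketch under invented numbers: the latent distribution, the three transition matrices and the function name `observed_joint` are all assumptions for the example. Indices are 0-based, and each matrix is stored as `P[observed][latent]`.

```python
from itertools import product


def observed_joint(latent, PA, PB, PC):
    """Sum over latent cells (a, b, c) of latent prob times the three conditional MC probs."""
    obs = {}
    for a_s, b_s, c_s in product(range(len(PA)), range(len(PB)), range(len(PC))):
        obs[(a_s, b_s, c_s)] = sum(
            prob * PA[a_s][a] * PB[b_s][b] * PC[c_s][c]
            for (a, b, c), prob in latent.items()
        )
    return obs


# Hypothetical known misclassification probabilities, the same for A, B and C.
P = [[0.8, 0.2],
     [0.2, 0.8]]

# Hypothetical latent distribution: all mass on the first cell.
latent = {cell: 0.0 for cell in product(range(2), repeat=3)}
latent[(0, 0, 0)] = 1.0

obs = observed_joint(latent, P, P, P)
print(obs[(0, 0, 0)], obs[(1, 1, 1)])  # about 0.512 and 0.008
```

Unlike in the standard LCM, the conditional probabilities here are fixed inputs rather than parameters to be estimated, so only the latent distribution remains unknown.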