Challenges in data linkage: error and bias
Katie Harron
October 2014
UCL Institute of Child Health
The linkage problem
                 Match status
Link status      Match (pair from same individual)    Non-match (pair from different individuals)
Link             Identified match                     False match
Non-link         Missed match                         Identified non-match
Deterministic linkage in Hospital Episode Statistics (HES)
Step 1
– Sex
– Date of Birth
– NHS Number
Step 2
– Sex
– Date of Birth
– Postcode
– Local Patient Identifier within Provider
Step 3
– Sex
– Date of Birth
– Postcode
Few false matches, more missed matches
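The stepwise rule on this slide can be sketched in a few lines. This is only an illustration: the field names (`sex`, `dob`, `nhs_number`, `postcode`, `local_id`, `provider`) and the example records are hypothetical, and the real HES algorithm may apply further conditions that are not shown on the slide.

```python
# Hypothetical field names; each step lists the identifiers that must all agree.
STEPS = [
    ("sex", "dob", "nhs_number"),
    ("sex", "dob", "postcode", "local_id", "provider"),
    ("sex", "dob", "postcode"),
]

def agrees(a, b, fields):
    """True if every field is present (not None) in both records and equal."""
    return all(a.get(f) is not None and a.get(f) == b.get(f) for f in fields)

def deterministic_link(a, b):
    """Return the number (1-3) of the first step at which the pair links, else None."""
    for step_number, fields in enumerate(STEPS, start=1):
        if agrees(a, b, fields):
            return step_number
    return None

# Made-up records: the first has no NHS number recorded.
rec_a = {"sex": "F", "dob": "2010-05-01", "nhs_number": None,
         "postcode": "AB1 2CD", "local_id": "A123", "provider": "X01"}
rec_b = {"sex": "F", "dob": "2010-05-01", "nhs_number": "0000000000",
         "postcode": "AB1 2CD", "local_id": "A123", "provider": "X01"}

print(deterministic_link(rec_a, rec_b))  # 2: NHS number is missing, so step 1 fails
```

Because every identifier in a step must agree exactly, a single missing or mistyped value pushes a true pair to a later step or leaves it unlinked, which is where the missed matches come from.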
Quality of unique identifiers
166,406 records of admissions to paediatric intensive care (PICANet)
– 85,137 non-matches, of which 46 (0.1%) had the same NHS number
– 81,269 matches, of which 3,207 (4%) had different NHS numbers
Hagger-Johnson et al. Causes and consequences of data linkage errors: False and missed matches following linkage of hospital data (under review)
Deterministic linkage with pseudonymisation at source
Courtesy of Peter Jones, ONS Beyond 2011 programme
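One way pseudonymisation at source could work is sketched below. It is an assumption for illustration, not necessarily the approach used in the ONS programme: each data source applies the same keyed hash (HMAC-SHA256 from the Python standard library) to a normalised identifier before release, so records can be linked on the hash without sharing the identifier itself. The key and the example postcode are hypothetical.

```python
import hashlib
import hmac

def pseudonymise(value, key):
    """Keyed hash (HMAC-SHA256) of a normalised identifier."""
    normalised = "".join(value.split()).upper()  # strip whitespace, uppercase
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()

SHARED_KEY = b"hypothetical-secret-key"  # assumed to be agreed between data sources

# Each source hashes its own identifiers before release.
source_a = pseudonymise("ab1 2cd", SHARED_KEY)
source_b = pseudonymise("AB1 2CD", SHARED_KEY)

print(source_a == source_b)  # True: the same normalised value gives the same hash
```

Hashed values only agree when the underlying values agree exactly after normalisation, so linkage on pseudonymised identifiers is deterministic and a small recording error produces a completely different hash.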
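Probabilistic linkage
[Figure: a record in the primary file (Ronald Fisher) is compared with records in the linking file (Karl Pearson, Carl Gauss, Ronald Fisher), giving pair 1, pair 2 and pair 3 on a scale from low match weight to high match weight. The highest weight is retained.]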
Probabilistic linkage
[Figure: pair 3 on a scale from low match weight to high match weight; the highest weight is retained]
m-probability = P(γ=1 | M) = sensitivity: the probability of agreement given the records are from the same subject
u-probability = P(γ=1 | U) = 1 − specificity: the probability of agreement given the records are from different subjects
Log ratio for a single identifier:
$$ w = \begin{cases} \log_2(m/u) & \text{if identifiers agree} \\ \log_2[(1-m)/(1-u)] & \text{if identifiers disagree} \end{cases} $$
Match weight, summed over identifiers:
$$ W = \sum_i w_i $$
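The weight calculation can be illustrated with a short sketch. The m- and u-probabilities below are hypothetical values chosen only to show the arithmetic; in practice they would be estimated from the data.

```python
import math

# Hypothetical m- and u-probabilities, for illustration only.
PARAMS = {
    "nhs_number": {"m": 0.95, "u": 0.0001},
    "dob": {"m": 0.98, "u": 0.003},
    "sex": {"m": 0.99, "u": 0.5},
}

def field_weight(m, u, agree):
    """w = log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
    return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

def match_weight(agreement):
    """W = sum of w_i over identifiers; `agreement` maps identifier -> bool."""
    return sum(field_weight(PARAMS[f]["m"], PARAMS[f]["u"], a)
               for f, a in agreement.items())

all_agree = match_weight({"nhs_number": True, "dob": True, "sex": True})
dob_differs = match_weight({"nhs_number": True, "dob": False, "sex": True})
print(round(all_agree, 2), round(dob_differs, 2))
```

Agreement on a discriminating identifier such as NHS number (very small u) adds a large positive weight, whereas agreement on sex (u close to 0.5) adds very little. Disagreement on a reliably recorded identifier contributes a negative weight.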
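Probabilistic linkage
[Figure: distributions of matches and non-matches on a scale from low match weight to high match weight. Labels: agreement on NHS number; agreement on sex; disagreement on date of birth; agree on some ids; disagree on some ids.]
Sources of overlap between the distributions: chance (same date of birth), missing data, recording errors
[Figure: overlapping distributions of matches and non-matches on a scale from low match weight to high match weight, with missed matches, false matches and links marked.]
Two thresholds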
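A two-threshold rule is commonly applied as follows: pairs at or above the upper threshold are linked, pairs below the lower threshold are not, and pairs in between are treated as uncertain. The sketch below assumes that convention and uses hypothetical threshold values; the slide itself does not specify either.

```python
def classify(weight, lower, upper):
    """Classify a pair by match weight using two thresholds (lower <= upper)."""
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "uncertain"  # for example, sent for manual review

LOWER, UPPER = 5.0, 15.0  # hypothetical threshold values
weights = [22.4, 9.1, -3.7]
print([classify(w, LOWER, UPPER) for w in weights])
# ['link', 'uncertain', 'non-link']
```

Raising the upper threshold reduces false matches at the cost of more uncertain or missed pairs; lowering the lower threshold does the reverse.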
Evaluating linkage quality
• Small amounts of linkage error can result in substantially biased results
• The impact of linkage error on results is rarely reported
• Linkage error affects different types of analysis in different ways
Why it’s important to evaluate linkage error
Examples from the literature:
• Schmidlin et al (2013). Impact of unlinked deaths and coding changes on mortality trends in the Swiss National Cohort. BMC Med Inform Decis Mak 13(1):1
• Hobbs, G. & Vignoles, A. (2007). Is free school meal status a valid proxy for socio-economic status (in schools research)? Centre for the Economics of Education; London School of Economics and Political Science.
• Lariscy (2011). Differential Record Linkage by Hispanic Ethnicity and Age in Linked Mortality Studies: Implications for the Epidemiologic Paradox. J Aging Health 23(8):1263-84 [Figure labels: highly sensitive; highly specific]
• Ford et al (2006). Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatric and Perinatal Epidemiology 20(4):329-337
Evaluating linkage quality
i) Sensitivity analysis using different linkage criteria (highly sensitive vs highly specific)
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links
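Approaches i) and ii) can be combined in a small sketch: given a gold-standard subset where true match status is known, count the outcomes under a more sensitive and a more specific threshold. The pairs and thresholds are made up for illustration, and the false match rate is defined here as false matches divided by all links, which is one of several definitions in use.

```python
# Hypothetical gold-standard pairs: (match weight, true match status).
GOLD = [(25.0, True), (18.0, True), (12.0, True), (7.0, False),
        (6.0, True), (3.0, False), (-2.0, False), (-8.0, False)]

def evaluate(pairs, threshold):
    """Count outcomes when pairs with weight >= threshold are linked."""
    identified = sum(1 for w, m in pairs if w >= threshold and m)
    false_match = sum(1 for w, m in pairs if w >= threshold and not m)
    missed = sum(1 for w, m in pairs if w < threshold and m)
    links = identified + false_match
    return {
        "sensitivity": identified / (identified + missed),
        "false_match_rate": false_match / links if links else 0.0,
        "missed_matches": missed,
        "false_matches": false_match,
    }

for threshold in (5.0, 10.0):  # more sensitive vs more specific criteria
    print(threshold, evaluate(GOLD, threshold))
```

With these made-up pairs, the lower threshold finds every true match but admits one false match, while the higher threshold removes the false match and misses one true match. Repeating the analysis of interest under both sets of links shows how far the results depend on that choice.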
Imputation for linkage
Primary file    Linking file       Variable of interest    Value taken
Record 1        Exact match        high                    high
Record 2        Exact match        low                     low
Record 3        Exact match        high                    high
Record 4        Match weight=10    med                     med (prior-informed imputation)
                Match weight=1     low
                Match weight=5     low
Record n        Match weight=4     high                    high (prior-informed imputation)
                Match weight=4     high
                Match weight=3     high
Goldstein et al Stat Med 2012;31(28):3481-3493
Harron et al BMC Med Res Method 2014;14(1):36
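To give a feel for how match weights could inform a prior over candidate values, here is a deliberately simplified sketch. It is not the method of the cited papers, which describe a fuller imputation approach. It assumes the weights are log2 likelihood ratios, that the candidates are equally likely a priori, and that exactly one candidate is the true match. The example uses the candidate weights 10, 1 and 5 with values med, low and low.

```python
from collections import defaultdict

def value_prior(candidates):
    """Turn candidate (match weight, value) pairs into a distribution over values.

    Assumes weights are log2 likelihood ratios, candidates are equally likely
    a priori, and exactly one candidate is the true match.
    """
    likelihoods = [(2.0 ** w, v) for w, v in candidates]
    total = sum(lik for lik, _ in likelihoods)
    prior = defaultdict(float)
    for lik, v in likelihoods:
        prior[v] += lik / total  # candidates sharing a value pool their probability
    return dict(prior)

candidates = [(10, "med"), (1, "low"), (5, "low")]
print(value_prior(candidates))
```

Under these assumptions almost all of the prior mass falls on "med", because the candidate with weight 10 dominates. An imputation procedure would then draw values in proportion to such a distribution rather than simply accepting the highest-weighted candidate.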
Implications for data providers
Methods for evaluating linkage quality:
i) Sensitivity analysis using different thresholds
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links
What these require from data providers:
– Availability of all candidate records (linked and unlinked)
– Subset of data where true match status is known (gold-standard)
Harron et al 2012. Opening the black box of record linkage. J Epidemiol Commun H 66(12):1198
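Comparing linked and unlinked data (approach iii) only needs a link flag alongside each record's characteristics. The sketch below uses made-up records with a hypothetical grouping variable to show how linkage rates could be compared across groups.

```python
from collections import Counter

# Hypothetical records: (linked?, group), purely for illustration.
RECORDS = [(True, "A"), (True, "A"), (True, "B"), (True, "A"),
           (False, "B"), (False, "B"), (False, "A"), (True, "B")]

def linkage_rate_by_group(records):
    """Proportion of records linked within each group."""
    total = Counter(g for _, g in records)
    linked = Counter(g for is_linked, g in records if is_linked)
    return {g: linked[g] / total[g] for g in total}

print(linkage_rate_by_group(RECORDS))
# {'A': 0.75, 'B': 0.5}
```

A marked difference in linkage rates between groups suggests that missed matches are not random, so analyses restricted to linked records may be biased. This check is only possible when unlinked records are made available to the analyst.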
Summary
• Data linkage is a powerful tool for enhancing administrative data
• Linkage error has important effects on analyses
• Results vary according to choice of thresholds and methods
• Taking error into account is possible without releasing identifiable data
• Communication between linkers and data users is vital
Acknowledgements and funding
Harvey Goldstein, Ruth Gilbert, Gareth Hagger-Johnson and Angie Wade,
UCL Institute of Child Health
Berit Muller-Pebody,
Public Health England
Roger Parslow, Tom Fleming, Lee Norman and the PICANet team,
University of Leeds
This work was supported by funding from the National Institute for Health Research Health Technology Assessment
(NIHR HTA) programme (project number 08/13/47). The views and opinions expressed therein are those of the
authors and do not necessarily reflect those of the HTA programme, NIHR, NHS or the Department of Health. The
authors state no conflicts of interest.