Challenges in data linkage: error and bias

Katie Harron
October 2014
UCL Institute of Child Health
[email protected]

The linkage problem

                Match status
Link status     Match                          Non-match
                (pair from same individual)    (pair from different individuals)
Link            Identified match               False match
Non-link        Missed match                   Identified non-match

Deterministic linkage in Hospital Episode Statistics (HES)

1. Sex, Date of Birth, NHS Number
2. Sex, Date of Birth, Postcode, Local Patient Identifier within Provider
3. Sex, Date of Birth, Postcode

Few false matches
More missed matches

Quality of unique identifiers

166,406 records of admissions to paediatric intensive care (PICANet)
- 85,137 non-matches: 46 (0.1%) same NHS number
- 81,269 matches: 3,207 (4%) different NHS numbers

Hagger-Johnson et al. Causes and consequences of data linkage errors: False and missed matches following linkage of hospital data (under review)

Deterministic linkage with pseudonymisation at source

[Figure. Courtesy of Peter Jones, ONS Beyond 2011 programme]

Probabilistic linkage

[Figure: a record in the primary file (Ronald Fisher) is compared with three records in the linking file (Karl Pearson, Carl Gauss, Ronald Fisher), giving pair 1, pair 2 and pair 3. The pairs sit on a scale from low match weight to high match weight. Highest weight is retained.]

Probabilistic linkage

P(γ=1 | M) = m-probability = sensitivity
    the probability of agreement given the records are from the same subject
P(γ=1 | U) = u-probability = 1 - specificity
    the probability of agreement given the records are from different subjects

Log ratio = w = log2(m/u)              if identifiers agree
              = log2[(1-m)/(1-u)]      if identifiers disagree

Match weight = W = ∑ w_i

Probabilistic linkage

[Figure: distributions of match weight for matches and non-matches, on a scale from low match weight to high match weight. Labels: agreement on NHS number; agreement on sex; disagreement on date of birth; agree on some ids, disagree on some ids.]

[Figure: overlapping match-weight distributions for matches and non-matches. Labels: chance (same date of birth); missing data; recording errors; missed matches; false matches; links; two thresholds.]

Evaluating linkage quality

- Small amounts of linkage error can result in substantially biased results
- The impact of linkage error on results is rarely reported
- Linkage error affects different types of analysis in different ways

Why it’s important to evaluate linkage error

- Schmidlin et al (2013). Impact of unlinked deaths and coding changes on mortality trends in the Swiss National Cohort. BMC Med Inform Decis Mak 13(1):1
- Hobbs, G. & Vignoles, A. (2007). Is free school meal status a valid proxy for socio-economic status (in schools research)? Centre for the Economics of Education; London School of Economics and Political Science.
- Lariscy (2011). Differential Record Linkage by Hispanic Ethnicity and Age in Linked Mortality Studies: Implications for the Epidemiologic Paradox. J Aging Health 23(8):1263-84 [Figure labels: highly sensitive; highly specific]
- Ford et al (2006). Characteristics of unmatched maternal and baby records in linked birth records and hospital discharge data. Paediatric and Perinatal Epidemiology 20(4):329-337.
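
The match-weight calculation from the probabilistic linkage slides can be made concrete with a short Python sketch. It follows the formulas as given there (w = log2(m/u) on agreement, log2[(1-m)/(1-u)] on disagreement, W = ∑ w_i). Everything else is an assumption for illustration: the m- and u-probabilities, the function names, the threshold values, and the reading of "two thresholds" as link / uncertain / non-link regions.

```python
import math

# Hypothetical (m, u) probabilities per identifier; illustrative values only,
# not estimates from HES or PICANet.
M_U = {
    "nhs_number": (0.95, 0.0001),
    "date_of_birth": (0.97, 0.003),
    "sex": (0.99, 0.5),
}

def identifier_weight(m, u, agrees):
    """w = log2(m/u) if the identifier agrees, log2((1-m)/(1-u)) if it disagrees."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_weight(agreement):
    """W = sum of w_i; `agreement` maps identifier name -> True/False."""
    return sum(identifier_weight(*M_U[name], agrees)
               for name, agrees in agreement.items())

def classify(weight, lower, upper):
    """Assumed two-threshold rule: link at or above `upper`,
    non-link below `lower`, uncertain in between."""
    if weight >= upper:
        return "link"
    if weight < lower:
        return "non-link"
    return "uncertain"

# A candidate pair that disagrees on NHS number but agrees on the rest.
pair = {"nhs_number": False, "date_of_birth": True, "sex": True}
w = match_weight(pair)
print(round(w, 2), classify(w, lower=0.0, upper=10.0))
```

Agreement on a discriminating identifier such as NHS number (small u) adds a large positive weight, while agreement on sex (u near 0.5) adds very little, which matches the ordering of labels along the match-weight scale in the figures.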
Evaluating linkage quality

i) Sensitivity analysis using different linkage criteria
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links

[Figure labels: highly sensitive; highly specific]

Imputation for linkage

Primary file   Linking file       Variable of interest   Value assigned
Record 1       Exact match        high                   high
Record 2       Exact match        low                    low
Record 3       Exact match        high                   high
Record 4       Match weight=10    med                    med
               Match weight=1     low
               Match weight=5     low
Record n       Match weight=4     high                   high
               Match weight=4     high
               Match weight=3     high

For Record 4 and Record n the value is assigned by prior-informed imputation.

Goldstein et al Stat Med 2012;31(28):3481-3493
Harron et al BMC Med Res Method 2014;14(1):36

Implications for data providers

i) Sensitivity analysis using different thresholds
ii) Subset of gold-standard data to quantify linkage bias
iii) Comparisons of linked and unlinked data
iv) Imputation for uncertain links

Data required from providers:
- Availability of all candidate records (linked and unlinked)
- Subset of data where true match status is known (gold-standard)

Harron et al 2012. Opening the black box of record linkage. J Epidemiol Commun H 66(12):1198

Summary

- Data linkage is a powerful tool for enhancing administrative data
- Linkage error has important effects on analyses
- Results vary according to choice of thresholds and methods
- Taking error into account is possible without releasing identifiable data
- Communication between linkers and data users is vital

Acknowledgements and funding

Harvey Goldstein, Ruth Gilbert, Gareth Hagger-Johnson and Angie Wade, UCL Institute of Child Health
Berit Muller-Pebody, Public Health England
Roger Parslow, Tom Fleming, Lee Norman and the PICANet team, University of Leeds

This work was supported by funding from the National Institute for Health Research Health Technology Assessment (NIHR HTA) programme (project number 08/13/47).
The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the HTA programme, NIHR, NHS or the Department of Health. The authors state no conflicts of interest.
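
As a worked illustration of the threshold sensitivity analysis listed under "Evaluating linkage quality", the sketch below computes missed-match and false-match rates at different thresholds on a gold-standard subset. The data are invented, the function name is hypothetical, and the definitions used here (missed-match rate as the share of true matches left unlinked, false-match rate as the share of links that are not true matches) are one common choice rather than the definitions used in the talk.

```python
def linkage_errors(pairs, threshold):
    """Error rates for a single threshold.

    `pairs` is a list of (match_weight, is_true_match) tuples from a
    gold-standard subset; a pair is linked if its weight >= threshold.
    """
    identified = sum(1 for w, m in pairs if w >= threshold and m)
    false_matches = sum(1 for w, m in pairs if w >= threshold and not m)
    missed = sum(1 for w, m in pairs if w < threshold and m)
    links = identified + false_matches
    true_matches = identified + missed
    return {
        "threshold": threshold,
        "missed_match_rate": missed / true_matches if true_matches else 0.0,
        "false_match_rate": false_matches / links if links else 0.0,
    }

# Invented gold-standard pairs: (match weight, true match status).
gold = [(22.5, True), (12.0, True), (5.0, True),
        (6.5, False), (-3.0, False), (-15.0, False)]

# A low threshold is highly sensitive; a high threshold is highly specific.
for t in (0, 10):
    print(linkage_errors(gold, t))
```

Repeating the substantive analysis at each threshold, alongside these error rates, shows how far the results depend on the linkage criteria.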