How Much Does Automatic Text De-Identification Impact
Clinical Problems, Tests, and Treatments?
Stéphane M. Meystre, MD, PhD1,3, Óscar Ferrández, PhD4,
Brett R. South, MS1,2,3, Shuying Shen, MStat1,2,3, Matthew H. Samore, MD1,2,3
1 Department of Biomedical Informatics, 2 Department of Internal Medicine, University of Utah, 3 VA
Health Care System, Salt Lake City, UT, 4 Nuance Communications Inc., Burlington, MA
Abstract: Automatic clinical text de-identification can mistakenly treat clinical information, such as medical problems or
treatments, as PHI, causing this information to be lost. In this study, we analyzed the overlap between
the 2010 i2b2 NLP challenge concept annotations and the PHI annotations of our best-of-breed clinical text de-identification application. Overall, 0.81% of the i2b2 annotations overlapped exactly, and 1.78% partly overlapped.
Introduction: Clinical text de-identification (i.e., removal of all Protected Health Information (PHI)), as defined in the
HIPAA Safe Harbor legislation, allows clinical notes to be used for research without patient consent, a requirement often
difficult, if not impossible, to fulfill. We have developed and evaluated a best-of-breed automatic clinical text de-identification application for VHA clinical notes (aka “BoB”),1 and realized that the potential for impacting clinical data
was not negligible. We added a clinical eponyms disambiguation module in BoB, and started several experiments
focusing on the impact of de-identification on subsequent uses of clinical notes, the first of which is presented here.
Methods: For accessibility and annotations availability reasons, we chose to use the 2010 i2b2 NLP challenge corpus
and reference standard for this early study. The 2010 i2b2 challenge focused on the annotation of medical problems,
tests, and treatments, as well as on their local context assessment (e.g., “…denied chest pain”), and the extraction of
specific relations between these concepts.2 We then used “BoB”, our clinical text de-identification application, to
automatically annotate all PHI, as well as clinical eponyms, in this corpus, and then analyzed the overlap of these new
annotations with the 2010 i2b2 NLP challenge reference standard annotations.
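The overlap analysis described above can be illustrated with a minimal sketch. It assumes each annotation is reduced to a pair of character offsets and that every concept annotation is counted once, as "exact" if some PHI annotation has identical boundaries and as "partial" if it only intersects one. The `Span` type, the function names, and the mutually exclusive counting are hypothetical choices made for illustration; the source does not describe how BoB's annotations were actually compared.

```python
from typing import Iterable, NamedTuple


class Span(NamedTuple):
    """A text annotation as character offsets (assumed representation)."""
    start: int  # inclusive
    end: int    # exclusive


def classify_overlap(concept: Span, phi: Span) -> str:
    """Return 'exact', 'partial', or 'none' for one concept/PHI pair."""
    if concept == phi:
        return "exact"
    # Two half-open intervals intersect when each starts before the other ends.
    if concept.start < phi.end and phi.start < concept.end:
        return "partial"
    return "none"


def count_overlaps(concepts: Iterable[Span], phi_spans: Iterable[Span]) -> dict:
    """Count concept annotations that overlap any PHI annotation.

    Assumption: a concept with at least one exact match is counted as
    'exact' only; otherwise any intersection counts as 'partial'.
    """
    phi_list = list(phi_spans)
    counts = {"exact": 0, "partial": 0}
    for concept in concepts:
        kinds = {classify_overlap(concept, phi) for phi in phi_list}
        if "exact" in kinds:
            counts["exact"] += 1
        elif "partial" in kinds:
            counts["partial"] += 1
    return counts


if __name__ == "__main__":
    concepts = [Span(0, 7), Span(10, 20), Span(30, 35)]
    phi = [Span(0, 7), Span(15, 25)]
    print(count_overlaps(concepts, phi))
```

In this toy input, the first concept shares its boundaries with a PHI span, the second only intersects one, and the third is untouched.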
Results: The 2010 i2b2 NLP challenge corpus included a total of 47685 annotations; 849 partly overlapped with BoB’s
PHI annotations, and 386 exactly overlapped. BoB correctly reclassified 112 clinical eponyms (e.g., Parkinson,
Pfannenstiel, Holter, Foley, Whipple, Roux) to obtain the aforementioned counts. Overall, 0.81% of the i2b2
annotations overlapped exactly with BoB’s PHI annotations, and 1.78% partly overlapped. Most overlaps (76%) were
annotated by BoB as person names: treatment annotations (e.g., Colace, Lopressor, Senna, Hickman) accounted for
45% of all overlaps, problem annotations (e.g., E. Coli, Fournier, Addison) for 19%, and test
annotations (e.g., Apgars, Papanicolaou) for 12%.
                              Partial overlap                        Exact overlap
i2b2 categories  i2b2 annot.  PHI overlap #  Eponyms  Overlap [%]    PHI overlap #  Eponyms  Overlap [%]
Problem          19667        187            18       0.95           65             5        0.33
Test             13833        180            41       1.30           40             11       0.29
Treatment        14185        482            53       3.40           281            2        1.98
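The percentages in the table can be re-derived from the reported counts: each value is the number of overlapping annotations divided by the number of i2b2 annotations in that category. The short sketch below uses only the counts given in the table; the variable and function names are illustrative.

```python
# Counts taken from the table: (i2b2 annotations, partial overlaps, exact overlaps)
COUNTS = {
    "Problem": (19667, 187, 65),
    "Test": (13833, 180, 40),
    "Treatment": (14185, 482, 281),
}


def overlap_percent(overlaps: int, total: int) -> float:
    """Overlapping annotations as a percentage of all i2b2 annotations."""
    return round(100 * overlaps / total, 2)


def summarize(table: dict) -> dict:
    """Return {category: (partial %, exact %)} plus an 'Overall' row."""
    rows = {
        category: (overlap_percent(partial, total), overlap_percent(exact, total))
        for category, (total, partial, exact) in table.items()
    }
    grand_total = sum(total for total, _, _ in table.values())
    all_partial = sum(partial for _, partial, _ in table.values())
    all_exact = sum(exact for _, _, exact in table.values())
    rows["Overall"] = (
        overlap_percent(all_partial, grand_total),
        overlap_percent(all_exact, grand_total),
    )
    return rows


if __name__ == "__main__":
    for category, (partial_pct, exact_pct) in summarize(COUNTS).items():
        print(f"{category:<10} partial {partial_pct:.2f}%  exact {exact_pct:.2f}%")
```

The "Overall" row reproduces the 1.78% partial and 0.81% exact figures, with 849 and 386 overlaps out of 47685 annotations.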
Conclusion: This early study demonstrates that even an efficient text de-identification system like BoB can cause
clinical information to be mistakenly considered as PHI and hidden or removed. This overlap is small, but not negligible.
Another recent detailed study focused on the impact of text de-identification on the subsequent automatic extraction of
medication names, and found no significant impact,3 but medications represent only a small part of the clinical
information found in clinical notes, and a minority of the overlapping information we analyzed. Our plans are now to
focus on the analysis of the impact of de-identification on subsequent uses of VHA clinical notes.
Acknowledgments: Research supported by VA HSR HIR 08-374. Views expressed are those of authors and not
necessarily those of the Department of VA or affiliated institutions.
References
1. Ferrandez O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM. BoB, a best-of-breed automated text de-identification system for VHA clinical documents. JAMIA. 2012 Sep 4.
2. Uzuner O, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. JAMIA. 2011 Aug 16;18(5):552–6.
3. Deleger L, Molnar K, Savova G, Xia F, Lingren T, Li Q, et al. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. JAMIA. 2012 Aug 2.