2013 IEEE International Conference on Bioinformatics and Biomedicine
An Error Detecting and Tagging Framework for Reducing Data Entry Errors in Electronic Medical Records (EMR) System
Yuan Ling, Yuan An, Mengwen Liu, Xiaohua Hu
College of Computing and Informatics, Drexel University
Philadelphia, USA
{yl638, ya45, ml943, xh29}@drexel.edu
Abstract—We develop an error detecting and tagging framework for reducing data entry errors in Electronic Medical Records (EMR) systems. We propose a taxonomy of data errors with three levels: Incorrect Format and Missing errors, Out of Range errors, and Inconsistent errors. We aim to address the challenging problem of detecting erroneous input values that look statistically normal but are abnormal in a medical sense. Detecting such an error needs to take patient medical history and population data into consideration. In particular, we propose a probabilistic method based on the assumption that the input value for a field depends on the historical records of this field and is affected by other fields through dependency relationships. We evaluate our methods using data collected from an EMR system. The results show that the method is promising for automatic data entry error detection.
Electronic Medical Records (EMR) data errors have been classified as incomplete, inaccurate, or inconsistent [9]. More specifically, based on EMR visit notes, a previous study [10] classifies data entry errors into Inconsistent Information, Miscellaneous Errors, Missing Sections, Incomplete Information, and Incorrect Information. For our case, we conducted a comprehensive literature review and summarized the different levels of data errors in a more detailed manner. We propose three levels of data entry errors, as shown in Table I. Example errors are discussed in the following paragraphs.
TABLE I. LEVELS OF DATA ENTRY ERRORS

Level 1: Incorrect Format and Missing
    L1-1: Incorrect Format Data
    L1-2: Missing Data
Level 2: Out of Range
    L2-1: Out of normal range
    L2-2: Out of population trends
    L2-3: Out of personal trend
Level 3: Inconsistent
    L3-1: Personal Inconsistent
    L3-2: Population Inconsistent
Keywords—Data Entry Errors; Electronic Medical Records
(EMR); Error Detecting; Error Tagging
I. INTRODUCTION
Currently, large-scale data analysis is conducted in almost every area. Advanced analytical processes and data mining techniques have been used to discover valuable information and knowledge from large-scale data. However, most data mining techniques assume that the data have already been cleaned [1], which is not always true in real situations. Data preparation and data quality are critical to reliable data mining results, because data with errors compromise the credibility of those results. The data lifecycle of an application typically consists of the following steps: data collection, transformation, storage, auditing, cleaning, and analysis for decision making [2]. Data errors can be reduced in each step of the lifecycle. Since data collection is the first step, reducing data entry errors in the data collection stage is the earliest opportunity to ensure data quality.
L1-1 errors can be easily controlled with prior knowledge about the data format: constraints can be set for each field. For example, only numbers are allowed in a blood pressure field. L2-1 errors can also be easily controlled as long as the normal range of each field is known. For example, height values can be constrained to the range of 0 to 3 meters. Detecting L1-2 errors, however, is a difficult research task, but we can learn to detect missing data according to clinical guidelines and caregiving practices. Once missing values are detected, we can set constraints on the particular field as we did for L1-1 and L2-1. For L2-2 errors, using statistics to detect erroneous values for a single attribute is well studied, and robust results can be obtained. For example, L2-2 errors can be detected using a univariate outlier detection technique called Hampel X84 [11]. In this paper, we do not focus on solving the L1-1 to L2-2 error problems.
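As an illustration, the simple constraint checks for L1-1 (format) and L2-1 (range) errors described above can be sketched as follows; the field names and range bounds here are illustrative assumptions, not values from our system:

```python
def check_format(field, raw):
    """L1-1 check: reject values that are not numeric for a numeric field."""
    try:
        float(raw)
        return True
    except ValueError:
        return False

# L2-1 check: assumed normal ranges per field (hypothetical bounds)
NORMAL_RANGES = {
    "height_m": (0.0, 3.0),     # e.g., height constrained to 0-3 meters
    "bp_s":     (50.0, 250.0),  # hypothetical systolic blood pressure bounds
}

def check_range(field, value):
    lo, hi = NORMAL_RANGES[field]
    return lo <= value <= hi

def validate(field, raw):
    """Run the L1-1 check first, then the L2-1 check."""
    if not check_format(field, raw):
        return "L1-1: incorrect format"
    if not check_range(field, float(raw)):
        return "L2-1: out of normal range"
    return "OK"
```

For instance, `validate("height_m", "abc")` is rejected as an L1-1 error, while a numeric but implausible height is rejected as L2-1.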
Error and outlier detection have been studied in various application domains, such as credit card fraud detection [3], network intrusion detection [4], and data cleaning in data warehouses [5]. In general, errors can be classified into three types: point errors, contextual errors, and collective errors [6]. In the healthcare domain, data errors in electronic medical databases have also been studied recently [7, 8].

978-1-4799-1310-7/13/$31.00 ©2013 IEEE

The challenge lies in detecting values that look statistically normal but are abnormal in the medical sense, such as L2-3, L3-1, and L3-2 errors. For example, suppose the weight value of a patient increases from 180 lbs to 230 lbs in three months. The weight value of 230 lbs is within the normal population distribution range, but for this particular patient it is suspicious if he/she has a healthy lifestyle and no major medical problems. This is categorized as an L2-3 error. The following illustrates other examples. A record showing that a male patient has the status of pregnancy would be considered an L3-1 error. An obese and diabetic patient with a BMI (Body Mass Index) larger than 35 is likely to have cardiovascular disease manifested as hypertension and dyslipidemia. If such a patient's blood pressure values are significantly low, but still in the normal range, the blood pressure values would be inconsistent with the BMI value from the population perspective. This signals an L3-2 type of error.
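An L3-2 check of this kind can be sketched as a lookup in a conditional probability table over related fields; the tiny table and the 0.05 cutoff below are hypothetical numbers for illustration, not estimates from our data:

```python
# Hypothetical P(BP category | BMI category) conditional probability table.
CPT = {
    ("bmi>35", "bp_low"):    0.03,
    ("bmi>35", "bp_normal"): 0.27,
    ("bmi>35", "bp_high"):   0.70,
}

def consistent(bmi_cat, bp_cat, cutoff=0.05):
    """Flag a combination as population-inconsistent (L3-2) when its
    conditional probability falls below the cutoff."""
    return CPT[(bmi_cat, bp_cat)] >= cutoff
```

Here `consistent("bmi>35", "bp_low")` returns False: the value is in the normal range on its own, but the combination is improbable in the population.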
The rest of the paper is organized as follows: Section II
describes related work. Section III presents our error detecting
and tagging framework. Section IV introduces our probabilistic
model and error detecting method. Section V presents our
evaluation process and experimental results. Section VI
presents our conclusions and future research directions.
In this paper, we propose an error detecting and tagging
mechanism in real time to address these challenges based on
historical data and dependency relationships. The entire
framework consists of three main components:
1) Find a similar patient group. Error detection is based on historical datasets. For a single patient, the historical dataset should be composed of records from patients who have equivalent medical situations, clinical conditions, and basic demographic information. Because it is difficult to find a group of patients with exactly the same medical conditions, we detect errors based on a dataset from patients with similar conditions.
2) Learn a probabilistic model. Based on the dataset from a group of similar patients, we learn the parameters of our proposed model. The probabilistic model is used to calculate the probability of the input value for each field.
3) Return a tag. Error tagging is based on the results of the probabilistic model. If the probability of an input value for a field is relatively high, the value has a higher chance of being correct and a "Normal" tag is returned. If the probability is relatively low, the tag is "Suspicious". We use a threshold to decide which tag should be returned.
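The tag-returning step can be sketched as follows. We interpret the 80/20 rule described later in the paper as tagging a value "Normal" when its probability falls among the top 20% of the probabilities over all possible values of the field; this reading, and the variable names, are our assumptions:

```python
def tag(value, probs):
    """Return "Normal" or "Suspicious" for an entered value.

    probs: dict mapping each possible value of the field to its
    estimated probability under the learned model.
    """
    ranked = sorted(probs.values(), reverse=True)
    cutoff_idx = max(1, round(0.2 * len(ranked)))  # size of the top-20% bucket
    theta = ranked[cutoff_idx - 1]                 # dynamic threshold
    return "Normal" if probs[value] >= theta else "Suspicious"
```

For example, with five possible values the threshold is the single highest probability, so only the most likely value is tagged "Normal".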
Our experimental results show that our error detecting and tagging method is promising. The predictive accuracy of the probabilistic model is between 70% and 85% on our experimental data (explained in the Evaluation section). We also compare our method with existing methods; the results indicate that our method performs better.
II. RELATED WORK

Statistical methods have been used to analyze and detect data errors at the data cleaning stage [12-14]. However, methods are also needed for reducing data entry errors at the data collection stage [15].
There are different ways to detect errors and control data quality, including classification-based methods such as neural networks, Bayesian networks [16], and support vector machines, as well as rule-based methods, unsupervised clustering, and instance-based methods. The Usher system was developed to detect errors in form entry fields using a Bayesian network and a graphical model with explicit error modeling [17]. The system includes a probabilistic error model, based on a Bayesian network, for estimating the contextualized error likelihood for each field on a form. The method uses a Bayesian network to learn dependency relationships among fields, but it does not take the historical records of a particular field into consideration. In this paper, we detect errors based on dependency relationships among fields as well as historical data for a particular field. In addition, we classify patients based on basic information to cluster them into similar groups before we apply error detection. As our method is probabilistic in nature, selecting the right dataset for estimating the probabilities is critical. For example, it is unreasonable to predict the input value for a person whose age is 50 using a dataset of people whose ages are around 10.
Using a tagging mechanism has been shown to decrease certain types of errors [18]. Data entry interfaces can be designed to help improve data quality. We propose a system that classifies the input value of each field as "Suspicious" or "Normal" and provides this information as feedback to the user. A "Suspicious" tag reminds the user to check the input value, and consequently helps reduce data entry errors.
In summary, the contributions of our paper are:
1) We summarize different levels of data entry errors
based on a comprehensive literature review. Our method
focuses on detecting and tagging errors that cannot be detected
by simple statistical methods.
2) We design a probabilistic model based on two assumptions: first, the current patient's input value for a field depends on his/her historical records for this field; and second, the input value for a field depends on the values of other fields. We combine the influences from these two aspects in our model.
3) We develop an error detecting and tagging framework with five stages: selecting a classification model, finding a similar patient group, building a probabilistic model, setting a threshold, and returning tags.
4) We test our method on a dataset from an EMR system. Our experimental results show that the predictive accuracy of our probabilistic model is better than that of a Bayesian network alone. Our model also generates better results than a method that assumes the input value for a field depends solely on historical records.
III. ERROR DETECTING AND TAGGING FRAMEWORK
A. Error Detecting and Tagging Framework
Our error detecting and tagging framework is presented in
Fig. 1. A data entry interface contains multiple fields. After a
user inputs a value for a field, a tag (“Normal” or
“Suspicious”) will be returned.
Fig. 1. Error Detecting and Tagging Framework

B. Error Detecting Process

Based on our error detecting and tagging framework, we propose the error detecting process shown in Fig. 3. A patient is modeled as a set of data fields describing the patient's demographic information and clinical conditions. A form is used to collect a set of specific types of information during an encounter, for example, blood pressure, weight, etc. For our method, we divide the entire patient information into two categories: independent fields, which are considered fixed during the error detection process, and dependent fields, which will be collected and flagged as "Normal" or "Suspicious".

Fig. 3. Error Detecting Process

As shown in Fig. 3, our error detecting process consists of the following five steps. First, patient records are categorized into groups based on information about the basic independent fields from the current patient and other historical records, using a classification model. Second, we find a similar patient group for the current patient. Third, we build a probabilistic model based on clinical fields and historical records from the group of similar patients. Fourth, we set a threshold for the probability of the input value. Finally, based on the results from the probabilistic model and the threshold setting, we return a "Normal" or "Suspicious" tag to the current user.

Fig. 2. Historical Records for the Error Detecting Task

Here we use a simple example to illustrate the historical data format and the error detecting task. Fig. 2 shows a set of records for 3 different patients. There are five fields in each record: Date (Observation Date), BP_S (Systolic Blood Pressure value), BP_D (Diastolic Blood Pressure value), BMI, and Weight. Our error detecting task is to tell whether the patient with ID = 3 has a normal or suspicious input value for BP_S, BP_D, BMI, and Weight at the observation date 1/13/2010. The feedback is generated based on the former 8 records from the patient with ID = 3, and also the other 11 records from the patients with ID = 1 and ID = 2, assuming these three patients have similar conditions. Based on the data, we will learn dependency relationships among BP_S, BP_D, BMI, and Weight. These relationships are captured in a probabilistic graphical model.

IV. DATA ENTRY ERROR DETECTING METHOD

A. Data Entry Error Detecting Method

Given a data entry interface I, let F = {Fv, …, Fj, …, Fk} be the set of input fields on the interface I. For example, in Fig. 4, the field for entering the "Weight" value can be represented as FWeight.

Fig. 4. A Data Entry Form

Fig. 5. An example of relationships among fields

For a single field Fv, we can use a sequence of values v = {v1, …, vi−1, vi} to represent all historical records for a patient, where vi is the current value entered by a user for the field Fv, {v1, …, vi−1} is the historical record sequence for the field Fv, and vi−1 is the value entered for the field Fv at the (i−1)th time.

We assume that a patient's current input value for a field depends on the historical input records of this field. Abnormal changes of input values compared with historical records need attention. In probability theory, such a sequentially dependent process can be described by a Markov model. According to the first-order Markov assumption, the current value for the field Fv at the ith time depends only on the value for the same field at the (i−1)th time. Therefore, for a given input value vi, we can use the value vi−1 to predict the probability of the value vi for the field Fv at the current time. The probability can be represented as qm(vi | vi−1). Similarly, we can have a probability according to the second-order Markov assumption, qm(vi | vi−1, vi−2). For a new patient, however, there are no historical records for the field Fv, so we assume that the current value vi for the field Fv at the ith time does not depend on previous historical data from the patient. The predicted probability is then represented as qm(vi).

Our goal is to predict the probability of an input value vi for the field Fv, which can be represented as qM(vi). Using the first-order Markov assumption, the second-order Markov assumption, and a prior probability, we can calculate qM(vi) as in Equation (1):

qM(vi | all_conditions) = α3 × qm(vi | vi−1, vi−2, all_conditions)
                        + α2 × qm(vi | vi−1, all_conditions)
                        + α1 × qm(vi | all_conditions)    (1)

Here "all_conditions" means that the probability is calculated based on a dataset only from patients that have equal demographic and clinical conditions. In the following equations, we omit "all_conditions" from each term, assuming that it has been taken into consideration. The three parameters α1, α2, and α3 are used to balance the three probabilities [24], with α1 + α2 + α3 = 1. Given a set of records, we can estimate each probability according to Equation (2), Equation (3), and Equation (4), respectively:

qm(vi | vi−1, vi−2) = N(vi−2, vi−1, vi) / N(vi−2, vi−1)    (2)

qm(vi | vi−1) = N(vi−1, vi) / N(vi−1)    (3)

qm(vi) = N(vi) / N(all)    (4)
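The count-based estimates in Equations (2)-(4) and the interpolation in Equation (1) can be sketched as follows, over one field's historical value sequences from similar patients. The function and variable names are ours, and discretized field values are assumed:

```python
from collections import Counter

def build_counts(sequences):
    """Count unigrams, bigrams, and trigrams over historical value sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    total = 0
    for seq in sequences:
        for i, v in enumerate(seq):
            total += 1
            uni[v] += 1
            if i >= 1:
                bi[(seq[i - 1], v)] += 1
            if i >= 2:
                tri[(seq[i - 2], seq[i - 1], v)] += 1
    return uni, bi, tri, total

def q_M(v, prev1, prev2, counts, alphas=(1/3, 1/3, 1/3)):
    """Interpolated probability of entering value v, given the two
    previous values for the same field (Equation (1))."""
    uni, bi, tri, total = counts
    a1, a2, a3 = alphas
    q1 = uni[v] / total if total else 0.0                      # Eq. (4)
    q2 = bi[(prev1, v)] / uni[prev1] if uni[prev1] else 0.0    # Eq. (3)
    q3 = (tri[(prev2, prev1, v)] / bi[(prev2, prev1)]
          if bi[(prev2, prev1)] else 0.0)                      # Eq. (2)
    return a1 * q1 + a2 * q2 + a3 * q3                         # Eq. (1)
```

For a new patient with no history, the bigram and trigram terms fall back to zero and only the prior term contributes, matching the qm(vi) case described above.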
For Equation (2) and Equation (3), N(vi−2, vi−1) is the number of occurrences of the input value vi−2 for field Fv at the (i−2)th time followed by the input value vi−1 at the (i−1)th time for a patient. Similarly, N(vi−2, vi−1, vi) is the number of occurrences of the sequence {vi−2, vi−1, vi} at the (i−2)th, (i−1)th, and ith times for a patient.

For Equation (4), N(all) is the total number of values for the field Fv in the historical records of all similar patients. Similar patients are a group of patients who are similar to the current patient; the similarity is measured in terms of their demographic and clinical conditions, such as age, gender, disease conditions, medications, treatments, social activities, self-management, etc. N(vi) is the number of occurrences of vi for the field Fv in the historical records of all similar patients.

Equation (1) only uses the historical records of a single field. However, there are dependency relationships among the different fields on a data entry interface I. The relationships among fields can be obtained from prior expert knowledge or learned from historical data [17]. The relationships can be learned from historical records with a Bayesian network using the Max-Min Hill-Climbing (MMHC) method [19]. We implemented the method using the R packages bnlearn [20] and gRain [21]. An example of learned relationships among fields is shown in Fig. 5. A Conditional Probability Table (CPT) is also learned for the Bayesian network.

For the example shown in Fig. 5, the value vi of Fv depends on the values of Fw and Fu, so we can represent the probability as qB(vi | wi, ui). By adding the probability qB(vi | wi, ui) to Equation (1), we get Equation (5):

q(vi) = β × qM(vi) + γ × qB(vi | wi, ui)    (5)

where the two parameters β and γ satisfy β + γ = 1. We already have the value of qM(vi) from Equation (1), and we can get the value of qB(vi | wi, ui) from the CPT of the learned Bayesian network in Fig. 5. By plugging in these values, we compute the final probability q(vi | all_conditions).

B. Error Tagging Method

We assume that for each field Fv the set of possible values is finite and discrete. If the values of a field are continuous, we apply appropriate methods to discretize them. Let V = {vi', vi'', …} be the set of possible values for a field Fv. The current input value vi at the ith time is a member of this set. For each possible input value in the set V, we can calculate its probability based on Equation (5). We then get a tag for the current input value vi of the field Fv as in Equation (6):

tag(vi) = Normal, if q(vi) ≥ θ; Suspicious, if q(vi) < θ    (6)

For the threshold value θ in Equation (6), several methods can be applied, for example, the 80/20 rule, a Linear Utility Function [22], and Maximum Likelihood Estimation [23]. In this paper, we use the 80/20 rule to dynamically decide the threshold value. In particular, a probability located among the 20% highest values means the corresponding value is "Normal"; otherwise it is "Suspicious".

V. EVALUATION

The previous sections described the theoretical derivation of the model and the computational steps. In this section, we present an experiment using a limited dataset for evaluating the performance of the model. We aimed to gain some firsthand practical confidence through a pilot study. The limitation of the data is two-fold. First, we only obtained data for a small number of people, about 1448. Second, we consider a patient model with a small number of fields, including age, sex, BP, weight, and BMI. The experiment is by no means thorough and complete, because a real model for describing a patient is far more complicated than just several fields. The purpose of the experiment is to demonstrate the effectiveness of the probabilistic model for error detection, taking historical data and inter-field dependency relationships into consideration.

We obtained the data from a health clinic affiliated with Drexel University in Philadelphia, PA, USA. All the data are de-identified for HIPAA (Health Insurance Portability and Accountability Act) compliance. We only obtained records with six fields, without any additional clinical information.

A. Data Sets

We selected six fields from the databases: Age, Sex, Diastolic Blood Pressure (BP_D), Systolic Blood Pressure (BP_S), Weight, and Body Mass Index (BMI). We extracted 1448 patients with 12199 records over these six fields from the EMR system. Because our dataset is limited and the patients' information is insufficient, we cannot guarantee that all patients have equal clinical conditions and treatments in this small-scale preliminary study. Only Age and Sex are used as basic fields in the classification process to find a similar patient group for the current patient. Based on these two basic attributes, patients are divided into 14 categories; the number of patients in each category is shown in Table II. The average number of historical records per patient is 8.4. We examine our data entry error detecting method for each group.
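The classification step that buckets patients into the age-by-sex groups of Table II (seven 10-year age bins from 20 to 90, times two sexes) can be sketched as follows; the group numbering convention is our assumption:

```python
def group_id(age, sex):
    """Map (age, sex) to one of the 14 groups of Table II; sex is 'M' or 'F'."""
    if not (20 <= age < 90):
        raise ValueError("age outside the studied [20, 90) range")
    age_bin = (int(age) - 20) // 10        # 0..6 for [20,30) ... [80,90)
    sex_offset = 0 if sex == "M" else 1    # M first, F second, as in Table II
    return 2 * age_bin + sex_offset + 1    # groups numbered 1..14
```

For example, a 55-year-old female patient falls into group 8, the largest group in our data.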
TABLE II. 14 KINDS OF SIMILAR PATIENTS

Group   Age       Sex   Patients/Records
1       [20, 30)  M     30/153
2       [20, 30)  F     63/567
3       [30, 40)  M     68/328
4       [30, 40)  F     132/973
5       [40, 50)  M     122/643
6       [40, 50)  F     188/1967
7       [50, 60)  M     191/1513
8       [50, 60)  F     270/2552
9       [60, 70)  M     103/850
10      [60, 70)  F     181/1592
11      [70, 80)  M     20/195
12      [70, 80)  F     55/658
13      [80, 90)  M     4/36
14      [80, 90)  F     21/172

a. "M" stands for "Male", and "F" stands for "Female"

BP_D, BP_S, Weight, and BMI are transformed to discrete values and used as clinical fields to build a Bayesian network that captures the inter-field dependency relationships among these clinical fields. In our experiment, we built a unified Bayesian network using all 12199 records. The Max-Min Hill-Climbing (MMHC) method is used to generate the Bayesian network structure, as shown in Fig. 6.

Fig. 6. Learned Bayesian Network

The variable "BP_D" depends on the variable "BP_S". The joint probability between "BP_D" and "BP_S" can be calculated from the CPT (Conditional Probability Table) of the Bayesian network.

B. Experimental Results

Based on our dataset, we used our method to predict the value of the variable "BP_D". The predictive power of our method is measured in individual 3-fold cross-validation experiments for each group. The average predictive accuracy and the number of records for each group are shown in Fig. 7. The accuracy is calculated as the number of correctly predicted instances divided by the total number of instances. The parameters here are set to α1 = α2 = α3 = 1/3, β = 3/4, and γ = 1/4.

Fig. 7. Predictive Accuracy and Number of Records of 14 Groups

The average predictive accuracy over all groups is 71%. The predictive power is stable across the groups, ranging from 70% to 85%, except for groups 1, 13, and 14. Their numbers of records are lower than 200, so their predictive accuracies are relatively lower.

We also implemented Usher's error model [17] on our dataset. Usher induces the probability of making an error using an error model. The comparative results of Usher and our method are shown in Fig. 8. Usher's average accuracy is 63.2%, while our method's is 71%. Except for groups 1 and 13, in which the numbers of records are too small, our method performs better than Usher.

Fig. 8. Comparative Performance of Usher with Our Method

Since the number of records and the accuracy of group 8 are relatively high, we test different sets of parameters on the dataset of group 8. The 9 different parameter sets and the corresponding accuracy results are shown in Table III and Fig. 9.

TABLE III. ACCURACY RESULTS FOR DIFFERENT PARAMETER SETS

Set   α1    α2    α3    β     γ     Accuracy
1     0     0     1     1     0     0.736813
2     1/3   1/3   1/3   1     0     0.748621
3     0     0     0     0     1     0.757969
4     1/3   1/3   1/3   0.75  0.25  0.754706
5     1/3   1/3   1/3   0.5   0.5   0.75221
6     0.25  0.25  0.5   0.5   0.5   0.780027
7     0.2   0.2   0.6   0.5   0.5   0.783213
8     0.1   0.1   0.8   0.6   0.4   0.782445
9     0     0     1     0.5   0.5   0.791291

As shown in Table III, in Set 1 and Set 2, β = 1 and γ = 0, which means that the prediction depends only on the past historical values of the current single field. In Set 3, β = 0 and γ = 1, which means that the prediction depends only on the other fields, not on the historical values of the current field; that is, the result of Set 3 is the result of using only the Bayesian network. The accuracy results of Sets 1, 2, and 3 are relatively low. In summary, the accuracy obtained by methods depending purely on the past historical values of the current single field, or purely on the dependency relationships among multiple fields learned by the Bayesian network, is not as good as the accuracy obtained by combining these two aspects.

In Fig. 9, Sets 6, 7, 8, and 9 have relatively higher accuracy. In these four sets, the parameters β and α3 have higher weights. This means that a prediction depending on both the former two stages of historical values of the current single field and the dependency relationships among multiple fields improves the accuracy. Set 9 has the highest accuracy, but in practice it is not recommended, because for some small datasets α3 might equal 0. The recommended parameter sets are Sets 6, 7, and 8.

Fig. 9. Accuracy Comparison for Different Parameter Sets

We also conducted an experiment to compare the performance of our model to that of Hampel X84 [11]. Based on our method, the "Normal" range for a patient's "BP_D" value can dynamically narrow down to [70, 90], as shown in Fig. 10 a), while Hampel X84 gives "Normal" to a broader range, [45.3, 114.7], for the whole population, as shown in Fig. 10 b). The narrower range indicates that our method is more sensitive in error detection.
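For reference, the Hampel X84 baseline flags values farther than a fixed multiple of the scaled median absolute deviation (MAD) from the median. The sketch below uses the conventional multiplier k = 3 as an illustrative assumption; we make no claim that it reproduces the exact [45.3, 114.7] interval reported above:

```python
import statistics

def hampel_x84_bounds(values, k=3):
    """Return the [lo, hi] interval outside which a value is flagged.

    The MAD is scaled by 1.4826 so that it estimates the standard
    deviation for normally distributed data.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    half_width = k * 1.4826 * mad
    return med - half_width, med + half_width

def is_normal(x, values, k=3):
    lo, hi = hampel_x84_bounds(values, k)
    return lo <= x <= hi
```

Because median and MAD are robust statistics, a single gross entry error barely moves the bounds, which is why Hampel X84 is a common univariate baseline for L2-2 detection.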
Fig. 10. Our method compared with Hampel X84: a) Our method; b) Hampel X84
VI. DISCUSSION AND FUTURE WORK
This paper proposes a five-step error detecting and tagging framework for reducing data entry errors that pose significant challenges to existing statistical methods. Our probabilistic model for error detecting is based on the assumption that input values for a field can be predicted using historical data and inter-field dependency relationships.
We conducted pilot experiments to evaluate our method. The results show that our probabilistic model has better accuracy than a Bayesian network; it is also better than a method based on the Markov assumption that the input value for a field depends only on the historical records of that single field. We also compared our method with Usher's error model, and the results show that our method performs better. Moreover, our method gives more precise results than the Hampel X84 method.
Our future research work will focus on the following aspects:

1) Currently, our tagging method can only return two kinds of tags, "Normal" and "Suspicious". We would like to extend our model to detect the multiple error types shown in Table I.

2) Our error detecting method only focuses on discrete input values. We plan to improve our method to deal with continuous input values.

3) We plan to conduct a more thorough experimental evaluation in the future by collecting more patient data with more attributes.

ACKNOWLEDGMENT

This work is supported in part by a Drexel Jumpstart grant on Health Informatics and by NSF grant IIP 1160960 for the Center for Visual and Decision Informatics (CVDI).

REFERENCES

Chandola, V., A. Banerjee, and V. Kumar, Anomaly detection: a survey. ACM Computing Surveys, 2009. 41(3).
Arts, D.G., N.F. De Keizer, and G.-J. Scheffer, Defining and improving data quality in medical registries: a literature review, case study, and generic framework. Journal of the American Medical Informatics Association, 2002. 9(6): p. 600-611.
Beretta, L., et al., Improving the quality of data entry in a low-budget head injury database. Acta Neurochirurgica, 2007. 149(9): p. 903-909.
Phillips, W. and Y. Gong, Developing a nomenclature for EMR errors, in Human-Computer Interaction. Interacting in Various Application Domains. 2009, Springer. p. 587-596.
Khare, R., Y. An, S. Wolf, P. Nyirjesy, L. Liu, and E. Chou, Understanding the EMR error control practices among gynecologic physicians. iConference 2013 Proceedings, p. 289-301.
Hellerstein, J.M., Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.
Nahm, M.L., C.F. Pieper, and M.M. Cunningham, Quantifying data quality for clinical trials using electronic data capture. PLoS One, 2008. 3(8).
Goldberg, S.I., A. Niemierko, and A. Turchin, Analysis of data errors in clinical research databases. AMIA Annu Symp Proc, 2008: p. 242-6.
Jacobs, B., Electronic medical record, error detection, and error reduction: a pediatric critical care perspective. Pediatr Crit Care Med, 2007. 8(2 Suppl): p. S17-20.
Dasu, T. and T. Johnson, Exploratory data mining and data cleaning. Vol. 479. 2003: Wiley-Interscience.
Batini, C. and M. Scannapieca, Data quality. 2006: Springer.
Chan, P.K., et al., Distributed data mining in credit card fraud detection. IEEE Intelligent Systems and their Applications, 1999. 14(6): p. 67-74.
Lazarevic, A., et al., A comparative study of anomaly detection schemes in network intrusion detection. Proc. SIAM, 2003.
Rahm, E. and H.H. Do, Data cleaning: problems and current approaches. IEEE Data Eng. Bull., 2000. 23(4): p. 3-13.
[15] Goldberg, S.I., et al., A weighty problem: identification, characteristics and risk factors for errors in EMR data. AMIA Annu Symp Proc, 2010. 2010: p. 251-5.
[16] Wong, W., et al., Bayesian network anomaly pattern detection for disease outbreaks. In Machine Learning: International Workshop then Conference, 2003.
[17] Chen, K., et al., Usher: improving data quality with dynamic forms. IEEE Transactions on Knowledge and Data Engineering, 2011. 23(8): p. 1138-1153.
[18] Chen, K., J.M. Hellerstein, and T.S. Parikh, Designing adaptive feedback for improving data entry accuracy. In Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010. ACM.
[19] Tsamardinos, I., L.E. Brown, and C.F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 2006. 65(1): p. 31-78.
[20] Scutari, M., Learning Bayesian networks with the bnlearn R package. arXiv preprint arXiv:0908.3817, 2009.
[21] Højsgaard, S., Graphical independence networks with the gRain package for R. Journal of Statistical Software, 2012.
[22] Arampatzis, A., et al., KUN on the TREC-9 filtering track: incrementality, decay, and threshold optimization for adaptive filtering systems. 2000.
[23] Zhang, Y. and J. Callan, Maximum likelihood estimation for filtering thresholds. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001. ACM.
[24] Collins, M., Language Modelling (course notes), http://www.cs.columbia.edu/~mcollins/. 2013.