E.S.S.E
Editing Systems Standard Evaluation
Software Framework
G. Della Rocca, M. Di Zio, O. Luzi, A. Manzari
Italian National Statistical Institute
Methodological Studies Office
Procedures flow chart
[Flow chart: STEP 1 - the error simulation procedure turns the true data into raw data; STEP 2 - the editing procedure turns the raw data into clean data; STEP 3 - the evaluation procedure evaluates both the error simulation procedure and the editing procedure.]
Logical steps of the procedures
Steps:
• 1: Artificial generation of a raw data file by
randomly introducing errors among true data
• 2: Error localisation and correction by the editing
procedure
• 3: Evaluation of error localisation and error
correction capabilities of the procedure
STEP 1 (MCAR)
Objective:
Definition of a project by the introduction of predefined percentages of non-sampling errors among the data, on the basis of known error generation models
Methodological aspects:
• The system can be applied to any survey.
• It handles both quantitative and qualitative
variables.
• It implements the most frequent stochastic or
systematic error generation models
STEP 1.a: Error Generation Models
Models for qualitative and quantitative variables:
• Item non-response
• Misplacement errors
• Interchange of value
• Routing errors
• Interchange errors
STEP 1: Error Generation Models
Models for quantitative variables only:
• Loss or addition of zeroes
• Under-report
• Outliers generation
Model for all selected variables:
• Keying error
STEP 1 Architecture
[Flow chart: starting from the true data, records are selected, then variables are selected and an error generation model is chosen and applied, using the model parameters and the simulation effects; the loop repeats while there are other variables or other records to contaminate, then stops, producing the raw data.]
STEP 2
Objective:
Error localisation and correction by the
editing procedure
Step 2 architecture:
export the perturbed SAS dataset to a format suitable for the editing procedure
STEP 3
Objective:
Evaluation of the error localisation and
error correction procedures
Step 3 architecture:
• Import correct dataset
• Merge original, perturbed and corrected datasets
• Computation of evaluation indexes
• Graphical visualisation
E.S.S.E.
Editing System Standard Evaluation
Giorgio Della Rocca, Marco Di Zio, Orietta Luzi, Antonia Manzari
Istituto Nazionale di Statistica
Abstract
The main target of Editing and Imputation (E&I) procedures is to find and correct non-sampling errors in order to improve the quality of the statistical information. In this context we say that an error occurs when there is a deviation from the true value of a variable. By using logical and mathematical constraints (edits) we find suspect values which can be considered erroneous, and a correction is performed for them. ESSE is a generalised software system, developed in SAS, created to evaluate E&I procedures. From a given data set, it allows the user to create another data set by introducing the most frequent types of non-sampling errors while keeping the incidence of each single typology under control. It also allows the user to introduce missing values according to the most common models of missingness: MCAR (Missing Completely At Random) and MAR (Missing At Random). The evaluation of a given procedure is made by means of some ad hoc indicators. Finally, a user-friendly interface, developed in SAS/AF, allows the user to go through the different steps of the software, from the definition of the parameters needed by ESSE to the analysis of the results.
1. Introduction
In any statistical process, whether a sample survey or a census, data are affected by the presence of non-sampling errors arising from several causes and in different phases of the statistical process (data capture, registration, coding, etc.). This situation can lead to low-quality estimates of the aggregates of interest, both in terms of bias and variance of the estimator. For this reason, in order to guarantee a high level of accuracy of the final data, an important step in the production of statistical information is data editing, which mainly consists of two parts: detecting non-sampling errors and then correcting them (imputation). In particular, errors in data can be immediately detected if they are completeness errors (i.e. item non-responses) or domain errors, and the only problem is how to impute the involved variables. But in case of failure of route edits (rules corresponding to the instructions for the compilation of the questionnaire) or coherence edits (logical constraints or statistical relationships among variables), the problem is to localise errors, in other words to decide which variables are responsible for the failure of the edits. In many cases this can be a non-trivial problem. As these problems can be tackled in several different ways, it is important to have an evaluation of the proposed editing and imputation (E&I) procedure, in order to choose the most convenient approach.
For the evaluation of an E&I procedure three different types of data-sets are needed:
1. a set of true data;
2. a set of raw data;
3. a set of edited data, arising from the proposed procedure.
Appropriate comparisons of these data sets lead to an evaluation of the proposed data editing procedure.
The generalised software ESSE is based on the simulation of the situations that can be encountered in reality: a set of true data, i.e. data without errors with respect to a set of edit rules, is artificially perturbed by means of a controlled error generation process. This produces a set of raw data, which mimics the data collected for statistical purposes in a real situation, i.e. data affected by non-sampling errors. The raw data are submitted to the proposed data editing procedure, obtaining the edited data, which in a real situation represent the final data. Figure 1 shows the three main phases of the evaluation process for an E&I procedure:
1. error simulation;
2. editing and imputation of errors;
3. evaluation of the E&I procedure.
ESSE is a generalised software system which allows the researcher to manage the simulation and evaluation phases needed for the evaluation of an E&I procedure. It is worth noting that this software has been officially selected as the system to be used for the evaluation of Editing and Imputation procedures in the European project EUREDIT1.
[Flow chart: True Data → Error Simulation → Raw Data → Editing and Imputation → Edited Data; the Evaluation Procedure compares the data sets and produces the simulation evaluation and the E&I evaluation.]
Figure 1. Flow chart of the evaluation procedure
2. Error simulation
The procedure for the simulation of raw data (perturbed data) is based on the introduction of artificial errors into the set of true data. The user can control the impact of the artificial errors on the data by specifying the error rate. Currently two strategies are available in ESSE, namely MCAR and MAR error simulation. The two strategies, as the names suggest, differ mainly in the phase of item non-response generation: in the first, missing values are generated according to the MCAR model (Missing Completely At Random), while for the second we chose to act in the MAR (Missing At Random) context; details can be found in Rubin (1987) and Schafer (1997).
1 Funded by the European Community in the 5th Framework Programme for Research in Official Statistics.
2.1 MCAR error simulation
This strategy allows the user to introduce several typologies of errors, which can be grouped with respect to the nature of the variable to be treated, i.e. qualitative or quantitative variables; further details are given in Luzi and Della Rocca (1998).
1. Typology of errors for both qualitative and quantitative variables:
Item non-response: the true value is replaced by a missing value;
The item non-response model generates missing values on the basis of a Missing Completely At Random mechanism, that is, the probability of a value being missing is independent of the values in the hypothetical complete data set.
Misplacement errors: for two adjacent variables (of the same type) only, the two values are swapped;
Interchange of value: if the variable length is one digit, the true value is replaced by a wrong one
chosen in the domain; if the variable length is more than one digit, a random couple of digits is
chosen and swapped;
Interchange errors: the true value is replaced by a wrong one chosen in the domain;
Routing errors: implemented to perturb responses at skip (or routing) instructions, that is, when the answer to a variable (dependent) is required only for some values of another variable (filter). The user has to define the condition which requires an answer to the dependent variable.
The dependent variable value is set to missing for each unit whose condition is true.
A value randomly chosen in the domain is assigned to the dependent variable for each unit whose condition is false.
2. Typology of errors only for quantitative variables:
Loss or addition of zeroes: a numerical value ending in zero is rewritten either by adding a zero or
by dropping a zero;
Under-report error: a numerical value is replaced by a wrong nonnegative value lower than the true
one.
Outlier generation: this model is still under development; the original value is substituted by a value belonging to the range (max, 1.05·max), where max is the maximum value observed.
3. Overall error:
Keying error: a digit is replaced by a wrong numeric digit (a value in the interval [0-9]). This is called an overall error because it can randomly affect all the chosen variables.
It is important to stress that the user can choose the error rate independently both for each single error generation model and for each variable. Figure 2 shows an example of the frame implemented in ESSE for the parameter definition phase.
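To make the mechanics of this step concrete, the following minimal Python sketch mimics two of the MCAR perturbation models (item non-response and keying errors) applied at user-defined rates. It is only an illustration under our own assumptions: ESSE itself is implemented in SAS, and the function names below are hypothetical.

import random

rng = random.Random(42)  # fixed seed so the perturbation is reproducible

def inject_item_nonresponse(values, rate):
    # MCAR item non-response: each value becomes missing with probability `rate`.
    return [None if rng.random() < rate else v for v in values]

def inject_keying_errors(values, rate):
    # Keying error: with probability `rate`, one digit of the value is replaced
    # by a different digit chosen uniformly in 0-9.
    out = []
    for v in values:
        s = "" if v is None else str(v)
        digit_positions = [i for i, ch in enumerate(s) if ch.isdigit()]
        if v is None or not digit_positions or rng.random() >= rate:
            out.append(v)
            continue
        pos = rng.choice(digit_positions)
        wrong = rng.choice([d for d in "0123456789" if d != s[pos]])
        s = s[:pos] + wrong + s[pos + 1:]
        out.append(int(s) if isinstance(v, int) else s)  # kept as a string if not an integer
    return out

# Example: contaminate a quantitative variable with 10% missing values and 5% keying errors.
true_values = [120, 340, 505, 78, 910, 66, 1030, 25]
raw_values = inject_keying_errors(inject_item_nonresponse(true_values, 0.10), 0.05)

In ESSE each error generation model and each variable receives its own rate, which is what the per-model and per-variable parameters mentioned above control.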
2.2 MAR error simulation
This is a separate module that generates item non-response according to a model based on the MAR mechanism, that is, the probability of an item being missing does not depend on the value of the item itself, although it can depend on the observed values of other variables. This module is disjoint from the MCAR error simulation module, which means that it is not possible to introduce at the same time both MAR and MCAR item non-response, or MAR item non-response together with the several kinds of errors previously described. The MAR mechanism has been implemented by modelling the expected probability of occurrence of a missing entry by means of a logistic model (Agresti, 1990).
Let D be a complete data set; our goal is to introduce some missing values in the variable Y. Let the probability of an item of Y being missing depend on the observed values of the variables X1, …, Xk. According to this model, the probability of Y being missing in the jth unit, given the observed values x1j, …, xkj, is:
P(Y_j = \text{missing} \mid x_{1j}, \dots, x_{kj}) = \frac{e^{\beta_0 + \sum_{i=1}^{k} \beta_i x_{ij}}}{1 + e^{\beta_0 + \sum_{i=1}^{k} \beta_i x_{ij}}}    (1)
The expected probability to be missing for the Y variable in the data set is given by
averaging the probability to be missing for each unit j, that is:
\bar{P} = P(Y = \text{missing}) = \frac{1}{n} \sum_{j=1}^{n} P(Y_j = \text{missing}) .
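To make formula (1) and the average probability concrete, here is a small Python sketch (an illustration of the model only, not ESSE's SAS code; the covariate values and β coefficients below are invented):

import math

def p_missing(beta0, betas, x_row):
    # Formula (1): logistic probability that Y is missing for one unit,
    # given its observed covariates x_row = (x1j, ..., xkj).
    eta = beta0 + sum(b * x for b, x in zip(betas, x_row))
    return math.exp(eta) / (1.0 + math.exp(eta))

def average_p_missing(beta0, betas, X):
    # Expected missingness rate: the per-unit probabilities averaged over the n units.
    probs = [p_missing(beta0, betas, row) for row in X]
    return sum(probs) / len(probs)

# Illustrative data: n = 4 units, k = 2 explanatory variables.
X = [(1.0, 0.0), (2.5, 1.0), (0.3, 0.0), (4.0, 1.0)]
print(average_p_missing(beta0=-2.0, betas=[0.5, 1.0], X=X))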
In order to generate missing values, the user has to specify in the ESSE software the variable Y to be perturbed and the k explanatory variables X1, …, Xk, paying attention to whether some of the variables are categorical. In this case, they have to be recoded using dummy variables (Hosmer and Lemeshow, 1989). Suppose that the jth independent variable Xj has sj levels. The sj - 1 dummy variables will be denoted as Dju and the coefficients of these dummy variables as βju, u = 1, 2, …, sj - 1; thus in the general equation (1) we have that:
\beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \dots + \beta_k x_k = \beta_0 + \beta_1 x_1 + \dots + \sum_{u=1}^{s_j - 1} \beta_{ju} D_{ju} + \dots + \beta_k x_k ,
where Dju can be either 0 or 1. For example, let Xj be a variable with three possible categories, coded as "a", "b", "c". One possible coding strategy is to introduce the two design variables D1 and D2 and set all the possible choices in this way:
          Dummy variables
Xj        D1        D2
a         0         0
b         1         0
c         0         1
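The recoding into design variables can be sketched as follows (a hypothetical Python helper, shown only to make the coding scheme of the table explicit; in ESSE the user prepares these variables before the simulation):

def dummy_code(values, levels):
    # Map a categorical variable with s levels onto s-1 dummy (design) variables;
    # the first level in `levels` is the reference category, coded as all zeros.
    others = levels[1:]
    return [[1 if v == level else 0 for level in others] for v in values]

# Reproducing the table above: "a" -> (0, 0), "b" -> (1, 0), "c" -> (0, 1).
print(dummy_code(["a", "b", "c", "b"], levels=["a", "b", "c"]))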
Once the required transformations have been completed, the system starts with some default values assigned to the k+1 parameters β. The user can change them until the situation is satisfactory. The system helps the user in deciding which values to assign to the parameters: for each vector of values given to the parameters β, the software shows the average probability P̄ of being missing, and draws a graph showing the probability of being missing for each single unit, P(Yj = missing | x1j, …, xkj) (figure 4). This can help to better understand the composition and the concentration of the probabilities.
However, in the definition of the β parameters, the user has to keep in mind their meaning; in this situation the interpretation of the coefficients can be a useful hint. Let
π(x) = P(Y = missing | x),
where x = (X1, …, Xk).
By using the logit transformation we obtain:
g(x) = \ln\!\left( \frac{\pi(x)}{1 - \pi(x)} \right) = \beta_0 + \sum_{i=1}^{k} \beta_i x_i .
For example, let us suppose that we have just one continuous covariate x; the β1 coefficient gives a measure of the change in the log odds ratio (a sort of relative risk)
g(x + \delta) - g(x) = \ln\!\left( \frac{\pi(x+\delta)}{1 - \pi(x+\delta)} \Big/ \frac{\pi(x)}{1 - \pi(x)} \right) ,
where δ ≠ 0, for an increase of δ units in x; that is, β1δ = g(x+δ) - g(x) for any value of x.
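As a purely illustrative numeric example (the coefficient value is invented here, not taken from ESSE): with a single continuous covariate and β1 = ln 2 ≈ 0.693, a one-unit increase in x (δ = 1) doubles the odds that Y is missing, since

\frac{\pi(x+1)/\bigl(1-\pi(x+1)\bigr)}{\pi(x)/\bigl(1-\pi(x)\bigr)} = e^{\,g(x+1)-g(x)} = e^{\beta_1} = 2 .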
Once the needed parameters have been specified, the system introduces the missing values on the basis of the model described above; moreover, even after the simulation phase has been applied, the user can restart this step changing the β values.
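A minimal sketch of the generation step itself, again in Python rather than in ESSE's SAS/AF implementation, and reusing the hypothetical p_missing helper above: each unit's value of Y is blanked by an independent Bernoulli draw with its own logistic probability, so the realised missingness rate fluctuates around the expected P̄.

import random

rng = random.Random(7)

def generate_mar_missing(y, X, beta0, betas):
    # Per-unit Bernoulli draw: Y_j is set to missing with probability P(Y_j = missing | x_j).
    return [None if rng.random() < p_missing(beta0, betas, row) else yj
            for yj, row in zip(y, X)]

y = [10.5, 3.2, 7.7, 1.1]
X = [(1.0, 0.0), (2.5, 1.0), (0.3, 0.0), (4.0, 1.0)]
y_perturbed = generate_mar_missing(y, X, beta0=-2.0, betas=[0.5, 1.0])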
3. Editing and Imputation phase
The localisation of errors and consequently their correction is an external phase with respect to the ESSE software. Several generalised software systems exist for this task; the most used in ISTAT are:
SCIA: software developed by ISTAT for the UNIX and Windows environments and written in FORTRAN. It works only with qualitative variables;
GEIS: developed by Statistics Canada for the UNIX and ORACLE environments. It works only with quantitative variables;
RIDA: software developed by ISTAT for the UNIX environment. It is used for the imputation phase.
Since our goal is to compare different E&I procedures, we provide the possibility to export/import data sets in the most common formats: text format (*.TXT), Excel format (*.XLS), DBF format (*.DBF), comma-separated format (*.CSV), fixed-column format (*.DAT) and SAS format.
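Outside of SAS, the same kind of hand-off could be done in a few lines; for example, a generic Python/pandas sketch (not part of ESSE) exporting a perturbed data set to some of the formats listed above:

import pandas as pd

raw = pd.DataFrame({"id": [1, 2, 3], "turnover": [120, None, 505]})

raw.to_csv("raw_data.csv", index=False)            # comma-separated format (*.CSV)
raw.to_csv("raw_data.txt", sep="\t", index=False)  # tab-delimited text format (*.TXT)
raw.to_excel("raw_data.xlsx", index=False)         # Excel format (needs an engine such as openpyxl)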
4. Evaluation of an E&I procedure
The quality of an E&I procedure can be evaluated according to two points of view (Granquist, 1997; Stefanowicz, 1997):
a) in terms of the distance between the estimates computed on the true values and the estimates computed on the edited values;
b) in terms of the number of detected, undetected and introduced errors.
In the first approach, the study focuses on the effect of editing on reported data (output-oriented approach), while the latter focuses on the effect of editing on individual data items (input-oriented approach).
We focus our attention on the evaluation of the quality of an E&I process in terms of its capability to recognise errors and adequately replace them with the true values: the higher the proportion of corrected errors on the total, and the lower the quantity of new errors introduced, the more accurate the process. In other words, our evaluation is not estimates-oriented: the accuracy of an E&I process is not measured by comparing the statistical estimates obtained from the clean data with the corresponding values resulting from the true data.
4.1 Quality of editing process
Let us consider any raw value of a generic variable. We can regard the editing decision as the result of a screening procedure designed to detect deviations of the raw values from the true values. According to the basic concepts of diagnostic tests (Armitage and Berry, 1971), considering the whole set of values assumed by the variable in the data set, we can define:
α: the proportion of false positives, i.e. not perturbed values (true values) considered as erroneous;
β: the proportion of false negatives, i.e. perturbed values not recognised as erroneous.
The analogy of editing decisions with diagnostic tests allows us to resort to familiar concepts: the probability of recognising not perturbed values as true is analogous to the specificity (1-α), while the probability of recognising perturbed values as erroneous is analogous to the sensitivity (1-β). The extent of (1-α) and (1-β) indicates the reliability of the editing process in assessing the correctness of true and perturbed values.
We can estimate these probabilities by applying the editing process to a set of known
perturbed and true data. Suppose that, for a generic variable, the application gives the following
frequencies:
                         Editing classification
Original data            erroneous        true
  perturbed                  a              b
  not perturbed              c              d
Where:
a = number of perturbed data classified by the editing process as erroneous,
b = number of perturbed data classified by the editing process as true,
c = number of true (not perturbed) data classified by the editing process as erroneous,
d = number of true (not perturbed) data classified by the editing process as true.
Cases placed on the secondary diagonal represent the failures of the editing process.
The ratio c/(c+d) measures the failures of the editing process for true data and can be used to estimate α.
The ratio b/(a+b) measures the failure of the editing process for perturbed data and estimates β.
The ratios E_tru = d/(c+d) and E_mod = a/(a+b) estimate (1-α) and (1-β), respectively.
With regard to any single variable, for the whole set of data (perturbed and true) the
accuracy of the editing process is measured by the fraction of total cases that are correctly
classified:
E_tot = (a+d) / (a+b+c+d).
The reader can easily verify that this is a linear combination of E_tru and E_mod, whose weights
are the fraction of the total cases that are true and the fraction of total cases that are perturbed:
E_tot = E_tru · (c+d)/(a+b+c+d) + E_mod · (a+b)/(a+b+c+d).
Therefore, its value can be strongly affected by the error level in data.
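The counts a, b, c and d can be obtained mechanically by comparing, value by value, the true, raw (perturbed) and edited versions of a variable. A hedged Python sketch of this comparison (our own hypothetical helper, which approximates "classified as erroneous" by "changed by the E&I procedure"):

def editing_counts(true_vals, raw_vals, edited_vals):
    # a: perturbed values flagged as erroneous   b: perturbed values accepted as true
    # c: true values flagged as erroneous        d: true values accepted as true
    a = b = c = d = 0
    for t, r, e in zip(true_vals, raw_vals, edited_vals):
        perturbed = (r != t)   # was the value contaminated in the simulation step?
        flagged = (e != r)     # did the E&I procedure change it?
        if perturbed and flagged:
            a += 1
        elif perturbed:
            b += 1
        elif flagged:
            c += 1
        else:
            d += 1
    return a, b, c, d

a, b, c, d = editing_counts([5, 7, 9], [5, 0, 9], [5, 7, 9])
print(a, b, c, d)  # 1 0 0 2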
4.2 Quality of imputation process
We require that the imputation process imputes only values previously classified as erroneous by the editing process. The newly assigned value can be equal to the original one or different from it. In the first case the imputation process can be deemed successful; in the latter case we say that the imputation process fails.
Actually, in the case of quantitative variables, it is not necessary that the imputed value
precisely equals the original value to consider the imputation as successful: it could be sufficient
that the new value lies in an interval whose centre is the original value. In the case of qualitative
variables, if the value classified as erroneous is a true one, the imputation process always fails.
We now refer to the general case of both qualitative and numeric variables. Referring to the
previous table, a and c can be decomposed in
a = as + af
and
c = cs + cf
where:
as = number of perturbed data classified by the editing process as erroneous and successfully
imputed,
af = number of perturbed data classified by the editing process as erroneous and imputed without
success,
cs = number of true data classified by the editing process as erroneous and successfully imputed,
cf = number of true data classified by the editing process as erroneous and imputed without success.
We can decompose the previous table into:

                                 E&I classification
                              Erroneous                                True
                   Successfully imputed   Not successfully imputed
Original data
  Perturbed                as                      af                    b
  Not perturbed            cs                      cf                    d
The quality of the imputation process can be evaluated by the fraction of imputed data for
which the imputation process is successful. For imputed perturbed values we compute:
I_mod = as /a.
The fraction of imputed true values for which the imputation is successful, i.e.
I_tru = cs /c
should not be used to evaluate the intrinsic quality of the imputation process, because imputing means changing the raw value, and cs /c is only an artificial result due to the definition of successful imputation for numeric variables. The same observation holds for the fraction of imputed total values, i.e.
I_tot = (as+cs)/(a+c)
because it under-estimates the intrinsic quality of the imputation process.
4.3 The overall quality of edit and imputation process
We can now consider the whole E&I process. The accuracy of the E&I process with regard to true data is measured by estimating the probability of not introducing new errors into the data. It is the total probability of two different events: classifying a true value as true, or imputing a value close to the true one in case of incorrect classification. It can be estimated by:
E&I_tru = E_tru + [(1-E_tru) (I_tru)] = (cs+d) / (c+d).
The accuracy of the E&I process for perturbed data is measured by estimating the probability of correcting perturbed data. It is the joint probability of a combination of two results: classifying a perturbed value as erroneous and restoring the true value. It can be estimated by:
E&I_mod = (E_mod) (I_mod) = as / (a+b).
For the whole set of data (perturbed and true) the accuracy of the E&I process is measured by the
fraction of total cases whose true value is correctly restored: E&I_tot = (as+cs+d) / (a+b+c+d).
Even in this case, it is easy to verify:
E&I_tot = E&I_tru · (c+d)/(a+b+c+d) + E&I_mod · (a+b)/(a+b+c+d).
The indices defined above are summarised in Table 1.
Table 1. Accuracy indices

Editing accuracy
  Class of data     Index                                                                                 Calculation
  true data         E_tru: fraction of true data that are correctly classified                            d/(c+d)
  perturbed data    E_mod: fraction of perturbed data that are correctly classified                       a/(a+b)
  total data        E_tot: fraction of total data that are correctly classified                           (a+d)/(a+b+c+d)

Imputation accuracy
  true data         I_tru: fraction of imputed true data whose true value is correctly restored           cs/c
  perturbed data    I_mod: fraction of imputed perturbed data whose true value is correctly restored      as/a
  total data        I_tot: fraction of imputed total data whose true value is correctly restored          (as+cs)/(a+c)

E&I accuracy
  true data         E&I_tru: fraction of true data whose true value is correctly restored                 (cs+d)/(c+d)
  perturbed data    E&I_mod: fraction of perturbed data whose true value is correctly restored            as/(a+b)
  total data        E&I_tot: fraction of total data whose true value is correctly restored                (as+cs+d)/(a+b+c+d)
Each index defined in Table 1 ranges from 0 (no accuracy) to 1 (maximum accuracy).
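Since every index in Table 1 is a simple ratio of the counts a, b, c, d and of their successfully-imputed components as and cs, they can be computed directly once those counts are available. A minimal Python sketch (not ESSE's SAS implementation; the guard against empty denominators is our own addition):

def accuracy_indices(a, b, c, d, a_s, c_s):
    # a_s <= a: perturbed data flagged as erroneous and successfully imputed
    # c_s <= c: true data flagged as erroneous and successfully imputed
    ratio = lambda num, den: num / den if den else float("nan")
    total = a + b + c + d
    return {
        "E_tru": ratio(d, c + d),
        "E_mod": ratio(a, a + b),
        "E_tot": ratio(a + d, total),
        "I_tru": ratio(c_s, c),
        "I_mod": ratio(a_s, a),
        "I_tot": ratio(a_s + c_s, a + c),
        "E&I_tru": ratio(c_s + d, c + d),
        "E&I_mod": ratio(a_s, a + b),
        "E&I_tot": ratio(a_s + c_s + d, total),
    }

# Example with invented counts: 100 perturbed values (80 flagged, 60 restored),
# 900 true values (5 wrongly flagged, 2 of them re-imputed close to the truth).
print(accuracy_indices(a=80, b=20, c=5, d=895, a_s=60, c_s=2))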
5. Future developments
Regarding future developments, the authors are planning to implement additional features in the software in order to:
(i) permit the automated iteration of the whole test process (error simulation/E&I/evaluation) any number of times, in order to estimate mean values and variances of the defined indicators;
(ii) carry out the evaluation of the quality of an E&I procedure also in terms of a comparison between statistics computed from the true values and those computed from the edited values (estimates-oriented evaluation);
(iii) allow the user to use the two error models, MAR and MCAR, at the same time.
REFERENCES
Agresti A. (1990) Categorical Data Analysis. New York: John Wiley & Sons.
Armitage P., Berry G. (1971) Statistical Methods in Medical Research. Oxford: Blackwell Scientific Publications.
Granquist L. (1997) An overview of methods of evaluating data editing procedures. In Statistical Data Editing, Methods and Techniques, Vol. 2, Statistical Standards and Studies No 48, UN/ECE, 112-123.
Hosmer D.W., Lemeshow S. (1989) Applied Logistic Regression. John Wiley & Sons.
Luzi O., Della Rocca G. (1998) A Generalised Error Simulation System to Test the Performance of Editing Procedures. Proceedings of SEUGI 16, Prague, 9-12 June 1998.
Rubin D.B. (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Schafer J.L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall.
Stefanowicz B. (1997) Selected issues of data editing. In Statistical Data Editing, Methods and Techniques, Vol. 2, Statistical Standards and Studies No 48, UN/ECE, 109-111.
Figure 2. Set up frame for perturbation parameters
Figure 3. Evaluation Indices frame
Figure 4. Set up parameters frame.