* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download E.S.S.E. Editing Systems Standard Evaluation - Software Framework
Survey
Document related concepts
Transcript
E.S.S.E Editing Systems Standard Evaluation Software Framework G. Della Rocca, M. Di Zio, O. Luzi, A. Manzari Italian National Statistical Institute Methodological Studies Office Procedures flow chart STEP 2 STEP 1 true data Error simulation procedure Editing procedure raw data STEP 3 Evaluation procedure Error simulation procedure evaluation Editing procedure evaluation clean data Logical steps of the procedures Steps: • 1: Artificial generation of a raw data file by randomly introducing errors among true data • 2: Error localisation and correction by the editing procedure • 3: Evaluation of error localisation and error correction capabilities of the procedure STEP 1 (mcar) Objective: Definition of a project by the introduction of predefined percentages of non sampling errors among data on the basis of known error generation models Methodological aspects: • The system can be applied to any survey. • It handles both quantitative and qualitative variables. • It implements the most frequent stochastic or systematic error generation models STEP 1.a: Error Generation Models Models for qualitative and quantitative variables: • Item non response • Misplacement errors • Interchange value • Routing errors • Interchange error STEP 1: Error Generation Models Models for quantitative variables only: • Loss or addition of zeroes • Under-report • Outliers generation Model for all variables selected : • Keying error STEP 1 Architecture True data Record selection Models param. Variable selection Error model selection Simulation effects Model application yes Other variables to contaminate? Raw data not yes Other records to contaminate? not Stop STEP 2 Objective: Error localisation and correction by the editing procedure Step 2 architecture: export the perturbed SAS dataset to the suitable dataset for the editing procedure STEP 3 Objective: Evaluation of the error localisation and error correction procedures Step 3 architecture: • Import correct dataset • Merge original, perturbed and corrected datasets • Computation of evaluation indexes • Graphical visualisation E.S.S.E. Editing System Standard Evaluation Giorgio Della Rocca, Marco Di Zio, Orietta Luzi, Antonia Manzari Istituto Nazionale di Statistica Abstract The main target of the Editing and Imputation (E&I) procedures is to find and correct non-sample errors in order to improve the quality of the statistical information. In this context we say that an error is occurred when there is a deviation from the true value of a variable. By using logical and mathematical constraints (edits) we find some suspect values which can be considered as erroneous and for them a correction is performed. ESSE is a generalised software, developed in SAS, created to evaluate the E&I procedures. In fact, from a given data set, it allows to create another data set by introducing the most frequent types of non-sampling errors keeping under control the incidence of each single typology. It also allows to introduce missing values according to the most common models of missingness: MCAR (Missing Completely at Random) and MAR (Missing at Random). The evaluation of a given procedure is made by means of some ad hoc indicators. Finally a user-friendly interface, developed in SAS/AF, allows the user to go through the different steps of the software, from the definition of the parameters needed by ESSE to the analysis of the results. 1. Introduction In any statistical process, either survey or census investigation, data are affected by the presence of non-sampling errors arising from several causes, and in different phases of the statistical process (data capture, registration, coding, etc). This situation can lead to low quality estimates of the aggregates of interest, both in terms of bias and variance of the estimator. For this reason, in order to guarantee a high level of accuracy of the final data, an important step in the production of statistical information is data editing, which mainly consists of two parts: detect nonsampling errors and eventually correct them (imputation). In particular errors in data can be immediately detected if they are completeness errors (i.e. item non responses), or domain errors, and the only problem is how to impute the involved variables. But in case of failure of route edits (rules corresponding to the instructions for the compilation of the questionnaire), or coherence edits (logical constraints or statistical relationships among variables), a problem arises to localise errors, in other words to decide what are the variables responsible for the failure of the edits. In many cases this can be a non trivial problem. As these problems can be tackled in several different ways, it would be important to have an evaluation of the proposed editing and imputation procedure (E&I), in order to choose the most convenient approach. For the evaluation of an E&I procedure three different types of data-sets are needed: 1. 2. 3. a set of true data; a set of raw data; a set of edited data, arising from the proposed procedure. Opportune comparisons of the previous sets lead us to an evaluation of the data editing procedure proposed. The generalised software ESSE is based on the simulation of the possible situations that can be encountered in the reality: a set of true data, which means without errors with respect to a set of edit rules, is artificially perturbed by means of a controlled process of errors generation. This produces a set of raw data, which mimics in a real situations the data collected for statistical purposes, i.e. data affected by non-sampling errors. Raw data are submitted to the proposed data editing procedure, obtaining the edited data, which in a real situation represent the final data. The figure 1 shows the main three phases of the evaluation process for an E&I procedure: 1. error simulation; 2. editing and imputation of errors; 3. evaluation of the E&I procedure. ESSE is a generalised software which allows the researcher to manage with simulation and evaluation phases needed for the evaluation of an E&I procedure. It is worth writing that this software has been officially selected as the system to be used for the evaluation of Editing and Imputation procedure in the European project EUREDIT1. True Data Error Simulation Raw Data Editing and Imputation Edited Data Simulation Evaluation E&I evaluation Evaluation Procedure Figure 1. Flow chart of the evaluation procedure 2. Error simulation The procedure for the simulation of raw data (perturbed data) is based on the introduction of artificial errors in the set of true data. The user can control the impact on the data of artificial errors, by specifying the rate of error. Actually in ESSE two strategies are available, i.e. MCAR and MAR error simulation. The two strategies, as the names suggest, are mainly distinct for the phase of item non-response generation: in fact in the first model the missing values are generated according to the MCAR model (Missing Completely At Random), while for the second we chose to act in the MAR (Missing At Random) context; details can be found in Rubin, 1987 and Schafer, 1997. 1 Founded by the European Community in the 5th Framework Programme for Research in Official Statistics 2.1 MCAR error simulation This strategy allows the user to introduce several typologies of errors, which can be further grouped with respect to the nature of the variable to be treated, i.e. qualitative variables, quantitative variables, further details are given in Luzi et al.,1998. 1. Typology of errors for both qualitative and quantitative variables: Item non-response: the true value is replaced by a missing value; The item non-response model generates missing values on the basis of a Missing Completely At Random mechanism, that is the probability of a value to be missing is independent of the values in the hypothetical complete data set. Misplacement errors: for two adjacent variables only (of the same type): the two values are swapped; Interchange of value: if the variable length is one digit, the true value is replaced by a wrong one chosen in the domain; if the variable length is more than one digit, a random couple of digits is chosen and swapped; Interchange errors: the true value is replaced by a wrong one chosen in the domain; Routing errors: implemented to perturb responses at skip (or routing) instructions that is when the answer to a variable (dependent) is required only for some values of another variable (filter). The user has to define the condition which requires the answer to the dependent variable. The dependent variable value is set to missing for each unit whose condition is true. A value randomly chosen in the domain is assigned to the dependent variable for each unit whose condition is false. 2. Typology of errors only for quantitative variables: Loss or addition of zeroes: a numerical value ending in zero is rewritten either by adding a zero or by dropping a zero; Under-report error: a numerical value is replaced by a wrong nonnegative value lower than the true one. Outliers generation: this model generation is still in progress, the original value is substituted by a value belonging to the range (max , 1.05 max), where max is the maximum value observed. 3. Overall error: Keying error: a digit is replaced by a wrong numeric digit (value lying in the interval [0-9]). This is called an overall error because it can affect randomly all the variables chosen. It is important to stress that the user has the possibility to choose the error rate independently both for each single error generation model, and for each variable. In figure 2 an example of the frame implemented in ESSE for the parameters definition phase is shown. 2.2 MAR error simulation This is a separate module to generate item non-response according to a model based on the MAR mechanism, that is the probability of an item to be missing does not depend on the value of the item itself, although it can depend on observed data of other variables. Actually this module is disjoint with MCAR error simulation module, which means that it is not possible to introduce, at the same time, either MAR item non-response and MCAR item non-response, or MAR item nonresponse and the several kind of errors previously explained. The MAR mechanism has been implemented by modelling the expected probability of occurrence of a missing entry by means of a logistic model (Agresti, 1990). Let D be a complete data set, our goal is to introduce some missing values in the Y variable. Let the probability of an item of Y to be missing, be dependent on observed data of the X1,…, Xk variables. According to this model, the probability to be missing for the Y variable in the jth unit, given the observed values x1j,…, xkj , is: $0 " P(Yj =missing | x1j,…, xkj) = k ! $ i x ij i #1 e 1" e $0 " k ! (1) $ i x ij i #1 The expected probability to be missing for the Y variable in the data set is given by averaging the probability to be missing for each unit j, that is: _ P = P(Y = missing) = 1 n ! P(Y j # missing ) . n j #1 In order to generate missing values, the user have to set up into the ESSE software the variable Y to be perturbed, and the k explanatory variables X1,…, Xk, paying attention if some of the variables are categorical. In this case, they have to be rewritten by using dummy variables (Hosmer et al. 1989). Suppose that the hth independent variable Xh has sj levels. The sj - 1 dummy variables will be denoted as Dju and the coefficients for these dummy variables will be denoted as $ju , u=1, 2, …, sj - 1, thus in the general equation (1) we have that: s j %1 $ 0 " $ 1 x1 " ... " $ j x j " ... " $ k x k # $ 0 " $ 1 x1 " ... " ! $ ju D ju " ... " $ k x k , u #1 where Dju can be either 0 or 1. For example let Xj be a variable with three possible categories, which has been coded as "a", "b", "c". One possible coding strategy is introduce the two design variables D1 and D2 and set all the possible choices in this way: Dummy Xj a b c D1 0 1 0 Variables D2 0 0 1 Completed the transformations required, the system will start with some default values assigned to the k+1 parameters $&. The user have the possibility to change them, until the situation is satisfactory. The system helps the user in deciding which values to assign to the parameters, in fact for each vector of values given to the parameters $& , the software shows the average probability P to be missing, and draws a graph showing the probability to be missing for each single unit P(Yj =missing | X1j,…, Xkj) (figure 4). This can help to better understand the composition and the concentration of the probabilities. However in the definition of the $& parameters, the user have to keep in mind their meaning, in this situation the interpretation of the coefficients can be a useful hint. Let it be '(x) = P( Y = missing | x ) where x = (X1,…, Xk ). By using the logit transformation we obtain: k . ' ( x) + )) = $ 0 " ! $ i xi . g(x) = ln ,, i #1 - 1 % ' ( x) * For example, let us suppose that we have just one continuous covariate x; the $1 coefficient gives a measure of the change in the log odds ratio (a sort of relative risk) . ' (x " / ) ' ( x) + ) where / 1 0 , for an increase of / unit in x, that is g ( x " / ) % g ( x) # ln,, 1 % ' ( x) )* -1 % ' (x " / ) $23#3g(x+/)-g(x) for any value of x. Once the needed parameters have been specified, the system will provide to introduce the missing values on the basis of the model so far introduced; moreover, even after the simulation phase is applied, the user has the possibility to restart this step changing the $& values. 3. Editing and Imputation phase The localisation of errors and consequently their correction is actually an external phase to the ESSE software. There are several generalised software for this target, in particular the most used in ISTAT are: SCIA: a software developed by ISTAT in UNIX and WINDOWS environment, and written in FORTRAN. It works only with qualitative variables; GEIS: developed by Statistics Canada in UNIX and ORACLE environment. It works only with quantitative variables; RIDA: a software developed by ISTAT in UNIX environment. It is used for the imputation phase. Since our goal is to compare different E&I procedure, we gave the possibility to export/import data-sets in the most used formats: text format (*.TXT), excell format (*.XLS), DBF format (*.DBF), comma separator format (*.CSV), fixed column format (*.DAT) and SAS format. 4. Evaluation of an E&I procedure The quality of an E&I procedure can be evaluated according two points of view (Granquist, 1997; Stefanowicz, 1997): a) in terms of the distance between the estimates computed on the true values and the estimates computed on the edited values; b) in terms of the number of detected, undetected and introduced errors. In the first approach, the study is focused on the effect of editing on reported data (output oriented approach), while in the latter study is focused on the effect of editing on individual data items (input oriented approach). We focus our attention on the evaluation of the quality of an E&I process in terms of its capability to recognise errors and adequately replace them with the true values: the higher the proportion of corrected errors on the total, and the less the quantity of new errors introduced, the more accurate is the process. In other words, our evaluation is not estimates oriented: the accuracy of an E&I process is not measured by comparing the statistical estimates obtained from clean data with the corresponding values resulting from the true data. 4.1 Quality of editing process Let us consider whichever raw value of a generic variable. We can regard the editing decision as the result of a screening procedure designed to detect deviations of the raw values from the true values. Accordingly to the basic concepts of diagnostic tests (Armitage et al., 1971), considering all the set of values assumed by the variable in the data set, we can define: 4 : the proportion of false positives, i.e. not perturbed values (true values)considered as erroneous; $ : the proportion of false negatives, i.e. perturbed values not recognised as erroneous. The analogy of editing decisions with diagnostic tests allows us to resort to familiar concepts: the probability to recognise not perturbed values as true is analogous to the specificity (1-4), while the probability to recognise perturbed values as erroneous is analogous to the sensitivity (1-$). The extent of (1-4) and (1-$) indicates the reliability of the editing process in assessing the correctness of true and perturbed values. We can estimate these probabilities by applying the editing process to a set of known perturbed and true data. Suppose that, for a generic variable, the application gives the following frequencies: Editing classification erroneous true a b Original perturbed data c d not perturbed Where: a = number of perturbed data classified by the editing process as erroneous, b = number of perturbed data classified by the editing process as true, c = number of true (not perturbed) data classified by the editing process as erroneous, d = number of true (not perturbed) data classified by the editing process as true. Cases placed on the secondary diagonal represent the failures of the editing process. The ratio c/(c+d) measures the failures of the editing process for true data and can be used to estimate 4. The ratio b/(a+b) measures the failure of the editing process for perturbed data and estimates $. Ratios E_tru = d/(c+d) and estimate (1-4) and (1-$) respectively. E_mod = a/(a+b) With regard to any single variable, for the whole set of data (perturbed and true) the accuracy of the editing process is measured by the fraction of total cases that are correctly classified: E_tot = (a+d) / (a+b+c+d). The reader can easily verify that this is a linear combination of E_tru and E_mod, whose weights are the fraction of the total cases that are true and the fraction of total cases that are perturbed: E_tot # E_tru (c " d ) ( a " b) " E _ mod . (a " b " c " d ) (a " b " c " d ) Therefore, its value can be strongly affected by the error level in data. 4.2 Quality of imputation process We impose that the imputation process imputes only values previously classified by the editing process as erroneous. The new assigned value can be equal to the original one or different. In the first case the imputation process can be deemed as successful, in the latter case we can say that the imputation process fails. Actually, in the case of quantitative variables, it is not necessary that the imputed value precisely equals the original value to consider the imputation as successful: it could be sufficient that the new value lies in an interval whose centre is the original value. In the case of qualitative variables, if the value classified as erroneous is a true one, the imputation process always fails. We now refer to the general case of both qualitative and numeric variables. Referring to the previous table, a and c can be decomposed in a = as + af and c = cs + cf where: as = number of perturbed data classified by the editing process as erroneous and successfully imputed, af = number of perturbed data classified by the editing process as erroneous and imputed without success, cs = number of true data classified by the editing process as erroneous and successfully imputed, cf = number of true data classified by the editing process as erroneous and imputed without success. We can decompose the previous table in: E&I classification Erroneous Successfully Imputed Not Successfully Imputed Original Data True Perturbed ass af b Not perturbed cs cf d The quality of the imputation process can be evaluated by the fraction of imputed data for which the imputation process is successful. For imputed perturbed values we compute: I_mod = as /a. The fraction of imputed true values for which the imputation is successful, i.e. I_tru = cs /c is not to be considered to evaluate the inner quality of the imputation process, because imputing equals to change the raw value and cs /c is only an artificial result due to the definition of successful imputation for numeric variables. The same observation is valid for fraction of imputed total values, i.e. I_tot=(as+cs)/(a+c) because it under-estimates the inner quality of the imputation process. 4.3 The overall quality of edit and imputation process We can now consider the whole E&I process. The accuracy of the E&I process with regard to true data is measured by estimating the probability of not introducing new errors in data. It is the total probability of two different events: to classify as true a true value or to impute a value close to the true one in case of incorrect classification. It can be estimated by: E&I_tru = E_tru + [(1-E_tru) (I_tru)] = (cs+d) / (c+d). The accuracy of the E&I process for perturbed data is measured by estimating the probability to correct perturbed data. It is a joint probability of a combination of two results: to classify as erroneous a perturbed value and to restore the true value. It can be estimated by: E&I_mod = (E_mod) (I_mod) = as / (a+b). For the whole set of data (perturbed and true) the accuracy of the E&I process is measured by the fraction of total cases whose true value is correctly restored: E&I_tot = (as+cs+d) / (a+b+c+d). Even in this case, it is easy to verify: E & I_tot # (E & I_tru ) (c " d ) ( a " b) " (E & I_mod) . (a " b " c " d ) (a " b " c " d ) The indices defined above are summarised in Table 1. Table 1. Accuracy indices Class of data Index Editing accuracy true data E_tru: fraction of true data that are correctly classified perturbed data E_mod: fraction of perturbed data that are correctly classified total data E_tot: fraction of total data that are correctly classified Imputation accuracy true data I_tru: fraction of imputed true data whose true value is correctly restored perturbed data I_mod: fraction of imputed perturbed data whose true value is correctly restored total data I_tot: fraction of imputed total data whose true value is correctly restored E&I accuracy true data E&I_tru: fraction of true data whose true value is correctly restored perturbed data E&I_mod: fraction of perturbed data whose true value is correctly restored total data E&I_tot: fraction of total data whose true value is correctly restored Calculation d/(c+d) a/(a+b) (a+d)/(a+b+c+d) cs /c as /a (as+cs )/(a+c) (cs+d)/(c+d) as /(a+b) (as+cs+d)/(a+b+c+d) Each index defined in Table 1 ranges from 0 (no accuracy) to 1 (maximum accuracy). 5. Future developments Regarding future developments, the authors are planning to implement additional feature in the software in order to: (i) permit the automated iteration of the whole test process (Error simulation/E&I/Evaluation) for whatever number of times, in order to estimate mean values and variances of the defined indicators; (ii) carry out the evaluation of the quality of an E&I procedure also in terms of comparison between statistics computed from the true values and those computed by the edited values (estimates oriented evaluation); (iii) to allow the user to use at the same time the two error models MAR and MCAR so far introduced. REFERENCES Agresti A. (1990). Categorical Data Analysis. New York: John Wiley & Sons, Inc. Armitage P., Berry G. (1971) Statistical methods in medical research. Oxford, London, Edinburgh, Boston, Melbourne: Blackwell Scientific Publications. Granquist L. (1997) An overview of methods of evaluating data editing procedures, In Statistical Data Editing, Methods and Techniques, Vol. 2. Statistical Standard and Studies No 48, UN/ECE, 112-123. Hosmer D. W., Lemeshow S. (1989). Applied Logistic Regression. John Wiley & Sons, Inc. Luzi O., Della Rocca G. (1998) A Generalised Error Simulation System to Test the Performance of Editing Procedures, Proceedings of the SEUGI 16, Prague 9-12 June 1998. Rubin D. B. (1987) Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, Inc. Schafer J. L. (1997) Analysis of Incomplete Multivariate Data. Chapman & Hall. Stefanowicz B. (1997) Selected issues of data editing, In Statistical Data Editing, Methods and Techniques, Vol. 2. Statistical Standard and Studies No 48, UN/ECE, 109-111. Figure 2. Set up frame for perturbation parameters Figure 3. Evaluation Indices frame Figure 4. Set up parameters frame.