Topic (vi): New and Emerging
Methods
Topic organizer: Maria Garcia (USA)
UNECE Work Session on Statistical Data Editing
Oslo, Norway, 24-26 September 2012
Topic (vi): Overview
• Papers under this topic present new ideas and
advancements in the development of methods
and techniques for solving, improving, and
optimizing the editing and imputation of data.
• Contributions cover:
– Probability editing
– Machine learning methods
– Model-based imputation methods
– Automatic editing of numerical data
Topic (vi): Overview
• Probability editing
WP.36 (Sweden)
– Select a subset of units to edit using a
probability sampling framework.
• Machine learning methods
WP.37 (EUROSTAT)
– Imputation of categorical data using a
neural network classifier and a Bayesian
network classifier
Topic (vi): Overview
• Model-based imputation
WP.38 (Slovenia)
– Bayesian model + linear regression
– Multiple imputation
• Automatic editing
WP.39 (Netherlands)
– Ensure hard edits are satisfied while
incorporating information from soft edits into
the editing/imputation process.
Topic (vi): New and Emerging
Methods
Enjoy the presentations!
Topic (vi): New and Emerging
Methods
Summary of main developments
and points for discussion
Topic (vi): Summary
• WP.36 – Probability Editing (Sweden)
– Proposes selecting units for editing using a
traditional probability sampling framework in
which only a fraction of the data is edited
– Applies to all types of data
– Addresses the statistical properties of the
estimators
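As a rough illustration of the idea (the unit scores, field names, and the Poisson-sampling variant below are assumptions for illustration, not taken from WP.36), selecting units to edit under a probability sampling framework might be sketched as:

```python
import random

def select_units_to_edit(units, budget, seed=0):
    """Poisson (probability-proportional-to-size) selection of units to edit.

    Each unit carries a hypothetical 'score' reflecting its anticipated
    impact on the estimates; inclusion probabilities are proportional to
    the score, capped at 1, and scaled so the expected sample size is
    roughly `budget`.
    """
    rng = random.Random(seed)
    total = sum(u["score"] for u in units)
    sample = []
    for u in units:
        pi = min(1.0, budget * u["score"] / total)  # inclusion probability
        if rng.random() < pi:
            sample.append((u["id"], pi))  # keep pi for design-based weighting
    return sample

# Toy data: the unit with the dominant score gets pi capped at 1,
# so it is always selected.
units = [{"id": i, "score": s} for i, s in enumerate([1, 2, 5, 10, 40, 2])]
selected = select_units_to_edit(units, budget=3)
```

Retaining the inclusion probabilities alongside the selected units is what makes design-based statements about the estimator and its variance possible, which is the statistical question the paper takes up.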
Topic (vi): Summary
• WP.37 – Use of machine learning methods to
impute categorical data (EUROSTAT)
– A supervised neural network classifier extended
to handle mixed numerical and categorical data.
– A Bayesian network classifier.
– Compares machine learning results with those
obtained from logistic regression and multiple
imputation.
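A minimal sketch of classifier-based imputation of a categorical field, using a naive Bayes classifier as a simple stand-in for the neural network and Bayesian network classifiers of WP.37 (the records and field names below are invented for illustration):

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, target):
    """Count class priors and per-(feature, value) class counts from complete records."""
    priors = Counter(r[target] for r in records)
    cond = defaultdict(Counter)
    for r in records:
        for f, v in r.items():
            if f != target:
                cond[(f, v)][r[target]] += 1
    return priors, cond

def impute(record, target, priors, cond):
    """Fill the missing `target` with the most probable class given the other fields."""
    n = sum(priors.values())
    best, best_score = None, float("-inf")
    for c, pc in priors.items():
        score = pc / n
        for f, v in record.items():
            if f != target:
                # add-one smoothing; 2 is a crude stand-in for the value count
                score *= (cond[(f, v)][c] + 1) / (pc + 2)
        if score > best_score:
            best, best_score = c, score
    return best

# Invented toy records: employment status is missing in the query record.
records = [
    {"sex": "F", "age": "young", "emp": "yes"},
    {"sex": "F", "age": "young", "emp": "yes"},
    {"sex": "M", "age": "old", "emp": "no"},
    {"sex": "M", "age": "old", "emp": "no"},
]
priors, cond = train_naive_bayes(records, "emp")
imputed = impute({"sex": "F", "age": "young"}, "emp", priors, cond)
```

Any classifier trained on complete records can be slotted into this pattern; the discussion point about preserving the original distribution of the variables applies regardless of which classifier is used.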
Topic (vi): Summary
• WP.38 – Implementation of the Bayesian
approach to imputation at SORS (Slovenia)
– A Bayesian model and linear regression combined
into a method for imputing annual gross income in a
household survey.
– Solves the imputation problem within separate data
groups, with different levels of the auxiliary
variables in each group.
– Multiple imputation.
Topic (vi): Summary
• WP.39 – Automatic editing with hard and
soft edits (Netherlands)
– The error localization problem involving both hard
(inconsistency) and soft (query) edits.
– Minimizing the number of fields to impute while
incorporating the cost associated with failed
query edits.
– Associated software is written in R and uses an
existing R package.
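A toy sketch of weighted error localization with soft-edit penalties (the edit representation, feasibility rule, and weights below are simplifications invented for illustration, not the WP.39 algorithm):

```python
from itertools import combinations

def localize_errors(record, hard_edits, soft_edits, weights, soft_penalty=0.5):
    """Pick the cheapest field set to impute (brute-force sketch).

    Each edit is a (predicate, involved_fields) pair. A field set is
    treated as feasible when every failing hard edit involves at least
    one chosen field (a simplification of full error localization).
    The objective sums the reliability weights of the chosen fields,
    plus a penalty per soft (query) edit that fails and is left untouched.
    """
    failing_hard = [set(f) for p, f in hard_edits if not p(record)]
    failing_soft = [set(f) for p, f in soft_edits if not p(record)]
    best, best_cost = None, float("inf")
    for r in range(len(record) + 1):
        for subset in combinations(record, r):
            chosen = set(subset)
            if not all(chosen & f for f in failing_hard):
                continue  # some hard edit cannot be repaired by these fields
            cost = sum(weights[f] for f in chosen)
            cost += soft_penalty * sum(1 for f in failing_soft if not chosen & f)
            if cost < best_cost:
                best, best_cost = chosen, cost
    return best, best_cost
```

A real implementation would use integer programming or branch-and-bound rather than this exponential subset search; the sketch only shows how soft-edit costs enter the objective, which is the source of the added computational burden raised in the discussion points.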
Topic (vi): Points for discussion
In probability editing, a sampling design is used to
select units for editing. If the two-step approach is not
used, important units may not be in the sample, and
errors (possibly large) may remain in the data file.
What are the implications?
How will this affect the estimator and the variance of the
estimator?
How do errors still remaining in the data affect analyses,
particularly if the data are used for other purposes not
envisioned in the original survey design?
Topic (vi): Points for discussion
The paper from EUROSTAT deals with the use of
machine learning methods for imputing missing data:
• The method uses either a Bayesian network classifier or a
neural network classifier. What is the effect of using
these methods for imputing missing values on the
original distribution of the variables?
• Choosing the appropriate method for handling
imputation of missing values is a challenging problem.
For what kind of surveys are these methods suitable?
Topic (vi): Points for discussion
• What are the experiences at other agencies when
incorporating machine learning methods into their
imputation menu (e.g., at BOC - B. Winkler 2009,
2010)?
Topic (vi): Points for discussion
The paper from Slovenia implements the
imputation model using an available procedure
within the SAS language, PROC MCMC:
• For some complex imputation problems (e.g., a large
number of variables, different types of variables,
large data files) the default settings within
commercial software procedures may be
inappropriate. What type of diagnostics, graphical
or analytical, should be examined to ensure the
procedure is working properly?
Topic (vi): Points for discussion
How can we address the situation in which a
large proportion of the missing fields occur
within a particular data group?
Or the situation in which a particular data
group is too small to fit the model?
Topic (vi): Points for discussion
Incorporating soft or query edits into the error
localization problem increases the number of
variables and edits of this computationally intensive
optimization problem:
• How do we approach the added complexity? Is this new
approach computationally feasible?
• Does adding the information from the soft (query) edits
lead to a reduction of the time and resources spent on
analysts’ review (trade-off)?
• What are the effects on data quality of adding more edits
and thus changing more fields?
New and Emerging Methods:
Closing Remarks
Closing Remarks
What happened to error localization?
Should we be focusing more on imputation
than on editing (again: error localization)?
Do we foresee multiple imputation becoming
the “standard” at most NSIs:
– to smooth variability of single imputations?
– to estimate variance due to imputation?
– issue of non-response bias vs. variance
estimation?