Outlier Treatment in HCSO
Present and future
Outline
• Outlier detection – types, editing, estimation
• Description of the current method
• Alternatives
• Future work
• Introduction of a new tool: R and RStudio
UNECE Statistical Data Editing 2014
Outlier detection and treatment
Purpose of outlier detection
• Estimation
• Representative outliers
• Non-representative outliers
• Treatment: decreasing weights, changing the values, using robust estimation
• Editing
• Identify errors
Source: MEMOBUST
Monthly Survey of Manufacturing
• Take-all part
• Survey part:
• fewer than 50 employees (and more than 5, because the smallest businesses are outside the scope of the survey)
• The sampling frame is based on the Register of Enterprises (~10 thousand units)
• The sampling ratio is about 15%
• Stratified sample (many NACE categories, employee-size categories, and two territorial strata: the capital and the rest of the country). (Telegdi 2004.)
Monthly Survey of Manufacturing: data
Distribution of some variables
• Skewed distribution
• Visible outliers
Current method of outlier detection
• The aim of the outlier treatment is to improve the estimates. (Csereháti 2004.)
• Steps of the method:
1) Computing the outlier indicators
2) Manual outlier detection by the methodologist/expert
3) Transfer of the results to the subject-matter statistician
4) Discussion of the results by the subject-matter statistician (modifications are possible); this resembles the process of selective editing
Outlier indicators
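• LNSQRT: main indicator
$$\mathrm{LNSQRT}_{ji} = \ln Y_{ji} \cdot \sqrt{\frac{\mathrm{STANDARD}_{ji}}{G_{\mathrm{crit},j}}}$$
• $G_{\mathrm{crit},j}$: Grubbs critical value
• $\mathrm{STANDARD}_{ji}$: standardized value of the variable
• SQUARED: identifying highest values
$$\mathrm{SQUARED}_{ji} = Y_{ji} \cdot \left(\frac{\mathrm{STANDARD}_{ji}}{G_{\mathrm{crit},j}}\right)^{2}$$
• MEANX is the ratio of the observed value of the unit to the weighted mean of the stratum without this unit's value.
$$\mathrm{MEANX}_{ji} = \frac{Y_{ji}}{\dfrac{\mathrm{MEAN}_j \cdot N_j - Y_{ji}}{N_j - 1}}$$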
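As an illustration of the MEANX definition, a minimal Python sketch might look as follows. The function name `meanx` and the example figures are hypothetical. The sketch assumes that MEAN_j · N_j is the estimated stratum total, so that subtracting the unit's value and dividing by N_j − 1 gives the mean of the rest of the stratum.

```python
def meanx(y_ji, mean_j, N_j):
    """Ratio of a unit's value to the stratum mean computed without that unit."""
    if N_j < 2:
        raise ValueError("the stratum must contain at least two units")
    # Stratum total minus the unit's own value, averaged over the other units.
    mean_without_unit = (mean_j * N_j - y_ji) / (N_j - 1)
    return y_ji / mean_without_unit


# Hypothetical stratum: 100 units with mean 50, one unit reporting 1000.
ratio = meanx(1000.0, 50.0, 100)
print(ratio)
```

A ratio far above 1 means the unit is much larger than the rest of its stratum; what counts as "far" is left to the expert in the current method.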
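$$\mathrm{VALOUT}_{ji} = \frac{N_j - n_j}{n_j - 1}\left(Y_{ji} - \frac{\mathrm{MEAN}_j \cdot N_j}{n_j}\right)$$
$$\mathrm{PVALOUT}_{ji} = \frac{\mathrm{VALOUT}_{ji}}{N_j \cdot \mathrm{MEAN}_j} \cdot n_j^{2}$$
• The VALOUT indicator shows the difference between the estimates of the total with and without the given value in a given stratum.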
The main indicator: LNSQRT
Outlier treatment
• Weight trimming: weights of the outliers are changed to 1
• Number of outliers: on average 2% of the cases
• Change in the estimates:
• Mean: -15% (on average)
• Variance: serious decrease
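The effect of weight trimming on a weighted total can be sketched in Python as below. This is only an illustration: the function names and the data are hypothetical, and the sketch assumes the weights of the non-outlying units are left unchanged, which the slides do not specify.

```python
def weighted_total(values, weights):
    """Weighted estimate of the total: sum of value * weight."""
    return sum(v * w for v, w in zip(values, weights))


def trim_weights(weights, outlier_flags):
    """Set the weight of every flagged unit to 1; keep the other weights."""
    return [1.0 if flag else w for w, flag in zip(weights, outlier_flags)]


# Hypothetical stratum: four sampled units, each with design weight 6.
values = [10.0, 12.0, 11.0, 400.0]
weights = [6.0, 6.0, 6.0, 6.0]
flags = [False, False, False, True]  # the last unit is treated as an outlier

before = weighted_total(values, weights)
after = weighted_total(values, trim_weights(weights, flags))
print(before, after)
```

With a weight of 1 the outlying unit represents only itself, so its large value is no longer multiplied up to the population level.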
Alternative methods
• One dimensional methods
• Median absolute deviation
• Custom indicator: share in total
• Quantile
Disadvantage: must be applied separately to each of many variables
• Multidimensional method:
• Mahalanobis distance based outlier detection
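Of the one-dimensional alternatives listed above, the median absolute deviation (MAD) rule could be sketched in Python as follows. The function name `mad_outliers`, the default k = 3 and the data are assumptions made for illustration. The sketch also omits the consistency constant (about 1.4826) that is often applied to the MAD.

```python
from statistics import median


def mad_outliers(values, k=3.0):
    """Flag values lying more than k MADs away from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical: the rule cannot scale.
        return [False] * len(values)
    return [abs(v - med) / mad > k for v in values]


# Hypothetical values of one variable within a stratum.
values = [10, 12, 11, 13, 12, 95]
flags = mad_outliers(values)
print(flags)
```

The number of flagged units depends strongly on k, so the threshold has to be tuned for each variable.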
Share in total, a custom indicator
$$\mathrm{OUT}_i = \left(1 - \frac{1}{n_j}\right) \cdot \frac{x_{ij}}{\sum_{i=1}^{n_j} x_{ij}}$$
• Considers the individual value and the size of the stratum in the same formula
• Inspired by the current indicators
• The possible outlier:
• accounts for a considerably large share of the total
• in a big stratum
• The indicator is computed for each stratum
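A minimal Python sketch of the share-in-total indicator follows: each unit's share of its stratum total is scaled by (1 − 1/n_j), so the same share counts for more in a bigger stratum. The function name `share_in_total` and the data are hypothetical; the cut-off of 0.5 is the threshold value reported in the results.

```python
def share_in_total(stratum_values):
    """OUT indicator for every unit of one stratum."""
    n = len(stratum_values)
    total = sum(stratum_values)
    if n == 0 or total == 0:
        raise ValueError("the stratum must be non-empty with a non-zero total")
    size_factor = 1 - 1 / n  # approaches 1 as the stratum grows
    return [size_factor * x / total for x in stratum_values]


# Hypothetical stratum of four units; one unit holds 80% of the total.
scores = share_in_total([5.0, 5.0, 10.0, 80.0])
flags = [s > 0.5 for s in scores]
print(scores, flags)
```

In a stratum with a single unit the factor is 0, so a lone unit is never flagged even though it holds the whole total.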
Results
• Quantile method
• Threshold 99%
• The method identifies almost the same outliers as the current one.
• Easy to implement
• MAD
• Problem of choosing k (the threshold)
• Too many cases were selected
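The quantile rule with a 99% threshold could be sketched in Python as below. The sketch assumes that only the upper tail is of interest, since the variables are right-skewed, and it uses the inclusive interpolation of `statistics.quantiles`. Both choices, the function name `quantile_outliers` and the data are assumptions not stated in the slides.

```python
from statistics import quantiles


def quantile_outliers(values, percent=99):
    """Flag values above the given percentile of the data."""
    # n=100 returns the 99 cut points between percentiles; index 98 is the 99th.
    cuts = quantiles(values, n=100, method="inclusive")
    threshold = cuts[percent - 1]
    return [v > threshold for v in values]


# Hypothetical variable with 200 observations: 1, 2, ..., 200.
values = list(range(1, 201))
flags = quantile_outliers(values)
print(sum(flags))
```

By construction the rule flags roughly the top 1% of the observations, whether or not they are extreme.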
Results (2)
• Share in total
• Threshold value: 0.5
• Smaller number of outliers
• Mahalanobis distance
• We used the robust Mahalanobis distance
• 3 key variables (Total revenue etc.)
• These are not involved in the current method
• Avoiding missing values
• Similar results (2/3 of the current outliers are detected)
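To show the idea behind distance-based detection, here is a simplified Python sketch. It is not the method used above: it computes the classical (non-robust) Mahalanobis distance and handles only two variables, so that the covariance matrix can be inverted by hand. A robust version would replace the mean and covariance with robust estimates. The function name and the data are hypothetical.

```python
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared classical Mahalanobis distance of each (x, y) record."""
    n = len(points)
    mx = mean([p[0] for p in points])
    my = mean([p[1] for p in points])
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("singular covariance matrix")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) times the inverse 2x2 covariance matrix times (dx, dy).
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical records: the last one breaks the pattern of the others.
points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8), (3, 9)]
d2 = mahalanobis_sq_2d(points)
suspect = d2.index(max(d2))
print(suspect)
```

The last record is unremarkable on the first variable alone, but it stands out once the relationship between the two variables is taken into account.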
Future plans
• Development of methodology:
– More analysis of the effect on estimates
– Winsorization
• Development of the process
– Automation and reproducibility
– More informative report on the process, to help better understand and analyse the process steps
Experimental tools
• Outlier treatment is separate from the other steps of data processing; it belongs to methodology
• Possible new tool: R (with RStudio)
• Advantage: ease of development
• Ready-to-use functions for outlier detection
• Disadvantage: needs an "expert" user; not a commonly used tool
Thank you for your attention!