Assessing the Uncertainty of Point Estimates
We notice that, in general, estimates face uncertainty due to several sources
of errors. An (incomplete) list of possible sources of errors includes:
1. Sampling variability.
2. Overall data quality: outliers, gross errors, asymmetric errors, etc.
3. Incomplete data: missing values.
4. Duplications, inconsistencies, and other problems that occur when merging different databases.
Statisticians are mainly concerned with item 1 and tend to pay little attention to items 2-4. Yet items 2-4 are an integral part of Statistical Science: good statisticians should be capable of discovering problems in their datasets and of providing sensible solutions. A standard line of defense is the use of diagnostic tools to flag outliers and high leverage points.
Unfortunately, diagnostic tools are:
1. Unreliable.
2. Very tedious to apply.
To appreciate the second point, imagine a situation with 25 variables: one faces 25 × 24/2 = 300 pairwise plots, (23 × 24 × 25)/(2 × 3) = 2300 three-dimensional plots, and tens (or hundreds) of models to fit in order to perform stepwise variable selection, while also having to worry about possible problems that only appear in higher dimensional representations of the dataset.
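These counts are just binomial coefficients; a minimal check (in Python, our choice of language for the sketches in this section):

```python
from math import comb

# Number of 2-D and 3-D scatter plots needed to visually screen
# a dataset with p = 25 variables.
p = 25
print(comb(p, 2))  # 300 pairwise plots
print(comb(p, 3))  # 2300 three-dimensional plots
```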
We will now briefly discuss points 1 and 2 above. The discussion of 3 and 4
will be left for the next two sections.
Sampling Variability
Loosely speaking, sampling variability is the uncertainty in estimates (and
other statistical procedures) due to the occurrence of “nice” measurement errors
(usually represented by normal random variables) and limited sample sizes.
Sampling uncertainty is measured by standard errors which are usually of the
form
SE = Constant/√n,
where n is the sample size. Standard errors (and hence sampling uncertainty) can be reduced by increasing the number of observations and, in principle, can be made as small as desired by choosing an appropriate sample size. Statisticians have been very effective in explaining sampling variability to the scientific community, and we can now say that sampling variability is widely recognized and acknowledged, to the extent that some users of Statistics think that statisticians are experts in the calculation of sample sizes. We statisticians should, however, pay more attention to other sources of estimation errors and make these issues an integral part of our discipline.
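To see the 1/√n behavior concretely, here is a minimal Monte Carlo sketch (the sample sizes and the standard normal data are illustrative choices; for N(0, 1) data the Constant is simply 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# The standard deviation of the sample average over many replications
# approximates its standard error and shrinks like 1/sqrt(n).
for n in (100, 400, 1600):
    means = rng.standard_normal((5000, n)).mean(axis=1)
    print(n, round(means.std(), 4), round(1 / np.sqrt(n), 4))
```

Each fourfold increase in n roughly halves the empirical standard error, exactly the Constant/√n law.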
Overall Data Quality
Lacking a better name, we will use the term “bias uncertainty” to refer to estimation errors caused by uneven data quality. Regarding bias uncertainty there are three important considerations:
(a) how can bias uncertainty be measured?
(b) how can bias uncertainty be kept in check?
(c) how can bias uncertainty be formally incorporated in inferential procedures?
To answer the question in (a) we notice that, unlike sampling uncertainty,
bias uncertainty cannot be easily measured by readily available quantities such
as standard errors. One possible way of measuring bias uncertainty would be
to calculate “worst case biases” (maxbiases). Maxbiases would give an idea of
how bad things could get for a given fraction of contamination. From this point
of view there is a strong reason to use robust statistical procedures with finite
(and relatively small) maxbiases as opposed to classical LS procedures that have
infinite maxbiases. Maxbias formulas are currently available for certain types of robust estimates and models, but they are neither easily computable nor widely enough known to encourage routine usage. Obviously, more research is needed in this direction.
To answer the question in (b) we notice that, unlike sampling uncertainty,
bias uncertainty cannot be simply reduced by increasing sample sizes. As illustrated before, naive increases of sample sizes could make a bad situation even
worse! An obvious way of reducing bias uncertainty is to improve data quality, but this is usually very expensive, if at all possible. An alternative (cheaper) approach is to use robust statistical procedures, which can filter out most of the effect of outliers, gross errors, and other asymmetric measurement errors.
To accomplish the goal set in (c), we should use confidence intervals that have a pre-established confidence level not only at the central parametric model but also at all the distributions in the enlarged robust model. Such confidence intervals must account for the possible bias effect of contamination in addition to the usual sampling variability. Unfortunately, the resulting confidence intervals will be longer. To keep practical relevance, the interval length should remain sufficiently small. The increase in length due to bias uncertainty will be smaller for more robust estimates. This highlights the need for “super robust” point estimates with known small maxbiases and (non-parametrically estimable) standard errors.
Toward a Global Robust Statistical Analysis
As in previous sections, to introduce the main ideas, we will first consider
the simple location-dispersion model
yi = θ + σεi,
where
εi ∼ (1 − 0.05)N(0, 1) + 0.05H,
and where H is unknown and unspecified. To fix ideas, let’s suppose that
H = N(0.5, 0.1).
Classical statisticians are likely to ignore the possibility of contamination in the data - for instance, the occurrence of a small fraction of asymmetric errors - and prefer estimates with small standard errors. In the location-dispersion case, the classical choices would be
ȳ = Sample Average = (1/n) Σ yi,
σ̂1 = Sample Standard Deviation = √[(1/(n − 1)) Σ (yi − ȳ)²].
We think that a more reasonable criterion for choosing estimates would be to achieve a large probability of small estimation error. Mathematically:
P(|θ̂ − θ| < d) = Large,   (1)
where d is some specified small number.
A complementary (but not identical) criterion would be to use estimates with a small probability of large estimation error. Mathematically,
P(|θ̂ − θ| > D) = Small,   (2)
where D is some relatively large value. We will show that (1) and (2) can be better achieved by using
θ̂ = Sample Median,
σ̂1 = MAD = Median|yi − θ̂| / Φ⁻¹(3/4).
Perhaps more importantly, we will show that in the case of the median
and MAD the worst case biases are rather small and can be measured quite
accurately.
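Here is a minimal sketch of the two pairs of estimates on a single contaminated sample (the seed and sample size are arbitrary; we take H = N(0.5, 0.1) as above, reading 0.1 as the standard deviation):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100

# 95% standard normal core plus 5% contamination from H = N(0.5, 0.1).
bad = rng.random(n) < 0.05
y = np.where(bad, rng.normal(0.5, 0.1, n), rng.standard_normal(n))

theta_hat = np.median(y)                                  # sample median
mad = np.median(np.abs(y - theta_hat)) / norm.ppf(0.75)   # normalized MAD
print(y.mean(), y.std(ddof=1))  # classical: average and standard deviation
print(theta_hat, mad)           # robust: median and MAD
```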
Table 10.1 Probability of small error, P(|ȳ − θ| < d), for the average. 5% contaminated standard normal; normal contamination with mean 4 and standard deviation 0.1.

d       n = 100   n = 200   n = 500
0.05    0.060     0.017     0.001
0.10    0.146     0.058     0.016
0.15    0.306     0.217     0.136

Table 10.2 Probability of small error, P(|θ̂ − θ| < d), for the median. 5% contaminated standard normal; normal contamination with mean 4 and standard deviation 0.1.

d       n = 100   n = 200   n = 500
0.05    0.264     0.321     0.353
0.10    0.496     0.647     0.721
0.15    0.681     0.832     0.936
Figures 1 and 2 show that the histograms of both the average and the median estimation errors are shifted to the right as a result of the contamination.
Figure 1: One thousand averages from contaminated samples. 95% of the data are standard normal and 5% are normal with mean 4 and standard deviation 0.1. Panels: (a) n = 100, (b) n = 200, (c) n = 500; horizontal axis: estimation error.
Figure 2: One thousand medians from contaminated samples. 95% of the data are standard normal and 5% are normal with mean 4 and standard deviation 0.1. Panels: (a) n = 100, (b) n = 200, (c) n = 500; horizontal axis: estimation error.
Notice that the effect of the contamination is much larger in the case of the mean, and so the probabilities of “small errors” for the average are rather small.
An interesting phenomenon here is that the performance of the average worsens as the sample size grows; the median will eventually suffer the same fate for any d smaller than its asymptotic bias (0.066 here, cf. Table 10.3). In other words, larger samples of poor quality do not help the estimation process much and can even decrease the probability of small estimation error!
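A minimal Monte Carlo sketch reproducing Tables 10.1 and 10.2 (assuming, as the tabulated values suggest, that exactly 5% of each sample is contaminated; results match the tables up to simulation noise):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 1000  # replications, as in Figures 1 and 2

# Exactly 5% of each sample is drawn from N(4, sd = 0.1), the rest is
# standard normal; the true location is theta = 0.
for n in (100, 200, 500):
    k = n // 20
    y = rng.standard_normal((reps, n))
    y[:, :k] = rng.normal(4.0, 0.1, (reps, k))
    for d in (0.05, 0.10, 0.15):
        p_avg = np.mean(np.abs(y.mean(axis=1)) < d)        # cf. Table 10.1
        p_med = np.mean(np.abs(np.median(y, axis=1)) < d)  # cf. Table 10.2
        print(n, d, p_avg, p_med)
```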
Maxbias Bound for the Median
Suppose we have a large sample from a distribution F containing at most a fraction ε (that is, 100ε%) of contamination. That is, F belongs to the contamination neighborhood
Fε = {F : F = (1 − ε)F0 + εH}.
Suppose we wish to bound the absolute difference between the median, M(F), of the contaminated distribution F and the median, M(F0), of the core (uncontaminated) distribution:
D(F) = |M(F) − M(F0)|.
Huber (1964) showed that
|M(F) − M(F0)|/σ0 ≤ F0⁻¹(1/(2(1 − ε))) = B(ε),   (3)
and therefore
D(F) ≤ σ0 B(ε) for all F ∈ Fε.   (4)
That is, D(F) is bounded by σ0 B(ε). Unfortunately, in practice σ0 is seldom known and must be estimated by a robust scale functional S(F), e.g., the MAD. But the quantity
K̃(ε) = S(F)B(ε)
is not an upper bound for D(F) because in some situations S(F) might underestimate σ0. For instance, if
F = 0.90 N(0, 1) + 0.10 δ0.15,   (5)
then (see Problem ??)
Median(F) = 0.1397,   (6)
MAD(F) = 0.8818,   (7)
and then
|M(F) − M(F0)| = 0.1397 > MAD(F)B(0.10) = 0.8818 × 0.1397 = 0.1232.
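The values in (6) and (7) are easy to check numerically; a minimal sketch, solving for the median and the normalized MAD of the mixture (5) with a root finder:

```python
from scipy.optimize import brentq
from scipy.stats import norm

eps, x0 = 0.10, 0.15   # F = 0.90 N(0, 1) + 0.10 * (point mass at 0.15)

def cdf(x):
    # CDF of the contaminated distribution (5).
    return 0.90 * norm.cdf(x) + 0.10 * (x >= x0)

med = brentq(lambda x: cdf(x) - 0.5, -5.0, x0 - 1e-12)            # (6)
half = brentq(lambda t: cdf(med + t) - cdf(med - t) - 0.5, 1e-6, 5.0)
mad = half / norm.ppf(0.75)                                       # (7)
B = norm.ppf(1 / (2 * (1 - eps)))                                 # B(0.10)
print(med, mad, mad * B)   # approx 0.1397, 0.8818, 0.1232
```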
Definition 1 A quantity K(ε) such that S(F)K(ε) is an upper bound for D(F) is called a bias bound.
In previous sections we derived formulas for the implosion and explosion biases of the dispersion functional:
S_S⁺(ε) = sup_{F ∈ Fε} S(F)/σ0,   (8)
S_S⁻(ε) = inf_{F ∈ Fε} S(F)/σ0.   (9)
Using (4), (8) and (9) we obtain
D(F) ≤ σ0 BM(ε) = S(F) (σ0/S(F)) BM(ε) ≤ S(F) BM(ε)/S_S⁻(ε),
and so
K(ε) = BM(ε)/S_S⁻(ε)
is an example of a bias bound.
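For the median with S = MAD and F0 = Φ these quantities are explicit: BM(ε) = Φ⁻¹(1/(2(1 − ε))) by (3), and (assuming the implosion formula from the earlier sections, obtained by placing all the contamination at the center) S_S⁻(ε) = Φ⁻¹((1.5 − 2ε)/(2(1 − ε)))/Φ⁻¹(3/4). A minimal sketch reproducing the BM(ε) and K(ε) columns of Table 10.3 below:

```python
from scipy.stats import norm

c = norm.ppf(0.75)  # MAD normalization constant

for eps in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    B = norm.ppf(1 / (2 * (1 - eps)))      # median maxbias BM(eps), eq. (3)
    # MAD implosion bias: assumed formula, point mass at the center.
    s_minus = norm.ppf((1.5 - 2 * eps) / (2 * (1 - eps))) / c
    print(f"{eps:.2f}  BM = {B:.3f}  K = {B / s_minus:.3f}")
```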
A refinement (of practical value when ε > 0.2) is provided by the following lemma.
Lemma 2 Let M(F) be an equivariant location functional with maxbias function BM(ε) and breakdown point equal to 1/2. Let S(F) be a dispersion M-functional with score function χ which is even, bounded, monotone on [0, ∞), continuous at 0 with
0 = χ(0) < χ(∞) = 1,
and with at most a finite number of discontinuities. Suppose that S(F) has breakdown point 1/2, that is,
E_F0{χ(X)} = 1/2.
Let γ(t) be defined as the unique solution of
(1 − ε) E_F0 χ[(X − t)/γ(t)] = 1/2.
Then
K1(ε) = sup_{|t| ≤ BM(ε)} |t|/γ(t)
is a bias bound for M(F).
Proof: Since M is location equivariant, we can assume without loss of generality that θ0 = 0 and σ0 = 1. Let
Fε,t = {F ∈ Fε : M(F) = t}.
Notice that
|M(F)| ≤ BM(ε) for all F ∈ Fε,
and therefore
Fε = ∪_{|t| ≤ BM(ε)} Fε,t.
We have that
sup_{F ∈ Fε} |M(F)/S(F)| = sup_{|t| ≤ BM(ε)} sup_{F ∈ Fε,t} |t|/S(F).   (10)
Now, for each
F = (1 − ε)F0 + εF̃ ∈ Fε,t,
it holds that
S(F) = sup{s > 0 : (1 − ε) E_F0 χ[(X − t)/s] + ε E_F̃ χ[(X − t)/s] > 1/2},
and therefore
S(F) ≥ γ(t).
This fact, together with (10), proves the result.
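For the median and the MAD the lemma is easy to evaluate: the MAD has score function χ(u) = 1{|u| > c} with c = Φ⁻¹(3/4), so E_F0 χ[(X − t)/γ] = 1 − Φ(t + cγ) + Φ(t − cγ). A minimal numerical sketch of K1(ε), taking the supremum over a grid of t; the values should agree with Table 10.3 below up to rounding:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

c = norm.ppf(0.75)  # MAD score cutoff, chi(u) = 1{|u| > c}

def gamma(t, eps):
    # Unique solution of (1 - eps) * E_F0 chi[(X - t)/gamma] = 1/2.
    g = lambda s: (1 - eps) * (1 - norm.cdf(t + c * s)
                               + norm.cdf(t - c * s)) - 0.5
    return brentq(g, 1e-9, 10.0)

for eps in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    BM = norm.ppf(1 / (2 * (1 - eps)))              # median maxbias
    K1 = max(t / gamma(t, eps)
             for t in np.linspace(1e-6, BM, 200))   # bias bound of Lemma 2
    print(f"{eps:.2f}  K1 = {K1:.3f}")
```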
Table 10.3 below gives the values of BM(ε), K(ε) and K1(ε) for the median when S = MAD, for several values of ε and for F0 = Φ, the standard normal distribution. Notice that K(ε) and K1(ε) are larger than BM(ε) because they take into account the possible underestimation of σ0.
Table 10.3. Maxbias and bias bounds for the median (S = MAD) when F0 is the standard normal distribution.

ε        0.05    0.10    0.15    0.20    0.25    0.30
BM(ε)    0.066   0.140   0.223   0.319   0.431   0.566
K1(ε)    0.070   0.159   0.271   0.417   0.614   0.889
K(ε)     0.070   0.160   0.278   0.440   0.675   1.043
It is clear from the discussion above that the link between the maxbias curves
and bias bounds is not totally trivial in the location model, and it is even more
involved in the regression setup.
In the case of regression estimates we face a more challenging situation because maxbias curves for regression estimates are derived using a normalized
distance (quadratic form) between the asymptotic and the true values of the regression coefficients. The normalization is based on a certain unknown scatter
matrix of the regressors.
Ideally, bias bounds for robust estimates of the regression coefficients should
be reported together with the point estimates and their standard errors. So far
this approach has been hindered by the fact that the maximum bias depends on
the joint distribution of the regressors and the available maxbias formulas rely
on unrealistic assumptions (regression-through-the-origin model and elliptical
regressors). Our results (see Theorem ??) lift these theoretical hindrances and
open the ground for the possible computation of bias bounds using data-based
estimates of the regressors’ distribution, in the case of robust regression estimates satisfying equation (??) below. Similar results for other classes of robust
estimates, e.g., one-step Newton-Raphson estimates (Simpson et al. (1992)),
projection estimates (Maronna and Yohai (1993)) and maximum depth estimates (Rousseeuw ???) would be desirable.