Assessing the Uncertainty of Point Estimates

We notice that, in general, estimates face uncertainty due to several sources of errors. An (incomplete) list of possible sources of errors includes:

1. Sampling variability.
2. Overall data quality: outliers, gross errors, asymmetric errors, etc.
3. Incomplete data: missing values.
4. Duplications, inconsistencies, and other problems occurring when merging different databases.

Statisticians are mainly concerned with 1 and pay little attention to 2-4. But 2-4 are an integral part of Statistical Science: good statisticians should be capable of discovering problems in their datasets and hence of providing sensible solutions. There are diagnostic tools that can be used to flag outliers and high leverage points. Unfortunately, these diagnostic tools are:

1. Unreliable.
2. Very tedious. Imagine a situation with 25 variables:
   - 25 × 24 / 2 = 300 pairwise plots;
   - (23 × 24 × 25) / (2 × 3) = 2300 three-dimensional plots;
   - tens (or hundreds) of models fit to perform stepwise variable selection;
   - and the worry about possible problems that only appear in higher dimensional representations of the dataset.

We will now briefly discuss points 1 and 2 above. The discussion of 3 and 4 will be left for the next two sections.

Sampling Variability

Loosely speaking, sampling variability is the uncertainty in estimates (and other statistical procedures) due to the occurrence of "nice" measurement errors (usually represented by normal random variables) and limited sample sizes. Sampling uncertainty is measured by standard errors, which are usually of the form

SE = Constant / √n,

where n is the sample size. Standard errors (and hence sampling uncertainty) can be reduced by increasing the number of observations and, in principle, can be made as small as desired by choosing an appropriate sample size. Statisticians have been very effective in explaining sampling variability to the scientific community, and we can now say that sampling variability is widely recognized and acknowledged, to the extent that some users of Statistics think that statisticians are experts on the calculation of sample sizes. We statisticians should, however, pay more attention to other sources of estimation errors and make these issues an integral part of our discipline.

Overall Data Quality

Lacking a better name, we will use the term "bias uncertainty" to refer to estimation errors caused by uneven data quality. Regarding bias uncertainty there are three important considerations:

(a) How can bias uncertainty be measured?
(b) How can bias uncertainty be kept in check?
(c) How can bias uncertainty be formally incorporated into inferential procedures?

To answer the question in (a) we notice that, unlike sampling uncertainty, bias uncertainty cannot be easily measured by readily available quantities such as standard errors. One possible way of measuring bias uncertainty would be to calculate "worst case biases" (maxbiases). Maxbiases would give an idea of how bad things could get for a given fraction of contamination. From this point of view there is a strong reason to use robust statistical procedures with finite (and relatively small) maxbiases, as opposed to classical LS procedures, which have infinite maxbiases. Maxbias formulas are currently available for certain types of robust estimates and models, but they are not easily computable and widely known enough to encourage routine usage. Obviously more research is needed in this direction.
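To make the notion of a maxbias concrete, here is a minimal sketch (our illustration, not part of the source; the use of Python and the chosen contamination fractions are assumptions) that evaluates the worst-case asymptotic bias of the median under ε-contamination of a standard normal core model, using Huber's (1964) formula B(ε) = Φ⁻¹(1/(2(1 − ε))) derived later in this section.

```python
# A minimal sketch: worst-case asymptotic bias (maxbias) of the median under
# epsilon-contamination of a standard normal core model, using Huber's (1964)
# formula B(eps) = Phi^{-1}(1 / (2 (1 - eps))) derived later in this section.
from scipy.stats import norm

def median_maxbias(eps):
    """Worst-case asymptotic bias of the median when a fraction eps of the
    data may come from an arbitrary contaminating distribution H."""
    return norm.ppf(1.0 / (2.0 * (1.0 - eps)))

for eps in (0.05, 0.10, 0.20):  # illustrative contamination fractions
    print(f"eps = {eps:.2f}:  B(eps) = {median_maxbias(eps):.3f}")

# The sample average, by contrast, has infinite maxbias: a single arbitrarily
# large observation can carry it arbitrarily far from the true value.
```

For ε = 0.05 this gives B(ε) ≈ 0.066: even in the worst case, the median of a 5% contaminated standard normal sample cannot be pulled more than about 0.066 σ0 away from the true center.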
To answer the question in (b) we notice that, unlike sampling uncertainty, bias uncertainty cannot simply be reduced by increasing sample sizes. As illustrated before, naive increases of sample size could make a bad situation even worse! An obvious way of reducing bias uncertainty is to increase data quality, but this is usually very expensive, when at all possible. An alternative (cheaper) approach is to use robust statistical procedures, which can filter out most of the effect of outliers, gross errors and other asymmetric measurement errors.

To accomplish the goal set in question (c), we should use confidence intervals that have a pre-established confidence level not only at the central parametric model but also at all the distributions in the enlarged robust model. Such confidence intervals must account for the possible bias effect of contamination in addition to the usual sampling variability. Unfortunately, the resulting confidence intervals will be of larger length. In order to keep practical relevance, the interval length should be sufficiently small. The increase in length due to bias uncertainty will be smaller for more robust estimates. This highlights the need for "super robust" point estimates with known small maxbiases and (non-parametrically estimable) standard errors.

Toward a Global Robust Statistical Analysis

As in previous sections, to introduce the main ideas, we will first consider the simple location-dispersion model

yᵢ = θ + σεᵢ, where εᵢ ∼ (1 − 0.05) N(0, 1) + 0.05 H,

and where H is unknown and unspecified. To fix ideas, let us suppose that H = N(0.5, 0.1). Classical statisticians are likely to ignore the possibility of contamination in the data (for instance, the occurrence of a small fraction of asymmetric errors) and prefer estimates with small standard errors. In the location-dispersion case, the classical choices would be

ȳ = Sample Average = (1/n) Σ yᵢ,
σ̂₁ = Sample Standard Deviation = √( (1/(n − 1)) Σ (yᵢ − ȳ)² ).

We think that a more reasonable criterion for choosing estimates would be to achieve a large probability of a small estimation error. Mathematically:

P( |θ̂ − θ| < d ) = large,  (1)

where d is some specified small number. A complementary (but not identical) criterion would be to use estimates with a small probability of a large estimation error. Mathematically,

P( |θ̂ − θ| > D ) = small,  (2)

where D is some relatively large value. We will show that (1) and (2) can be better achieved by using

θ̂ = Sample Median,
σ̂ = MAD = Median|yᵢ − θ̂| / Φ⁻¹(3/4).

Perhaps more importantly, we will show that in the case of the median and the MAD the worst case biases are rather small and can be measured quite accurately.

Table 10.1. Probability of small error for the average. 5% contaminated standard normal; normal contamination with mean 4 and standard deviation 0.1.

d       n = 100   n = 200   n = 500
0.05    0.060     0.017     0.001
0.10    0.146     0.058     0.016
0.15    0.306     0.217     0.136

Table 10.2. Probability of small error for the median. 5% contaminated standard normal; normal contamination with mean 4 and standard deviation 0.1.

d       n = 100   n = 200   n = 500
0.05    0.264     0.321     0.353
0.10    0.496     0.647     0.721
0.15    0.681     0.832     0.936

Figures 1 and 2 show that the histograms of the estimation errors of both the average and the median are shifted to the right as a result of the contamination.
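The probabilities in Tables 10.1 and 10.2 can be approximated by Monte Carlo simulation. The sketch below is one natural implementation of the experiment described above (ours, not the authors' code; the random seed, the Bernoulli placement of contaminants and the 1000 replications are assumptions):

```python
# A minimal Monte Carlo sketch of the experiment behind Tables 10.1 and 10.2:
# estimate P(|theta_hat - theta| < d) for the average and the median when each
# observation is N(0, 1) with probability 0.95 and N(4, 0.1^2) with
# probability 0.05 (so the true location is theta = 0).
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

def prob_small_error(estimator, n, d, eps=0.05, nrep=1000):
    """Fraction of nrep contaminated samples with |estimator(x) - 0| < d."""
    hits = 0
    for _ in range(nrep):
        x = rng.normal(0.0, 1.0, n)               # core model: N(0, 1)
        bad = rng.random(n) < eps                 # flag ~5% of observations
        x[bad] = rng.normal(4.0, 0.1, bad.sum())  # replace them by N(4, 0.1^2)
        hits += abs(estimator(x)) < d
    return hits / nrep

for n in (100, 200, 500):
    for d in (0.05, 0.10, 0.15):
        print(f"n = {n:3d}, d = {d:.2f}:  "
              f"average {prob_small_error(np.mean, n, d):.3f}   "
              f"median {prob_small_error(np.median, n, d):.3f}")
```

Up to simulation error, the output reproduces the pattern of the tables: the average's probability of a small error decays as n grows, while the median's increases.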
Figure 1: One thousand averages from contaminated samples. 95% of the data are standard normal and 5% are normal with mean 4 and standard deviation 0.1.

[Figure 1 consists of three histogram panels, (a) n = 100, (b) n = 200 and (c) n = 500, with horizontal axis "Estimation Error" ranging from -0.6 to 0.6.]

Figure 2: One thousand medians from contaminated samples. 95% of the data are standard normal and 5% are normal with mean 4 and standard deviation 0.1.

[Figure 2 consists of the same three panels, (a) n = 100, (b) n = 200 and (c) n = 500, with horizontal axis "Estimation Error" ranging from -0.6 to 0.6.]

Notice that the effect of the contamination is much larger in the case of the mean, and so the probabilities of "small errors" for the average are rather small. An interesting phenomenon here is that the performance of both the average and the median worsens for larger sample sizes. In other words, larger samples of poor quality do not help the estimation process much, and have the effect of decreasing the probability of a small estimation error!

Maxbias Bound for the Median

Suppose we have a large sample from a distribution F containing at most a fraction ε (that is, 100ε%) of contamination. That is, F belongs to the contamination neighborhood

F_ε = {F : F = (1 − ε) F0 + ε H}.

Suppose we wish to bound the absolute difference between the median, M(F), of the contaminated distribution F and the median, M(F0), of the core (uncontaminated) distribution:

D(F) = |M(F) − M(F0)|.

Huber (1964) showed that

|M(F) − M(F0)| / σ0 ≤ F0⁻¹( 1 / (2(1 − ε)) ) = B(ε),  (3)

and therefore

D(F) ≤ σ0 B(ε), for all F ∈ F_ε.  (4)

That is, D(F) is bounded by σ0 B(ε). Unfortunately, in practice σ0 is seldom known and must be estimated by a robust scale functional S(F), e.g., the MAD. But the quantity

K̃(ε) = S(F) B(ε)

is not an upper bound for D(F), because in some situations S(F) might underestimate σ0. For instance, if

F = 0.90 N(0, 1) + 0.10 δ0.15,  (5)

where δ0.15 is a point mass at 0.15, then (see Problem ??)

Median(F) = 0.1397,  (6)
MAD(F) = 0.8818,  (7)

and so

|M(F) − M(F0)| = 0.1397 > MAD(F) B(0.10) = 0.8818 × 0.1397 = 0.1232.

Definition 1. A quantity K(ε) such that S(F) K(ε) is an upper bound for D(F) is called a bias bound.

In previous sections we derived formulas for the explosion and implosion biases of the dispersion functional S:

S_S+(ε) = sup_{F ∈ F_ε} S(F) / σ0,  (8)
S_S−(ε) = inf_{F ∈ F_ε} S(F) / σ0.  (9)

Using (4), (8) and (9) we obtain

D(F) ≤ σ0 B_M(ε) = S(F) (σ0 / S(F)) B_M(ε) ≤ S(F) B_M(ε) / S_S−(ε),

and so K(ε) = B_M(ε) / S_S−(ε) is an example of a bias bound. A refinement (of practical value when ε > 0.2) is provided by the following lemma.

Lemma 2. Let M(F) be an equivariant location functional with maxbias function B_M(ε) and breakdown point equal to 1/2. Let S(F) be a dispersion M-functional with score function χ which is even, bounded, monotone on [0, ∞), continuous at 0 with 0 = χ(0) < χ(∞) = 1, and with at most a finite number of discontinuities. Suppose that S(F) has breakdown point 1/2, that is, E_{F0}{χ(X)} = 1/2. Let γ(t) be defined as the unique solution of

(1 − ε) E_{F0} χ[(X − t)/γ(t)] = 1/2.

Then

K1(ε) = sup_{|t| ≤ B_M(ε)} |t| / γ(t)

is a bias bound for M(F).

Proof: Since M is location equivariant, we can assume without loss of generality that θ0 = 0 and σ0 = 1. Let

F_{ε,t} = {F ∈ F_ε : M(F) = t}.

Notice that |M(F)| ≤ B_M(ε) for all F ∈ F_ε, and therefore F_ε = ∪_{|t| ≤ B_M(ε)} F_{ε,t}. We have that

sup_{F ∈ F_ε} |M(F) / S(F)| = sup_{|t| ≤ B_M(ε)} sup_{F ∈ F_{ε,t}} |t| / S(F).  (10)

Now, for each F = (1 − ε) F0 + ε F̃ ∈ F_{ε,t}, it holds that

S(F) = sup{ s > 0 : (1 − ε) E_{F0} χ((X − t)/s) + ε E_{F̃} χ((X − t)/s) > 1/2 },

and therefore S(F) ≥ γ(t). This fact, together with (10), proves the result.
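Lemma 2 can be evaluated numerically for the median with S = MAD. The sketch below is ours, not the authors' code: it assumes the MAD's score function is the indicator χ(u) = 1{|u| > Φ⁻¹(3/4)} (which satisfies E_{F0}χ(X) = 1/2, as the lemma requires), and the grid size and root-finding bracket are arbitrary choices. The output can be checked against Table 10.3 below.

```python
# A minimal sketch of the bias bound K1(eps) of Lemma 2 for the median with
# S = MAD, taking chi(u) = 1{|u| > c} with c = Phi^{-1}(3/4), so that
# E_{F0} chi((X - t)/g) = P(|X - t| > c g) for X ~ N(0, 1).
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

c = norm.ppf(0.75)  # MAD normalizing constant, about 0.6745

def B_M(eps):
    """Maxbias of the median at the standard normal core model (Huber, 1964)."""
    return norm.ppf(1.0 / (2.0 * (1.0 - eps)))

def gamma(t, eps):
    """Unique root g of (1 - eps) * P(|X - t| > c g) = 1/2, with X ~ N(0, 1)."""
    f = lambda g: (1 - eps) * (1 - norm.cdf(t + c * g) + norm.cdf(t - c * g)) - 0.5
    return brentq(f, 1e-9, 10.0)

def K1(eps, ngrid=200):
    """Bias bound K1(eps) = sup over |t| <= B_M(eps) of |t| / gamma(t)."""
    ts = np.linspace(1e-9, B_M(eps), ngrid)
    return max(t / gamma(t, eps) for t in ts)

for eps in (0.05, 0.10, 0.20, 0.30):
    print(f"eps = {eps:.2f}:  B_M = {B_M(eps):.3f}   K1 = {K1(eps):.3f}")
```

Up to numerical tolerance, these values should agree with the B_M(ε) and K1(ε) rows of Table 10.3.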
Table ?? gives the values of B_M(ε), K(ε) and K1(ε) for the median when S = MAD, for several values of ε and for F0 = Φ, the standard normal distribution. Notice that K(ε) and K1(ε) are larger than B_M(ε) because they take into account the possible underestimation of σ0.

Table 10.3. Maxbias and bias bounds for the median (S = MAD) when F0 is the standard normal distribution.

ε         0.05    0.10    0.15    0.20    0.25    0.30
B_M(ε)    0.066   0.140   0.223   0.319   0.431   0.566
K1(ε)     0.070   0.159   0.271   0.417   0.614   0.889
K(ε)      0.070   0.160   0.278   0.440   0.675   1.043

It is clear from the discussion above that the link between maxbias curves and bias bounds is not totally trivial in the location model, and it is even more involved in the regression setup. In the case of regression estimates we face a more challenging situation because maxbias curves for regression estimates are derived using a normalized distance (quadratic form) between the asymptotic and the true values of the regression coefficients. The normalization is based on a certain unknown scatter matrix of the regressors.

Ideally, bias bounds for robust estimates of the regression coefficients should be reported together with the point estimates and their standard errors. So far this approach has been hindered by the fact that the maximum bias depends on the joint distribution of the regressors, and the available maxbias formulas rely on unrealistic assumptions (a regression-through-the-origin model and elliptical regressors). Our results (see Theorem ??) lift these theoretical hindrances and open the ground for the possible computation of bias bounds using data-based estimates of the regressors' distribution, in the case of robust regression estimates satisfying equation (??) below. Similar results for other classes of robust estimates, e.g., one-step Newton-Raphson estimates (Simpson et al. (1992)), projection estimates (Maronna and Yohai (1993)) and maximum depth estimates (Rousseeuw ???) would be desirable.