Chap 10: Summarizing Data

10.1: Introduction
Univariate/multivariate data (random samples or batches) can be described using procedures that reveal their structure via graphical displays (empirical CDFs, histograms, ...) that are to data what PMFs and PDFs are to random variables. Numerical summaries (location and spread measures), the effects of outliers on these measures, and graphical summaries (boxplots) will be investigated.

10.2: CDF-Based Methods

10.2.1: The Empirical CDF (ECDF)
The ECDF is the data analogue of the CDF of a random variable; it is a graphical display that conveniently summarizes data sets. For a batch of numbers $x_1, \dots, x_n$,
$$F_n(x) = \frac{\#\{x_i \le x\}}{n}.$$
For a random sample $X_1, \dots, X_n$,
$$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,\,x]}(X_i),$$
where $I_A$ is the indicator function
$$I_A(t) = \begin{cases} 1, & t \in A \\ 0, & t \notin A. \end{cases}$$

The random variables $I_{(-\infty,\,x]}(X_i)$ are independent Bernoulli random variables:
$$I_{(-\infty,\,x]}(X_i) = \begin{cases} 1 & \text{with probability } F(x) \\ 0 & \text{with probability } 1 - F(x). \end{cases}$$
Hence
$$n F_n(x) = \sum_{i=1}^{n} I_{(-\infty,\,x]}(X_i) \sim \mathrm{Bin}\big(n,\, F(x)\big),$$
so that
$$E\big[F_n(x)\big] = F(x) \quad \text{and} \quad \mathrm{Var}\big[F_n(x)\big] = \frac{F(x)\big(1 - F(x)\big)}{n}.$$
Thus $F_n$ is an unbiased estimate of $F$, $\lim_{n \to \infty} \mathrm{Var}\big[F_n(x)\big] = 0$, and $F_n$ has maximum variance at the median of $F$.

10.2.2: The Survival Function
In medical or reliability studies, data sometimes consist of times of failure or death; it then becomes more convenient to use the survival function
$$S(t) = 1 - F(t), \quad T \text{ a random variable with CDF } F,$$
rather than the CDF itself.
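The ECDF and the empirical survival function defined above can be sketched in a few lines of Python (a minimal illustration; the batch of numbers is an arbitrary assumption, not data from the text):

```python
import numpy as np

def ecdf(data, x):
    """Empirical CDF: F_n(x) = (number of observations <= x) / n."""
    data = np.asarray(data, dtype=float)
    return np.mean(data <= x)

def esf(data, t):
    """Empirical survival function: S_n(t) = 1 - F_n(t)."""
    return 1.0 - ecdf(data, t)

batch = [3, 1, 4, 1, 5, 9, 2, 6]
print(ecdf(batch, 4))  # 5 of the 8 observations are <= 4, so 0.625
print(esf(batch, 4))   # 0.375
```

Note that `ecdf` averages Bernoulli indicators exactly as in the formula $F_n(x) = \frac{1}{n}\sum I_{(-\infty,x]}(X_i)$.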
The sample survival function (ESF) gives the proportion of the data greater than $t$ and is given by
$$S_n(t) = 1 - F_n(t).$$
Survival plots (plots of the ESF) may be used to provide information about the hazard function, which may be thought of as the instantaneous rate of mortality for an individual alive at time $t$, and is defined to be
$$h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log\big(1 - F(t)\big) = -\frac{d}{dt}\log S(t).$$

From page 149, the first-order method gives $\mathrm{Var}[g(X)] \approx \mathrm{Var}(X)\,[g'(\mu_X)]^2$, so
$$\mathrm{Var}\big\{\log\big(1 - F_n(t)\big)\big\} \approx \mathrm{Var}\big[F_n(t)\big]\left[\frac{1}{1 - F(t)}\right]^2 = \frac{1}{n}\,\frac{F(t)}{1 - F(t)},$$
which expresses how extremely unreliable (huge variance for large values of $t$) the empirical log-survival function is.

10.2.3: Q-Q Plots (Quantile-Quantile Plots)
A Q-Q plot is useful for comparing CDFs: it plots the quantiles of one distribution versus the quantiles of the other.

Additive treatment effect: $X \sim F$ vs $Y \sim G$ with $Y = X + h$, so that $y_p = x_p + h$ and $G(y) = F(y - h)$. The Q-Q plot is a straight line with slope 1 and $y$-intercept $h$.

Multiplicative treatment effect: $X \sim F$ vs $Y \sim G$ with $Y = cX$ for $c > 0$, so that $y_p = c\,x_p$ and $G(y) = F(y/c)$. The Q-Q plot is a straight line with slope $c$ and $y$-intercept 0.

10.3: Histograms, Density Curves & Stem-and-Leaf Plots
Kernel PDF estimate: let $X_1, \dots, X_n$ be iid with PDF $f$. An estimate of $f$ is the kernel PDF estimate $f_h$. Let
$$w_h(x) = \frac{1}{h}\,w\!\left(\frac{x}{h}\right)$$
be a smooth weight function. Then
$$f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i) = \frac{1}{nh}\sum_{i=1}^{n} w\!\left(\frac{x - X_i}{h}\right).$$
Here $h$ is a bandwidth parameter that controls the smoothness of $f_h$; it plays the role of the bin width of a histogram. Choose a "reasonable" $h$: not too big, not too small!

10.4: Location Measures
10.4.1: The Arithmetic Mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is sensitive to outliers (not robust).
10.4.2: The Median $\tilde{x}$ is a robust measure of location.
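The kernel PDF estimate $f_h$ described above can be sketched directly from its formula (a minimal illustration with a Gaussian kernel; the data values and bandwidth are arbitrary assumptions, not from the text):

```python
import numpy as np

def kernel_density(x, data, h):
    """f_h(x) = (1/(n*h)) * sum_i w((x - X_i)/h), with w the standard normal PDF."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    w = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel weights
    return w.sum() / (len(data) * h)

data = [1.2, 1.9, 2.1, 2.8, 3.3, 4.0]
grid = np.linspace(-2.0, 8.0, 2001)
dens = np.array([kernel_density(x, data, h=0.5) for x in grid])

# Since each kernel integrates to 1, so does f_h; a Riemann sum checks this.
step = grid[1] - grid[0]
print(dens.sum() * step)  # close to 1
```

Re-running with a larger or smaller `h` shows the smoothness trade-off the text warns about: a big `h` oversmooths the two data clusters into one bump, a tiny `h` produces a spike at each observation.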
10.4.3: The Trimmed Mean is another robust location measure; $0.1 \le \alpha \le 0.2$ is highly recommended.
Step 1: order the data set.
Step 2: discard the lowest $\alpha \times 100\%$ and the highest $\alpha \times 100\%$ of the observations.
Step 3: take the arithmetic mean of the remaining data.
Step 4:
$$\bar{x}_\alpha = \frac{1}{n - 2[n\alpha]}\sum_{i=[n\alpha]+1}^{n-[n\alpha]} x_{(i)}$$
is the $100\alpha\%$ trimmed mean.

The trimmed mean (discard only a certain number of the observations) is introduced as a natural compromise between the mean (discard no observations) and the median (discard all but one or two observations). Another compromise between $\bar{x}$ and $\tilde{x}$ was proposed by Huber (1981), who suggested minimizing
$$\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right) \text{ with respect to } \mu, \text{ where } \Psi(x) \text{ is to be given,}$$
or equivalently solving
$$\sum_{i=1}^{n} \psi\!\left(\frac{X_i - \mu}{\sigma}\right) = 0, \text{ where } \psi = \Psi'.$$
Its solution will be called an M-estimate.

10.4.4: M-Estimates (Huber, 1964)
Minimize $\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right)$, where
$$\Psi(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le k \\ k|x| - \frac{1}{2}k^2 & \text{if } |x| > k. \end{cases}$$
$\Psi$ is proportional to $x^2$ inside $[-k, k]$ and replaces the parabolic arcs by straight lines outside $[-k, k]$.
Big $k$: $\hat{\mu}$ comes closer to the mean $\bar{x}$ (there $\Psi(x) = \frac{1}{2}x^2$, i.e. $\psi(x) = x$).
Small $k$: $\hat{\mu}$ comes closer to the median $\tilde{x}$ (there $\Psi(x) = k|x|$, i.e. $\psi(x) = k\,\mathrm{sgn}(x)$).
$k = \infty$ corresponds to the mean $\bar{x}$ and $k = 0$ corresponds to the median $\tilde{x}$. The constant $k$ protects against outliers, i.e. observations more than $k\sigma$ away from the center; $k = \frac{3}{2}$ is suggested as a "moderate" compromise.

M-estimates coincide with MLEs, because
$$\text{minimizing } \sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right) \text{ wrt } \mu \iff \text{maximizing } \prod_{i=1}^{n} f\!\left(\frac{X_i - \mu}{\sigma}\right) \text{ wrt } \mu$$
when we use the function $\Psi(x) = -\log f(x)$, with $X_1, X_2, \dots, X_n$ iid $\sim f$. The computation of an M-estimate is a nonlinear minimization problem that must be solved by an iterative method (such as Newton-Raphson). The minimizer is unique for convex $\Psi$ functions. Here we assume $\sigma$ is known; in practice, a robust estimate of $\sigma$ (to be seen in Section 10.5) should be used instead.

10.4.5: Comparison of Location Estimates
Among the location estimates introduced in this section, which one is the best? Not easy!
For a symmetric underlying distribution, all four statistics (sample mean, sample median, $\alpha$-trimmed mean, and M-estimate) estimate the center of symmetry. For a nonsymmetric underlying distribution, these four statistics estimate four different population parameters, namely the population mean, the population median, the population trimmed mean, and a functional of the CDF determined by the weight function $\Psi$. Idea: run some simulations, compute more than one estimate of location, and pick the winner.

10.4.6: Estimating Variability of Location Estimates by the Bootstrap
Using a computer, we can generate (simulate) many samples — $B$ (large) of them, each of size $n$ — from a common known distribution $F$. From each sample we compute the value of the location estimate $\hat{\theta}$. The empirical distribution of the resulting values $\theta_1^*, \theta_2^*, \dots, \theta_B^*$ is a good approximation (for large $B$) to the distribution of $\hat{\theta}$. Unfortunately, $F$ is NOT known in general. Just plug in the empirical CDF $F_n$ for $F$ and bootstrap (= resample from $F_n$). $F_n$ is a discrete PMF with the same probability $\frac{1}{n}$ for each observed value $x_1, x_2, \dots, x_n$.

A sample of size $n$ from $F_n$ is a sample of size $n$ drawn with replacement from the observed data $x_1, \dots, x_n$; each such resample produces a $\theta_b^*$ ($b = 1, \dots, B$). Thus
$$s_{\hat{\theta}} = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\big(\theta_b^* - \bar{\theta}^*\big)^2},$$
where $\bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\theta_b^*$ is the mean of the $\theta_b^*$, $b = 1, 2, \dots, B$. Read Example A on page 368. The bootstrap distribution can be used to form an approximate CI and to test hypotheses.

10.5: Measures of Dispersion
A measure of dispersion (scale) gives a numerical indication of the "scatteredness" of a batch of numbers. The most common measure of dispersion is the sample standard deviation
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\big(X_i - \bar{X}\big)^2}.$$
Like the sample mean, the sample standard deviation is NOT robust (it is sensitive to outliers). Two simple robust measures of dispersion are the IQR (interquartile range) and the MAD (median absolute deviation from the median).
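The bootstrap estimate of variability described above can be sketched as follows (a minimal illustration using the chapter's crater-diameter data and the sample median as the location estimate; the choice of $B$, the seed, and the estimator are assumptions of this sketch):

```python
import numpy as np

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Draw B resamples of size n with replacement from the data (i.e. from F_n),
    apply the estimator to each, and return the SD of the B estimates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    thetas = np.array([estimator(rng.choice(x, size=len(x), replace=True))
                       for _ in range(B)])
    return thetas.std()   # s = sqrt((1/B) * sum_b (theta*_b - theta-bar*)^2)

data = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
        211, 269, 115, 586, 983, 115, 162, 70, 565, 114]
print(bootstrap_se(data, np.median))   # bootstrap SE of the sample median
print(bootstrap_se(data, np.mean))     # bootstrap SE of the sample mean, for comparison
```

Because the median ignores the extreme observation 3110 in most resamples, its bootstrap standard error is typically much smaller than the mean's for this skewed batch.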
10.6: Box Plots
Tukey invented a graphical display (the boxplot) that indicates the center of a data set (the median), the spread of the data (the IQR), and the presence of possible outliers. A boxplot also gives an indication of the symmetry/asymmetry (skewness) of the distribution of the data values. Later, we will see how boxplots can be effectively used to compare batches of numbers.

10.7: Conclusion
Several graphical tools were introduced in this chapter as methods of presenting and summarizing data. Some aspects of the sampling distributions of these summaries (which assume a stochastic model for the data) were discussed. Bootstrap methods (approximating a sampling distribution and functionals) were also revisited.

Parametric Bootstrap. Example: estimating a population mean.
It is known that explosives used in mining leave a crater that is circular in shape with a diameter that follows an exponential distribution, $F(x) = 1 - e^{-x/\theta}$, $x \ge 0$. Suppose a new form of explosive is tested. The sample crater diameters (cm) are as follows:
121 847 591 510 440 205 3110 142 65 1062 211 269 115 586 983 115 162 70 565 114
Here $\bar{x} = 514.15$ (sample mean) and $s = 685.60$ (sample SD). It would be inappropriate to use
$$\bar{x} \pm t_{0.95}\,\frac{s}{\sqrt{n}} = (249.07,\ 779.23)$$
as a 90% CI for the population mean via the t-curve (df = 19), because such a CI is based on the normality assumption for the parent population. The parametric bootstrap replaces the exponential population distribution $F$ with unknown mean $\theta$ by the known exponential distribution $F^*$ with mean $\theta^* = \bar{x} = 514.15$. Then resamples of size $n = 20$ are drawn from this surrogate population. Using Minitab, we can generate $B = 1000$ such samples of size $n = 20$ and compute the sample mean of each of these $B$ samples. A bootstrap CI can be obtained by trimming off 5% from each tail.
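The same parametric bootstrap can be sketched in Python instead of Minitab (a minimal illustration; with a different random stream the endpoints will not reproduce the book's values exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, B = 20, 1000
theta_star = 514.15   # mean of the fitted surrogate exponential F*

# Draw B samples of size n from the surrogate exponential population
# and compute the sample mean of each.
means = np.array([rng.exponential(theta_star, size=n).mean() for _ in range(B)])

# Trim 5% off each tail: the 50th smallest and 951st largest of the sorted means.
means.sort()
ci = (means[49], means[950])
print(ci)   # a parametric bootstrap 90% CI for the population mean
```

Sorting and taking the 50th-smallest and 951st-largest values is exactly the "trim 5% from each tail" rule for $B = 1000$.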
Thus, a parametric bootstrap 90% CI is given by (50th smallest, 951st largest) = (332.51, 726.45).

Non-Parametric Bootstrap:
If we do not assume that we are sampling from a normal population, or from a population of some other specified shape, then we must extract all the information about the population from the sample itself. Nonparametric bootstrapping builds a sampling distribution for our estimate by drawing samples with replacement from our original (raw) data. Thus, a nonparametric bootstrap 90% CI for $\theta$ is obtained by taking the 5th and 95th percentiles of $\hat{\theta}$ among these resamples.
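The nonparametric version on the same crater data can be sketched as follows (a minimal illustration; the seed and the use of `np.percentile` for the 5th/95th percentiles are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
data = np.array([121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
                 211, 269, 115, 586, 983, 115, 162, 70, 565, 114], dtype=float)
B = 1000

# Resample the raw data with replacement (i.e. sample from F_n)
# and record the sample mean of each resample.
means = np.array([rng.choice(data, size=len(data), replace=True).mean()
                  for _ in range(B)])

# The 5th and 95th percentiles of the bootstrap distribution give a 90% CI.
ci = (np.percentile(means, 5), np.percentile(means, 95))
print(ci)
```

Unlike the parametric version, no exponential (or any other) shape is assumed here: all information about the population comes from the observed sample itself.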