Download Numerical Summaries

Numerical Summaries COMP 245 STATISTICS Dr N A Heard 1 Introduction Once a sample of data has been drawn from the population of interest, the first task of the statistical analyst might be to calculate various numerical summaries of these data. This procedure serves two purposes: • The first is exploratory. Calculating statistics which characterise general properties of the sample, such as location, dispersion, or symmetry, helps us to understand the data we have gathered. This aim can be greatly aided by the use of graphical displays representing the data. • The second, as we shall see later, is that these summaries will commonly provide the means for relating the sample we have learnt about to the wider population in which we are truly interested. Later, we will assess the properties of these numerical summaries as estimators of population parameters. 2 2.1 Summary Statistics Measures of Location Sample Mean The arithmetic mean (or just mean for short) of a sample of real values ( x1 , . . . , xn ) is the sum of the values divided by their number. That is, x̄ = 1 n xi n i∑ =1 This is often colloquially referred to as the average. Ex. The mean of (7, 2, 4, 12, 5) is 7 + 2 + 4 + 12 + 5 30 = = 6. 5 5 Order Statistics For a sample of real values ( x1 , . . . , xn ), define the ith order statistic x(i) to be the ith smallest value of the sample. So 1 • x(1) ≡ min( x1 , . . . , xn ) is the smallest value; • x(2) is the next smallest, and so on, up to .. . • x(n) ≡ max( x1 , . . . , xn ) being the largest value. Furthermore, in an abuse of notation it will be useful to define x(i+α) for integer 1 ≤ i < n and non-integer α ∈ (0, 1) as the linear interpolant x ( i + α ) = ( 1 − α ) x ( i ) + α x ( i +1) , where the order statistics x(i) are defined as before. r x ( i +1) x (i + α ) x (i ) r - i+1 i+α i Ex. x(4.2) = 0.8 × x(4) + 0.2 × x(5) . Sample Median The median of a sample of real values ( x1 , . . . , xn ) is the middle value of the order statistics. That is, using our extended notation, median = x({n+1}/2) ( x({n+1}/2) = x(n/2) + x(n/2+1) 2 if n is odd if n is even Ex. • The median of (7, 2, 4, 12, 5) is 5. • The median of (7, 2, 4, 12, 5, 15) is 6. Mean vs. Median The mean is sensitive to outlying points, whilst the median is not. Ex. (1, 2, 3, 4, 5) has median = mean = 3. (1, 2, 3, 4, 40) again has median = 3, but now mean = 10. The arithmetic mean is the most commonly used location statistic, followed by the median. 2 Sample Mode The mode of a sample of real values ( x1 , . . . , xn ) is the value of the xi which occurs most frequently in the sample. Ex. The mode of (3, 5, 7, 2, 10, 14, 12, 2, 5, 2) is 2. Some data sets are multimodal. Geometric Mean Two other useful measures of location (other averages). For positive data, the geometric mean s xG = n n ∏ xi . i =1 ( Note: It is easy to show that xG = exp ) 1 n log xi , the exponential of the arithmetic n i∑ =1 mean of the logs of the data. =⇒ It is thus less severely affected by exceptionally large values. Harmonic Mean The harmonic mean ( xH = 1 n 1 n i∑ x =1 i ) −1 = n ∑in=1 x1i . which is most useful when averaging rates. Note: For positive data ( x1 , . . . , xn ), Arithmetic mean ≥ geometric mean ≥ harmonic mean. 2.2 Measures of Dispersion Range The range of a sample of real values ( x1 , . . . , xn ) is the difference between the largest and the smallest values. That is range = x(n) − x(1) Ex. The range of (7, 2, 4, 12, 5) is 12 − 2 = 10. 3 1 2 3 Mean 4 5 Arithmetic Geometric Harmonic x1 2 4 6 8 10 x2 Arithmetic, geometric and harmonic means for two data points points ( x1 , x2 ), where x1 = 1. Quartiles Consider again the order statistics of a sample, x(1) , . . . , x(n) . 1 of the way through the ordered sam2 ple. (Not necessarily exactly or uniquely since there may be tied values or n even.) 1 3 Similarly, we can define the first and third quartiles respectively as being values and 4 4 of the way through the ordered sample. We defined the median so that it lay approximately Interquartile Range first quartile = x({n+1}/4) third quartile = x(3{n+1}/4) and thus we define the interquartile range as the range of the data lying between the first and third quartiles, interquartile range = third quartile − first quartile = x(3{n+1}/4) − x({n+1}/4) Five Figure Summary The five figure summary of a set of data lists, in order: • The minimum value in the sample • The lower quartile • The sample median 4 • The upper quartile • The maximum value Sample Variance The most widely used measure of dispersion is based on the squared differences between the data points and their mean, ( xi − x̄ )2 . The average (the mean) of these squared differences is the mean square or sample variance s2 = 1 n ( xi − x̄ )2 . n i∑ =1 Equivalently, it is often more convenient to rewrite this formula as s2 = 1 n 2 xi n i∑ =1 ! − 1 n xi n i∑ =1 !2 That is, the mean of the squares minus the square of the mean. Sample Standard Deviation The square root of the variance is the root mean square or sample standard deviation s s= 1 n ( xi − x̄ )2 n i∑ =1 Unlike the variance, the standard deviation is in the same units as the xi . Summary We can see analogies between the numerical summaries for location and dispersion, and their robustness properties are comparable. Location Dispersion Least Robust x (1) + x ( n ) 2 x ( n ) − x (1) More Robust Most Robust x̄ x({n+1}/2) s2 x(3{n+1}/4) − x({n+1}/4) x (1) + x ( n ) would be the midpoint of our data halfway between the minimum and 2 maximum values in the sample, which provides another alternative descriptor of location.) (where 5 2.3 Skewness Sample Skewness Skewness is a measure of asymmetry. The skewness of a sample of real values ( x1 , . . . , xn ) is given by 1 n skewness = ∑ n i =1 xi − x̄ s 3 . A sample is positively (negatively) or right (left) skewed if the upper tail of the histogram of the sample is longer (shorter) than the lower tail. Skewness Positive skew Negative skew Since the mean is more sensitive to outlying points than is the median, one might choose the median as a more suitable measure of ’average value’ if the sample is skewed. Can expect skewness for example when the data can only take positive (or only negative values) and if the values are not far from zero. Can remove skewness by transforming the data. In the case above, we need a transformation which has larger effect on the larger values: e.g. square root, log (though beware 0 values). Note: in a positively skewed sample the mean is often greater than the median. 2.4 Covariance and Correlation Sample Covariance Suppose we have a sample made up of ordered pairs of real values (( x1 , y1 ), . . . , ( xn , yn )). The value xi might correspond to the measurement of one quantity x of individual i, and yi to another quantity y of the same individual. The covariance between the samples of x and y is given by s xy = 1 n ( xi − x̄ )(yi − ȳ). n i∑ =1 6 It gives a measurement of relatedness between the two quantities x and y. As before, the covariance can be rewritten equivalently as ∑in=1 xi yi − x̄ ȳ. n s xy = Sample Correlation Note that the magnitude of s xy varies according to the scale on which the data have been measured. The correlation between the samples of x and y is defined to be r xy = = s xy s x sy ∑in=1 xi yi − n x̄ ȳ . ns x sy where s x and sy are the sample standard deviations of ( x1 , . . . , xn ) and (y1 , . . . , yn ) respectively. Unlike covariance, correlation gives a measurement of relatedness between the two quantities x and y which is scale-invariant. 3 3.1 Related Graphical Displays Box-and-Whisker Plots Box-and-Whisker/Box Plots Based on the five figure summary. • Median - middle line in the box • 3rd & 1st Quartiles - top and bottom of the box • ’Whiskers’ - extend out to any points which are within ( 23 × interquartile range) from the box • Any extreme points out to the maximum and minimum which are beyond the whiskers are plotted individually. 3.2 Empirical pmf & cdf Empirical pmf The empirical probability mass function (epmf) of a sample of real values ( x1 , . . . , xn ) is given by pn ( x ) = 1 n I ( x i = x ). n i∑ =1 For any real number x, pn ( x ) returns the proportion of the data which take the value x. The non-zero values of pn ( x ) are the normalised frequency counts of the data. Note that a mode of the sample ( x1 , . . . , xn ) yields a maximal value of pn ( x ). 7 25 20 15 Count 10 ● 0 5 ● A B C D E F Insecticide Ex. Box plots of the counts of insects found in agricultural experimental units treated with six different insecticides (A-F). Empirical cdf The empirical cumulative distribution function (ecdf) of a sample of real values ( x1 , . . . , xn ) is given by Fn ( x ) = 1 n I ( x i ≤ x ). n i∑ =1 For any real number x, Fn ( x ) returns the proportion of the data having values which do not exceed x. 1.0 Note this is a step function, with change points at the sampled values. ● ● ● ● ● ● 0.8 ● ● ● ● ● ● 0.6 ● ● ● Fn ● ● ● 0.4 ● ● 0.2 ● ● 0.0 ● ● 0 5 10 15 20 25 Count Ex. Collecting the insecticide data from before together across the different treatments. 8

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Numerical Summaries