CHAPTER 2

Descriptive Statistics

We begin with the set of data below, with the measurements indicating total protein, measured in µg/ml. This is an example of raw data.

 76.33   78.15   58.50   54.07   59.20   74.78  106.00  153.56   62.32   59.76
 57.68   77.63  149.49   54.38   55.47   51.70   85.40   41.98   69.91  128.40
 88.17   84.70   44.40   57.73   88.78   86.24   95.06  114.79   53.07   72.30
 59.36   67.10  109.30   82.60   62.80   61.90   77.40   57.90   91.47   71.50
 61.70   61.10   63.96   54.41   83.82   79.55   70.17   55.05  100.36   51.16
 72.10   73.53   47.23   35.90   72.20   66.60   95.33   73.50   62.20   67.20
 44.73

This is a sample of size n = 61. What can we tell about this data in its current form? Not much, actually.

Ordered Array – data arranged from smallest to largest (usually).

So we arrange our data into an ordered array (read across each row).

 35.90   41.98   44.40   44.73   47.23   51.16   51.70   53.07   54.07   54.38
 54.41   55.05   55.47   57.68   57.73   57.90   58.50   59.20   59.36   59.76
 61.10   61.70   61.90   62.20   62.32   62.80   63.96   66.60   67.10   67.20
 69.91   70.17   71.50   72.10   72.20   72.30   73.50   73.53   74.78   76.33
 77.40   77.63   78.15   79.55   82.60   83.82   84.70   85.40   86.24   88.17
 88.78   91.47   95.06   95.33  100.36  106.00  109.30  114.79  128.40  149.49
153.56

Now what can we say about our data?
– the minimum value is 35.90 and the maximum is 153.56.
– the middle of the data is in the 60's or 70's.

Even ordering this data does not give us a good picture of what is happening.

Grouped Data – the Frequency Distribution

We need to select a set of contiguous, non-overlapping intervals, referred to as the class intervals, such that each value in the set of observations can be placed in exactly one interval. Generally, 5 ≤ number of intervals k ≤ 15. One can use Sturges' rule as a guide:

$$k = 1 + 3.322 \log_{10} n$$

where n is the number of observations. In our example, n = 61, giving us $k = 1 + 3.322 \log_{10} 61 \approx 6.93$. So, rounding off, k = 7. From our data, the range is

$$R = \text{maximum} - \text{minimum} = 153.56 - 35.90 = 117.66,$$
so the interval width is

$$w = \frac{R}{k} = \frac{117.66}{7} = 16.81.$$

Now, when the nature of the data makes them appropriate, class interval widths of 5, 10, or multiples of 10 units make the summarization more comprehensible. Here we will choose class intervals of width 20 µg/ml, with the first interval beginning at 30. We will label the intervals by their midpoints.

Class intervals     Frequency   Midpoint
 30 ≤ x <  50            5          40
 50 ≤ x <  70           26          60
 70 ≤ x <  90           20          80
 90 ≤ x < 110            6         100
110 ≤ x < 130            2         120
130 ≤ x < 150            1         140
150 ≤ x < 170            1         160
                      -----
                        61

For computations with grouped data, each element in an interval is given the value of the midpoint of the interval. Thus each of the 26 values in the interval 50 ≤ x < 70 is treated as though it is 60.

Note. A value falling on the interval boundary is placed in the higher valued interval (to the right on a number line).

Although we can now see where the majority of the data lies and how it is spread (and graphs add to this), the data items lose their individual values to the midpoint value of the interval in which they lie.

Relative Frequencies – the proportion of values falling into a class interval. We divide the number of values in each category by the total number of values. There are times when we will interpret the relative frequencies as the probability of occurrence within a given interval, called the experimental probability or the empirical probability.

In the following table we also incorporate cumulative frequencies and relative cumulative frequencies.

                                          Cumulative   Relative    Cumulative Relative
Class intervals     Midpoint   Frequency   Frequency   Frequency   Frequency
 30 ≤ x <  50           40          5           5        .0820        .0820
 50 ≤ x <  70           60         26          31        .4262        .5082
 70 ≤ x <  90           80         20          51        .3279        .8361
 90 ≤ x < 110          100          6          57        .0984        .9344
110 ≤ x < 130          120          2          59        .0328        .9672
130 ≤ x < 150          140          1          60        .0164        .9836
150 ≤ x < 170          160          1          61        .0164       1.0000
                                 -----                   ------
                                   61                    1.0001

Except for round-off errors, the 1.0001 in the Relative Frequency column should always be 1.0000.
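The steps above (Sturges' rule, then counting values into half-open class intervals and accumulating relative frequencies) can be sketched in Python. This is only an illustrative sketch, not the notes' Maple worksheet: the names `sturges_k` and `frequency_table` are hypothetical, and the function assumes every value lies inside the chosen intervals.

```python
import math

# Total protein measurements (µg/ml) from the raw data, n = 61.
protein = [
    76.33, 78.15, 58.50, 54.07, 59.20, 74.78, 106.00, 153.56, 62.32, 59.76,
    57.68, 77.63, 149.49, 54.38, 55.47, 51.70, 85.40, 41.98, 69.91, 128.40,
    88.17, 84.70, 44.40, 57.73, 88.78, 86.24, 95.06, 114.79, 53.07, 72.30,
    59.36, 67.10, 109.30, 82.60, 62.80, 61.90, 77.40, 57.90, 91.47, 71.50,
    61.70, 61.10, 63.96, 54.41, 83.82, 79.55, 70.17, 55.05, 100.36, 51.16,
    72.10, 73.53, 47.23, 35.90, 72.20, 66.60, 95.33, 73.50, 62.20, 67.20,
    44.73,
]


def sturges_k(n):
    """Sturges' rule: suggested number of class intervals for n observations."""
    return 1 + 3.322 * math.log10(n)


def frequency_table(data, start, width, k):
    """Tabulate data into k half-open intervals [lo, lo + width).

    Assumes every value satisfies start <= x < start + k * width.
    A value on a boundary falls into the higher interval.
    """
    freq = [0] * k
    for x in data:
        freq[int((x - start) // width)] += 1
    n = len(data)
    rows, cumulative = [], 0
    for i, f in enumerate(freq):
        cumulative += f
        lo = start + i * width
        rows.append({
            "lo": lo,
            "hi": lo + width,
            "midpoint": lo + width / 2,
            "freq": f,
            "cum_freq": cumulative,
            "rel_freq": f / n,
            "cum_rel_freq": cumulative / n,
        })
    return rows


k = round(sturges_k(len(protein)))          # Sturges' guide for n = 61
table = frequency_table(protein, start=30, width=20, k=k)
for row in table:
    print(row["lo"], "<= x <", row["hi"], row["midpoint"], row["freq"],
          row["cum_freq"], round(row["rel_freq"], 4),
          round(row["cum_rel_freq"], 4))
```

Because the unrounded relative frequencies are summed here, the cumulative relative frequency ends at exactly 61/61; the 1.0001 in the table arises only from adding the rounded column entries.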
Frequency Histogram and Frequency Polygon – special types of bar and line graphs.

Here we show the frequency polygon superimposed over the frequency histogram, as created in Maple. They are commonly separate graphs.

In this case, the bars of the histogram are labeled by their midpoints on the horizontal axis. The points on the horizontal axis where the bars meet are called cut points, which may be used instead of the midpoints to label the horizontal axis. The frequency polygon is always labeled by the midpoints.

The area under the histogram is 61 × 20 = 1220 (n × interval width). With the lines of the frequency polygon joining the midpoints of the bars along with the midpoints of the adjoining intervals, the area of the frequency polygon is the same as that of the frequency histogram.

Suppose we look at the same data with class intervals of width 10. The following table is also from Maple.

The class intervals are not labeled, but are class width, frequency, relative frequency, cumulative frequency, and relative cumulative frequency. The frequency histogram follows.

With this histogram, the two values over 130 appear to be outliers (somewhat disjoint from the rest of the data).

Relative Frequency Histogram and Relative Frequency Polygon

Maple. See hist.mw or hist.pdf.

Stem-and-Leaf Displays – bear a strong resemblance to the histogram and serve the same purpose.

Here are the ages of 48 students in a statistics course:

1) Use the first part of the data as a stem – write them vertically.
2) Use the last part as a leaf, in increasing order – we sometimes truncate or round – leaves are one digit only.

The last step is to put the leaves in increasing order.

We can split stems to show more detail: 0–4 and 5–9.

Advantages
– quick visual picture of the data
– see the actual values

Disadvantages
– best for small data sets (n ≤ 100)
– can give a poor picture of the data

Statistic – a descriptive measure computed from a sample

Parameter – a descriptive measure computed from a population

Measures of Central Tendency – mean, median, and mode. We want a single value that is typical of the data as a whole.

(Arithmetic) Mean – average.

X = random variable (RV)
$x_i$ = specific values of X
N = number of values in a finite population
n = number of values in a sample

For ungrouped data:

population: $$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

sample: $$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example (Protein).

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{4473.08}{61} = 73.32918033$$

For grouped data:

Class intervals     Midpoint = x_i   Frequency = f_i    x_i f_i
 30 ≤ x <  50            40                5               200
 50 ≤ x <  70            60               26              1560
 70 ≤ x <  90            80               20              1600
 90 ≤ x < 110           100                6               600
110 ≤ x < 130           120                2               240
130 ≤ x < 150           140                1               140
150 ≤ x < 170           160                1               160
                                        -----            ------
                                          61              4500

$$\bar{x} = \frac{\sum_{i=1}^{7} x_i f_i}{n} = \frac{\sum_{i=1}^{7} x_i f_i}{61} = \frac{4500}{61} = 73.7704918$$

Properties of the Mean

(1) Uniqueness – for a given set of data, there is exactly one arithmetic mean.
(2) Simplicity – the arithmetic mean is easily understood and easy to compute.
(3) Since each and every value in a set of data enters into the computation of the mean, it is affected by each value. Extreme values, therefore, have an influence on the mean and, in some cases, can so distort it that it becomes undesirable as a measure of central tendency.

Outliers (Extreme Values) – values that deviate appreciably from most of the measurements in a data set.

Robust Estimators – estimators that are insensitive to outliers.

Trimmed Mean – a robust estimator of central tendency. For a set of sample data containing n measurements we calculate the 100α percent trimmed mean as follows:

(1) Order the measurements.
(2) Discard the smallest 100α percent and the largest 100α percent of the measurements. The recommended value of α is something between .1 and .2.
(3) Compute the arithmetic mean of the remaining measurements.

Example (Protein).

(1) The 5% trimmed mean (removing 3 elements from each end of the data) is 71.2609090909090952.
(2) The 10% trimmed mean (removing 6 elements from each end of the data) is 70.3293877551020472.
(3) The 20% trimmed mean (removing 12 elements from each end of the data) is 69.3051351351351315.

Median – a value that divides the ordered array into two equal parts. We order the data points from smallest to largest and then take the $\frac{n+1}{2}$th item in order.

Example.

(1) 1 3 8 13 2000 $\implies \frac{n+1}{2} = \frac{6}{2} = 3$, so we take the 3rd item: median = 8.
(2) 1 5 8 11 13 21 $\implies \frac{n+1}{2} = \frac{7}{2} = 3.5$, so we take the value halfway between the 3rd and 4th items: median $= \frac{8 + 11}{2} = 9.5$.
(3) For our data set with n = 61, the median of the ungrouped data is 69.91.
(4) For the grouped data (the frequency table used to compute the grouped mean), the median is 60, the 31st element of the set where each data point takes on the value of the midpoint of its class interval. Does this seem like a good measure of central tendency in this case? Obviously not! When your only source is grouped data, don't put too much confidence in the mean and median.

Properties of the Median

(1) Uniqueness – as was true with the mean, there is a unique median for a given set of data.
(2) Simplicity – the median is easy to calculate.
(3) Robustness – it is not as drastically affected by extreme values as is the mean.

Mode – the value that occurs most frequently. If all the data items are different, there is no mode. A set of data may have more than one mode (this is common for grouped data). A data set with two modes is called bimodal.

Skewness – classification of data distributions on the basis of whether they are symmetric or asymmetric.

(1) Symmetric – the left half of its graph (histogram or frequency polygon) will be a mirror image of its right half.
(2) Asymmetric – not symmetric.

Definition. If the graph (histogram or frequency polygon) of a distribution is asymmetric, the distribution is said to be skewed.
If a distribution is not symmetric because its graph extends further to the right than to the left, that is, if it has a long tail to the right, we say that the distribution is skewed to the right or positively skewed. If a distribution is not symmetric because its graph extends further to the left than to the right, that is, if it has a long tail to the left, we say that the distribution is skewed to the left or negatively skewed.

The Skewness Statistic

$$\text{Skewness} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}} = \frac{\sqrt{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{(n-1)\sqrt{n-1}\; s^3}$$

The skewness statistic is 0 for a perfectly symmetric distribution, positive for a positively skewed distribution (skewed to the right), and negative for a negatively skewed distribution (skewed to the left).

Typically, for unimodal distributions, if it is skewed to the left, mean < median < mode, and if it is skewed to the right, mode < median < mean. If you set a distribution on a fulcrum, the mean is where it balances. The median is the point that divides the area in half, and the mode is the highest point.

Measures of Dispersion – describe the variation, spread, and scatter of the distribution.

Range – the difference between the largest and smallest values in a set of observations.

$$\text{Range} = x_L - x_S$$

This conveys minimal information and is a poor measure for large samples.

Variance – measures dispersion based on how the data points are scattered about the mean.

Sample Variance (ungrouped)

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n-1)}$$

Problem (Page 53 #2.5.2). Here $\bar{x} = 540$.

          x_i     x_i − x̄    (x_i − x̄)²      x_i²
          500       −40         1600         250000
          570        30          900         324900
          560        20          400         313600
          570        30          900         324900
          450       −90         8100         202500
          560        20          400         313600
          570        30          900         324900
         -----     -----       ------       --------
Total    3780        0 *       13200        2054400

* – except for rounding errors, this is always 0.

$$s^2 = \frac{13200}{6} = 2200$$

or

$$s^2 = \frac{7(2054400) - 3780^2}{7(6)} = 2200.$$

Example (Protein). $s^2 = 548.942654316940$.
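As a check on the two equivalent forms of the sample variance, the computation for the seven values of Problem 2.5.2 can be sketched in Python. The function names `variance_definition` and `variance_shortcut` are hypothetical labels chosen for this illustration.

```python
def variance_definition(xs):
    """Sample variance from the definition: sum of squared deviations / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]      # these sum to 0 (up to rounding)
    return sum(d * d for d in deviations) / (n - 1)


def variance_shortcut(xs):
    """Sample variance from the computational form:
    (n * sum of squares - square of the sum) / (n * (n - 1))."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


values = [500, 570, 560, 570, 450, 560, 570]   # data of Problem 2.5.2
s2_definition = variance_definition(values)
s2_shortcut = variance_shortcut(values)
print(s2_definition, s2_shortcut)
```

Both forms agree on this data. With values that are large relative to their spread, the shortcut form can lose precision in floating-point arithmetic, so the definitional form is usually the safer one to code.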
Sample Variance (grouped)

$$s^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i - 1}$$

Example (Protein). $\bar{x} = 73.77$

Class intervals     x_i    x_i − x̄    (x_i − x̄)²    f_i    (x_i − x̄)² f_i
 30 ≤ x <  50        40     −33.77     1140.4129      5       5702.0645
 50 ≤ x <  70        60     −13.77      189.6129     26       4929.9354
 70 ≤ x <  90        80       6.23       38.8129     20        776.2580
 90 ≤ x < 110       100      26.23      688.0129      6       4128.0774
110 ≤ x < 130       120      46.23     2137.2129      2       4274.4258
130 ≤ x < 150       140      66.23     4386.4129      1       4386.4129
150 ≤ x < 170       160      86.23     7435.6129      1       7435.6129
                                                    -----    ------------
                                                      61      31632.7869

$$s^2 = \frac{31632.7869}{60} = 527.213115$$

Notice how the variance changes with the grouping.

We divide by $n - 1$ instead of $n$, and by $\sum f_i - 1$ instead of $\sum f_i$, in order to use the sample variance in inference procedures discussed later. This is because dividing by $n - 1$ better approximates the population variance (it gives an unbiased estimator). Also, we say we have $n - 1$ degrees of freedom, i.e., once we have made $n - 1$ choices, the last choice is determined.

Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Problem – the variance units are the square of the data units.

Standard Deviation (SD) – the square root of the variance – has the same units as the data.

Sample SD: $$s = \sqrt{s^2}$$

Problem (Page 53 #2.5.2). $$s = \sqrt{2200} = 46.9042$$

Example (Protein). $$s = \sqrt{527.213115} = 22.9611$$

Population SD: $$\sigma = \sqrt{\sigma^2}$$

Coefficient of Variation – used for comparing the variation of two or more distributions. This would seem to require ratio scales. The coefficient of variation expresses the SD as a percentage of the mean.

$$CV = \frac{s}{\bar{x}}(100)$$

Example.

$\bar{x} = 10$, $s = 5$, $CV = \frac{5}{10}(100) = 50\%$

vs.

$\bar{x} = 100$, $s = 5$, $CV = \frac{5}{100}(100) = 5\%$

Five Number Summary

Definition. Given a set of n observations $x_1, x_2, \ldots, x_n$, the pth percentile P is the value of X such that p percent or less of the observations are less than P and (100 − p) percent or less of the observations are greater than P.

Notation. $P_{10}$ denotes the 10th percentile, etc. $P_{25}$ is called the first quartile ($Q_1$). $P_{50}$, the median, is the middle or second quartile ($Q_2$). $P_{75}$ is the third quartile ($Q_3$).
1st quartile: $Q_1 = \frac{n+1}{4}$th ordered observation

Example (Protein).

$$\frac{n+1}{4} = \frac{61+1}{4} = \frac{62}{4} = 15.5$$

Thus take the number 1/2 way from the 15th to the 16th observation.

$$Q_1 = \underbrace{57.73}_{15\text{th}} + .5(\underbrace{57.90}_{16\text{th}} - \underbrace{57.73}_{15\text{th}}) = 57.815$$

2nd quartile: $Q_2 = \frac{2(n+1)}{4} = \frac{n+1}{2}$th ordered observation

Example (Protein).

$$\frac{n+1}{2} = \frac{61+1}{2} = \frac{62}{2} = 31$$

Thus take the 31st observation.

$$Q_2 = 69.91$$

3rd quartile: $Q_3 = \frac{3(n+1)}{4}$th ordered observation

Example (Protein).

$$\frac{3(n+1)}{4} = \frac{3(61+1)}{4} = \frac{3(62)}{4} = \frac{186}{4} = 46.5$$

Thus take the number 1/2 way from the 46th to the 47th observation.

$$Q_3 = \underbrace{83.82}_{46\text{th}} + .5(\underbrace{84.70}_{47\text{th}} - \underbrace{83.82}_{46\text{th}}) = 84.26$$

The five-number summary is then

minimum – Q1 – median – Q3 – maximum

Example (Protein). The five-number summary is

35.90 – 57.815 – 69.91 – 84.26 – 153.56

Definition. The interquartile range (IQR) is the difference between the third and first quartiles:

$$IQR = Q_3 - Q_1.$$

Box-and-Whisker Plots (or Boxplots) – This is a graphical representation of the five-number summary. It can be drawn vertically (left) or horizontally (right).

The box shows the interquartile range, extending from Q1 to Q3. The width of the box is arbitrary. The line through the box shows the median. The whiskers extend from the box to the minimum and maximum values.

It is different in SPSS. The whiskers extend to a maximum of 1.5(IQR) beyond the box. Values 1.5(IQR) to 3(IQR) beyond the box are labeled with ○ and are termed outliers. Values beyond 3(IQR) are labeled with ∗ and are termed extremes.

Kurtosis – a measure of the degree to which a distribution is "peaked" or flat in comparison to a normal distribution, whose graph is characterized by a bell shape. The names of 3 basic types of curves are given below.

$$\text{Kurtosis} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^2} - 3 = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{(n-1)^2 s^4} - 3.$$

Summary
In describing the center and dispersion of a data distribution, one usually provides either the mean and standard deviation or the five-number summary. The choice depends on the shape of the distribution: mean and standard deviation for symmetric data, and the five-number summary for non-symmetric data.

Maple. See centdist.mw and centdist.pdf.
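Both kinds of summary, together with the skewness statistic that indicates which one suits the data, can be sketched in Python for the protein sample. This is a hedged illustration rather than the contents of the Maple worksheets: the function names are hypothetical, the quartiles use the (n+1)/4, (n+1)/2, 3(n+1)/4 position rule from this chapter, and at least three observations are assumed. Statistical software often uses other quartile conventions, so its output may differ slightly.

```python
import math

# Total protein measurements (µg/ml), n = 61.
protein = [
    76.33, 78.15, 58.50, 54.07, 59.20, 74.78, 106.00, 153.56, 62.32, 59.76,
    57.68, 77.63, 149.49, 54.38, 55.47, 51.70, 85.40, 41.98, 69.91, 128.40,
    88.17, 84.70, 44.40, 57.73, 88.78, 86.24, 95.06, 114.79, 53.07, 72.30,
    59.36, 67.10, 109.30, 82.60, 62.80, 61.90, 77.40, 57.90, 91.47, 71.50,
    61.70, 61.10, 63.96, 54.41, 83.82, 79.55, 70.17, 55.05, 100.36, 51.16,
    72.10, 73.53, 47.23, 35.90, 72.20, 66.60, 95.33, 73.50, 62.20, 67.20,
    44.73,
]


def value_at_position(ordered, pos):
    """Value at 1-based position `pos` of an ordered list.

    A fractional position interpolates linearly between the two
    neighbouring observations (e.g. 15.5 is halfway from 15th to 16th).
    """
    lower = math.floor(pos)
    frac = pos - lower
    value = ordered[lower - 1]
    if frac > 0:
        value += frac * (ordered[lower] - ordered[lower - 1])
    return value


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum); assumes len(data) >= 3."""
    ordered = sorted(data)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q2 = value_at_position(ordered, (n + 1) / 2)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, q2, q3, ordered[-1]


def mean_sd_skewness(data):
    """Sample mean, sample SD (n - 1 divisor), and the skewness statistic."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)          # sum of squared deviations
    sd = math.sqrt(ss / (n - 1))
    skewness = math.sqrt(n) * sum((x - mean) ** 3 for x in data) / ss ** 1.5
    return mean, sd, skewness


summary = five_number_summary(protein)
mean, sd, skewness = mean_sd_skewness(protein)
print("five-number summary:", summary)
print("mean:", mean, "SD:", sd, "skewness:", skewness)
```

For this sample the skewness statistic is positive (a long right tail), which by the guideline above favors reporting the five-number summary over the mean and standard deviation.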