Stat 281: Ch. 2--Presenting Data

An engineer, consultant and statistician were driving down a steep mountain road. Suddenly, the brakes failed and the car careened down the road out of control. But halfway down, the driver somehow managed to stop the car by running it against the embankment, narrowly avoiding going over a very steep cliff. They all got out, shaken, but otherwise unharmed.
The consultant said: "To fix this problem we need to organize a committee, have meetings, write several interim reports and develop a solution through a continuous improvement process."
The engineer said: "No! That would take too long, and besides that method has never really worked. I have my trusty penknife here and will take apart the brake system, isolate the problem and correct it."
The statistician said: "No - you're both wrong! Let's all push the car back up the hill and see if it happens again. We only have a sample size of 1 here!!"

Fizzy Cola Sales (Showing first 8 of 50)
Employee   Gallons Sold
P.P.        95.00
S.M.       100.75
P.T.       126.00
P.U.       114.00
M.S.       134.25
F.K.       116.75
L.Z.        97.50
F.E.       102.25

The Goal
Display data in ways that elucidate the information contained in them
Raw Data actually contains all the information available, but it may not be easy to understand
It’s not so much the information available that counts; it’s the information you get out!

Ranked Fizzy Cola Sales
Rank   Empl.   Gal. Sold     Rank   Empl.   Gal. Sold
1      T.T.     82.50        43     R.O.    133.25
2      A.D.     88.50        44     M.S.    134.25
3      E.I.     91.00        45     O.U.    135.00
4      A.S.     93.25        46     G.H.    135.50
5.5    P.P.     95.00        47     R.T.    136.00
5.5    E.Y.     95.00        48     A.T.    137.00
7      L.Z.     97.50        49     O.O.    144.00
8      T.N.     99.50        50     R.N.    148.00

Viewing Data Directly
Ranked Data (aka an Array)
– Still contains all the information
– Can quickly see range (max and min)
– May also easily determine median, quartiles, etc.
Stem and Leaf
– Arranges ranked data into chart-like form

Fizzy Cola Stem & Leaf
 8 | 28
 9 | 135579
10 | 0234556789
11 | 02344555667889
12 | 124455688
13 | 2345567
14 | 48

More Complex Stem & Leaf (MiniTab Style)
Stem-and-Leaf of C1   N=16
Leaf Unit=0.010
  1    59 | 7
  4    60 | 148
 (5)   61 | 02669
  7    62 | 0247
  3    63 | 58
  1    64 | 3

Dot Plot for Fizzy Cola Sales
Dot plots display vertically stacked dots for each data value. They tend to bring out any “clustering” behavior in the data.
Stem & Leaf and Dot Plots begin to give us a picture of the Distribution of Data.

Summarized Data
Frequency Tables
– Grouped or ungrouped
– Frequency Distribution
– Relative Frequency Distribution
Bar Graphs
Histogram (Numeric Data Only)
Pie Charts
– Often used for Categorical Data

Fizzy Cola Frequency Table
Number of Employees in each Sales Range
Gallons Sold   Employees
80-90           2
>90-100         6
>100-110       10
>110-120       14
>120-130        9
>130-140        7
>140-150        2

Histogram of Fizzy Cola Sales

Constructing a Histogram
1. Identify the high (H) and low (L) scores. Find the range. Range = H - L.
2. Select a number of classes and a class width so that the product is a bit larger than the range.
3. Pick a starting point a little smaller than L. Count from the starting point by the width to obtain the class boundaries. Observations that fall on class boundaries are placed into the class interval to the right.
Note:
1. The class width is the difference between the upper and lower class boundaries.
2. There is no best choice for class widths, number of classes, or starting points.

Terms Used With Histograms
Symmetrical: The sides of the distribution are mirror images. There is a line of symmetry.
Uniform (rectangular): Every value appears with equal frequency.
Skewed: One tail is stretched out longer than the other. The direction of skewness is on the side of the longer tail (positively vs. negatively skewed).
J-shaped: There is no tail on the side of the class with the highest frequency.
Bimodal: The two largest classes are separated by one or more classes.
Often implies two populations are sampled.
Normal: The distribution is symmetric about the mean and bell-shaped.

Bimodal Distribution
[Figure: histogram of Blood Test values (horizontal axis, 4.2 to 9.2) against Frequency (vertical axis, 0 to 15)]

Left-Skewed Distribution
[Figure: histogram "Ages of Nuns", Age (horizontal axis, 25 to 85) against Frequency (vertical axis, 0 to 200)]

Distribution of Categorical Data
Cars Sold in One Week
Day         Number Sold
Monday      15
Tuesday     23
Wednesday   35
Thursday    11
Friday      12
Saturday    42

Basic Pie Chart
[Figure: pie chart "Cars Sold in One Week": Monday 11%, Tuesday 17%, Wednesday 25%, Thursday 8%, Friday 9%, Saturday 30%]
Pie Charts focus our attention on fractions of the whole, especially for the largest classes.

Three-D Pie Chart
[Figure: three-dimensional pie chart "Cars Sold in One Week" with the same percentages]
Three-D Pie Charts are “pretty” but can also be used to distort the image.

Manipulating 3-D Pie Charts
[Figure: the same three-dimensional pie chart, rotated and tilted]
Changing the angle or turning the pie may affect our perception of size.

Bar Charts for Categorical Data
[Figure: bar chart "Cars Sold in One Week", bars for Monday through Saturday, vertical axis from 0 to 45]
(Bar charts for categorical data are drawn with bars separated, while bars in histograms touch.)

Manipulating Bar Charts
[Figure: bar chart "Cars Sold in One Week", bars for Monday through Saturday, vertical axis running only from 10 to 40]
Cutting off the vertical axis distorts our perception of the differences between bars.

Manipulating Bar Charts
[Figure: bar chart "Cars Sold in One Week" with no vertical-axis labels; each bar carries its value (15, 23, 35, 11, 12, 42)]
Removal of labels on the vertical axis allows bars to be stretched upward to hide the differences.

Hmmm…
It is proven that the celebration of birthdays is healthy. Statistics show that people who celebrate the most birthdays become the oldest.
In earlier times, they had no statistics, so they had to fall back on lies.
(Stephen Leacock)

Measures of Central Tendency
Statistics used to locate the middle of a set of numeric data, or where the data is clustered.
The term average may be associated with all measures of central tendency.
The mode for discrete data is the value that occurs with greatest frequency.
The modal class of a histogram is the class with the greatest frequency.
A bimodal distribution has two high-frequency classes separated by classes with lower frequencies.

Summation Notation
$\sum_{i=1}^{5} i = 1 + 2 + 3 + 4 + 5 = 15$
$\sum_{i=1}^{5} i^2 = 1 + 4 + 9 + 16 + 25 = 55$
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$

The Mean
Mean: The “regular” average. The sum of all the values divided by the total number of values.
The population mean, $\mu$ (lowercase Greek mu), is the mean of all x values for the population. It is a parameter of the distribution.
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{1}{N} (x_1 + x_2 + \cdots + x_N)$
We usually cannot measure $\mu$ but would like to estimate its value.

The Sample Mean
The sample mean, $\bar{x}$ (read x-bar), is the mean of all x values for the sample. It is a statistic.
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} (x_1 + x_2 + \cdots + x_n)$
The mean can be greatly influenced by outliers. E.g. Bill Gates moves to town.

Median
Median: The value of the data that occupies the middle position when the data are ranked according to size.
The sample median (statistic) may be denoted by “x tilde”: $\tilde{x}$.
The population median (parameter), M (uppercase Greek mu), is the data value in the middle of the population.
To find the median:
1. Rank the data.
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n+1}{2}$
3. Determine the value of the median.

Mode
Mode: The mode is the value of x that occurs most frequently.
Note: If two or more values in a sample are tied for the highest frequency (number of occurrences), there is no mode.
Note: Mode, as defined here, is most applicable to categorical or discrete data. The mode for continuous data is defined differently.

Other Measures of Center
Midrange: The number midway between the maximum and minimum data values. It is found by averaging the max and min.
Midquartile: Oops, we haven’t defined quartiles yet. But this is the average of the first and third quartiles instead of the max and min.

Dispersion
How spread apart are the data?
Two populations with the same mean can have very different distributions, so we would like to measure spread somehow.
Range (max - min)
– Values in middle are ignored
– Dispersion of middle could be very different
Use the idea of deviation from the mean:
– MAD
– Variance
– Standard Deviation

Deviations from the Mean
[Figure: x-values plotted against Observation Number, with the mean and the deviations from it marked]

Some example data
Obs     Data x
1       2
2       4
3       5
4       9
Total

Calculate the mean
Obs     Data x   Mean $\bar{x}$
1       2        5
2       4        5
3       5        5
4       9        5
Total   20

Deviation From the Mean
Obs     Data x   Mean $\bar{x}$   Deviation $x - \bar{x}$
1       2        5                -3
2       4        5                -1
3       5        5                 0
4       9        5                 4
Total   20       20                0

Mean Absolute Deviation (MAD)
Obs     Data x   Mean $\bar{x}$   Deviation $x - \bar{x}$   Absolute Deviation $|x - \bar{x}|$
1       2        5                -3                        3
2       4        5                -1                        1
3       5        5                 0                        0
4       9        5                 4                        4
Sum of Absolute Deviations: 8
MAD (divide sum by n): 2

Formula
Mean Absolute Deviation $= \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$

Use of Squared Deviations
Obs     Data x   Mean $\bar{x}$   Deviation $x - \bar{x}$   Squared Deviation $(x - \bar{x})^2$
1       2        5                -3                        9
2       4        5                -1                        1
3       5        5                 0                        0
4       9        5                 4                        16
Sum of Squared Deviations, SS(x): 26
Variance (Divide Sum by n-1): 8.67
Standard Deviation (Take Square Root): 2.94

Sums of Squares
The sum of squared deviations is denoted by SS(x) and often called the “Sum of Squares for x.”
There are also other notations used, including SSx and Sxx.
$SS(x) = \sum_{i=1}^{n} (x_i - \bar{x})^2$

Variance
The Variance is the statistician’s favorite measure of dispersion, but in reports or “everyday use” the standard deviation is more commonly given.
The Standard Deviation is the square root of the variance.
The Variance may be thought of as the average squared deviation from the mean.
For a sample, divide by n-1. For a population, divide by N.
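
The dispersion measures above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name dispersion_summary and the dictionary keys are made up for illustration, and the data are the four example observations (2, 4, 5, 9) from the tables.

```python
import math


def dispersion_summary(data):
    """Return the mean, MAD, SS(x), sample variance and sample standard deviation.

    Hypothetical helper for illustration; follows the slide formulas directly.
    """
    n = len(data)
    mean = sum(data) / n
    deviations = [x - mean for x in data]          # x - x-bar; these sum to 0
    mad = sum(abs(d) for d in deviations) / n      # mean absolute deviation (divide by n)
    ss = sum(d * d for d in deviations)            # sum of squares, SS(x)
    variance = ss / (n - 1)                        # sample variance (divide by n-1)
    return {
        "mean": mean,
        "mad": mad,
        "ss": ss,
        "variance": variance,
        "std_dev": math.sqrt(variance),            # square root of the variance
    }


summary = dispersion_summary([2, 4, 5, 9])
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The sample variance and standard deviation computed here divide by n-1, as on the slide; Python's built-in statistics.variance and statistics.stdev use the same convention and should agree with these values.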
Formulas
$SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2 = \sum x^2 - \frac{(\sum x)^2}{n}$
Sample Variance: $s^2 = \frac{SS(x)}{n-1}$
Alternately: $s^2 = \frac{1}{n-1} \sum (x - \bar{x})^2 = \frac{\sum x^2 - n\bar{x}^2}{n-1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}$
Sample Standard Deviation: $s = \sqrt{s^2}$

Formulas
Population Variance: $\sigma^2 = \frac{1}{N} \sum (x - \mu)^2$
Population Standard Deviation: $\sigma = \sqrt{\frac{1}{N} \sum (x - \mu)^2}$

Example: Find the variance and standard deviation for the data {5, 7, 1, 3, 8}.
$\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$
                                                        Sum
$x$                  5      7      1       3      8     24
$x^2$                25     49     1       9      64    148
$x - \bar{x}$        0.2    2.2    -3.8    -1.8   3.2   0
$(x - \bar{x})^2$    0.04   4.84   14.44   3.24   10.24 32.80
$s^2 = \frac{1}{4}\left(148 - \frac{24^2}{5}\right) = 8.2$
$s^2 = \frac{1}{4}\left(148 - 5(4.8)^2\right) = 8.2$
$s^2 = \frac{1}{4}(32.8) = 8.2$
$s = \sqrt{8.2} = 2.86$

Interpretation of s
Need to get a sense of the meaning of different values of dispersion measures.
Are units same as data or squared?
Empirical Rule: 68%, 95%, 99.7%
Test of Normality
Range as estimator of s

z-Scores
Also “standardized scores” or just “standard scores.”
Expresses a quantity in terms of its distance from the mean in standard deviation units.
$z = \frac{\text{value} - \text{mean}}{\text{st.dev.}} = \frac{x - \bar{x}}{s}$

More z-Scores
The z-score measures the number of standard deviations away from the mean.
z-scores typically range from -3.00 to +3.00.
z-scores may be used to make comparisons of raw scores.
You can calculate back from z-score to raw data value by using the inverse:
$z = \frac{x - \bar{x}}{s} \;\Rightarrow\; sz = x - \bar{x} \;\Rightarrow\; x = \bar{x} + sz$

Percentiles
Values of the variable that divide a set of ranked data into 100 equal subsets.
– Each set of data has 99 percentiles.
– The kth percentile, Pk, is a value such that at most k% of the data are smaller than Pk and at most (100 - k)% are larger.

Procedure for finding Pk
1. Rank the n observations, lowest to highest.
2. Compute A = (nk)/100.
3. If A is an integer: d(Pk) = A.5 (depth). Pk is halfway between the value of the datum in the Ath position and the value of the next datum.
If A is a fraction: d(Pk) = B, the next larger integer. Pk is the value of the datum in the Bth position.
Some programs like Excel also do interpolation.

Quartiles
Like percentiles except dividing the data set into 4 equal subsets.
The first quartile, Q1, is the same as the 25th percentile, and the third quartile, Q3, is the same as the 75th percentile.
The second quartile is the 50th percentile, which is the median.
Sometimes finding Q1 and Q3 is described as finding the medians of the bottom half and top half of the data, respectively.

Five Number Summary
The Min, Q1, Median, Q3, and Max
Indicate how the data is spread out in each quarter.
Interquartile Range is the distance between Q1 and Q3.
The Midquartile is the average of Q1 and Q3, another measure of central tendency.

Box and Whisker Plots
[Figure: box-and-whisker plot "Weights from Sixth Grade Class"; horizontal axis: Weight, 60 to 110]

Hmmm…
What did the Box Plot say to the outlier?
“Don’t you dare get close to my whisker!”
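
The percentile procedure and the five number summary described earlier can be sketched in Python. This is an illustration under stated assumptions rather than a standard library routine: the function names percentile and five_number_summary are hypothetical, k is assumed to be a whole number from 1 to 99, and the data are the six "Cars Sold in One Week" counts from the categorical-data slide, reused here only as convenient numbers.

```python
def percentile(data, k):
    """k-th percentile (k = 1..99) using the depth rule from the slides.

    Hypothetical helper: A = n*k/100; if A is an integer, average the values
    in positions A and A+1; otherwise take the value in position B, the next
    larger integer.
    """
    ranked = sorted(data)              # step 1: rank lowest to highest
    n = len(ranked)
    if (n * k) % 100 == 0:             # A is an integer: depth is A.5
        a = n * k // 100
        return (ranked[a - 1] + ranked[a]) / 2
    b = n * k // 100 + 1               # A is a fraction: round up to B
    return ranked[b - 1]               # positions are 1-based, lists 0-based


def five_number_summary(data):
    """Return (Min, Q1, Median, Q3, Max) with quartiles taken as P25, P50, P75."""
    return (
        min(data),
        percentile(data, 25),
        percentile(data, 50),
        percentile(data, 75),
        max(data),
    )


cars_sold = [15, 23, 35, 11, 12, 42]   # Monday through Saturday
low, q1, median, q3, high = five_number_summary(cars_sold)
iqr = q3 - q1                          # interquartile range
midquartile = (q1 + q3) / 2            # average of Q1 and Q3
print(low, q1, median, q3, high, iqr, midquartile)
```

Because spreadsheet programs and statistics libraries often interpolate between ranked values, their quartiles may differ slightly from the ones this depth rule produces.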