
Statistics 601

6 Random Sampling and Data Description

Parameters are numerical characteristics of a population. Statistics are numerical quantities calculated from the sample. We will now use statistics to gain an understanding of the sample data. We will develop both graphical and numerical methods of summarizing data. Before we can use statistics to draw inferences about population parameters, we will need to learn about statistical models which have probability theory as their basis.

Chapter 6: Random Sampling and Data Description. Copyright © 2007 by Thomas E. Wehrly.

6.1 Pictorial and Tabular Methods in Descriptive Statistics

Consider the following data set: The concentration of suspended solids in river water is an important environmental characteristic. The paper “Water Quality in Agricultural Watershed: Impact of Riparian Vegetation During Base Flow” (Water Resources Bull., 1981, pp. 233-239) reported on concentrations (in parts per million, or ppm) for several different rivers. Suppose the following 50 observations had been obtained for a particular river.

55.8 45.9 83.2 75.3 60.7 60.9 39.1 40.0 71.4 77.1
37.0 35.5 31.7 65.2 59.1 91.3 56.0 36.7 52.6 49.5
65.8 44.6 62.3 58.2 69.3 42.3 71.7 47.3 48.0 69.8
33.8 61.2 94.6 61.8 64.9 60.6 61.5 56.3 78.8 27.1
76.0 47.2 30.0 39.8 87.1 69.0 74.5 68.2 65.0 66.3

Question: What do these data tell us about the concentration of suspended solids?

First few steps in analyzing a data set:
1. Organize and summarize the data.
2. Find the center of the data.
3. Examine the spread of the data.

6.2 Stem and Leaf Display

A compact and descriptive method of organizing data without losing any information in the data.
• Leading digits are stems.
• Trailing digits are leaves.
• Indicate units somewhere on the display.
• Option: Sort the leaves.
• Comparative stem & leaf.
• Repeat stems if need be.
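The conventions in the list above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `stem_and_leaf` and `render` are hypothetical, values are assumed non-negative, the leaf is taken to be the last digit after rounding to the leaf unit, and Python's `round` may treat exact ties (such as 35.5) differently from a hand-made display.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return {stem: string of sorted leaves}; the leaf is the final digit."""
    stems = defaultdict(list)
    for v in values:
        # Round to the leaf unit, then split the last digit off as the leaf.
        scaled = int(round(v / leaf_unit))
        stem, leaf = divmod(scaled, 10)
        stems[stem].append(leaf)
    # Sort the stems, and sort the leaves within each stem.
    return {s: "".join(str(d) for d in sorted(leaves))
            for s, leaves in sorted(stems.items())}

def render(display, units):
    """Format the display as text and state the units, as the slides advise."""
    lines = [f"{stem} : {leaves}" for stem, leaves in display.items()]
    lines.append(f"units: {units}")
    return "\n".join(lines)

# A handful of the solids observations, chosen to avoid rounding ties.
sample = [27.1, 31.7, 33.8, 40.0, 42.3, 44.6, 52.6, 58.2]
print(render(stem_and_leaf(sample), "ppm"))
```

Stems with no observations are skipped in this sketch; a hand-drawn display would normally still list them so that gaps in the data remain visible.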
Advantages:
• No loss of information.
• Easy to do for small data sets.

Disadvantages:
• Time consuming for large data sets (by hand).
• Cannot be used for categorical data.
• Very space consuming for large data sets.

Stem-and-leaf display of the solids data set with sorted leaves:

2 : 7
3 : 0245779
4 : 002567789
5 : 366689
6 : 111112255566899
7 : 01245679
8 : 37
9 : 15
units: ppm

Stem-and-leaf display with two stems per tens place:

2*: 7
3 : 024
3*: 5779
4 : 002
4*: 567789
5 : 3
5*: 66689
6 : 1111122
6*: 55566899
7 : 0124
7*: 5679
8 : 3
8*: 7
9 : 1
9*: 5
units: ppm

Comparative stem-and-leaf display on the solids data set taken two years earlier:

Two Years Ago          Current
------------------------------------
            8 : 1 :
         9851 : 2 : 7
      9887640 : 3 : 0245779
9997765322111 : 4 : 002567789
    877554200 : 5 : 366689
   9887653221 : 6 : 111112255566899
        72210 : 7 : 01245679
           95 : 8 : 37
              : 9 : 15
units: ppm

Sometimes we redefine the leaves for low-numbered or "narrow" data sets:

58, 58, 57, 54, 54, 54, 57, 57, 56, 56, 57, 51, 58, 54, 52, . . . , 52, 54

60 : 0
59 : 00
58 : 00000000000
57 : 0000000000
56 : 0000000000
55 : 0000000000000
54 : 0000000000000
53 : 0000
52 : 000
51 : 0

6.3 Frequency Distributions for Quantitative Data

A very popular way to summarize data is with a frequency distribution. A frequency distribution is a compact summary of a data set using a table with 3 or 4 columns:

Class interval (or category): disjoint intervals containing all observations in the data set.
Frequency: number of observations in a class interval, $f$.
Relative frequency: proportion of observations in the interval, $f/n$.
Cumulative frequency: sum of the relative frequencies up to and including the class, $\sum_{i=1}^{\text{class}} f_i/n$.
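As a rough sketch of how these columns can be computed, the following Python builds a frequency distribution for the 50 solids observations from Section 6.1. The function name `frequency_table`, the half-open classes [lower, upper), and the choice of eight classes of width 10 starting at 20 are assumptions made for illustration.

```python
import math

solids = [
    55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1,
    37.0, 35.5, 31.7, 65.2, 59.1, 91.3, 56.0, 36.7, 52.6, 49.5,
    65.8, 44.6, 62.3, 58.2, 69.3, 42.3, 71.7, 47.3, 48.0, 69.8,
    33.8, 61.2, 94.6, 61.8, 64.9, 60.6, 61.5, 56.3, 78.8, 27.1,
    76.0, 47.2, 30.0, 39.8, 87.1, 69.0, 74.5, 68.2, 65.0, 66.3,
]

def frequency_table(data, start, width, n_classes):
    """Rows of (lower, upper, f, f/n, cumulative f/n) for classes [lower, upper)."""
    n = len(data)
    counts = [0] * n_classes
    for x in data:
        k = math.floor((x - start) / width)  # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    rows, running = [], 0
    for k, f in enumerate(counts):
        running += f  # running count, so running / n is the cumulative column
        rows.append((start + k * width, start + (k + 1) * width,
                     f, f / n, running / n))
    return rows

rows = frequency_table(solids, start=20, width=10, n_classes=8)
for lower, upper, f, rel, cum in rows:
    print(f"{lower:>3}-{upper:<3} {f:>3} {rel:6.2f} {cum:6.2f}")
```

The last cumulative entry should always come out as 1, which is a quick check that the classes really do cover every observation.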
Having too many intervals leads to a very jagged histogram. Having too few intervals smooths away important features. The number of classes is usually 5 to 20. Use at least $(2n)^{1/3}$ for a rough idea.

We will form a frequency distribution for the solids data set (the 50 observations listed in Section 6.1).

50 observations. Approximate number of classes: $\sqrt{50} = 7.07$.

Class Interval   Frequency   Relative f   Cumulative f
20–29.9               1         .02           .02
30–39.9               8         .16           .18
40–49.9               8         .16           .34
50–59.9               6         .12           .46
60–69.9              16         .32           .78
70–79.9               7         .14           .92
80–89.9               2         .04           .96
90–99.9               2         .04          1.00

6.4 Histogram

A histogram is a pictorial representation of a frequency distribution.
1. Draw an x-axis and mark class intervals.
2. Over each class interval, draw a rectangle whose area is proportional to the frequency of that interval.

[Figure: frequency histogram of the solids data; x-axis: solids, 20 to 100; y-axis: frequency, 0 to 15]

A true histogram, that is, one drawn on a density scale, has total area equal to 1.0. In that case we make

Rectangle height = relative frequency / base length.

In the case where all the intervals are of equal length, all we need to do is add the appropriately labeled y-axis.

[Figure: density-scale histogram of the solids data; x-axis: solids, 20 to 100; y-axis: 0.0 to 0.030]

Histograms often exhibit particular shapes:
• unimodal
• bimodal
• multimodal
• symmetric
• positively skewed
• negatively skewed
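The density-scale rule (rectangle height = relative frequency / base length) can be checked numerically. The sketch below uses the class frequencies of the solids data; the helper names `density_heights` and `total_area` are hypothetical, and the second set of classes, which merges the two top intervals, is an invented illustration of unequal base lengths.

```python
import math

def density_heights(edges, counts):
    """Height of each rectangle = relative frequency / base length."""
    n = sum(counts)
    return [f / n / (hi - lo) for lo, hi, f in zip(edges, edges[1:], counts)]

def total_area(edges, heights):
    """Sum of height * base length over all rectangles."""
    return sum(h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:]))

# Equal-width classes 20-30, 30-40, ..., 90-100 for the solids data.
edges = [20, 30, 40, 50, 60, 70, 80, 90, 100]
counts = [1, 8, 8, 6, 16, 7, 2, 2]
heights = density_heights(edges, counts)
print([round(h, 3) for h in heights], total_area(edges, heights))

# Unequal widths (illustration only): merge 80-90 and 90-100 into 80-100.
edges2 = [20, 30, 40, 50, 60, 70, 80, 100]
counts2 = [1, 8, 8, 6, 16, 7, 4]
heights2 = density_heights(edges2, counts2)
print([round(h, 3) for h in heights2], total_area(edges2, heights2))

# Rules of thumb for the number of classes with n = 50.
n = sum(counts)
print(round(math.sqrt(n), 2), round((2 * n) ** (1 / 3), 2))
```

With unequal base lengths the heights, not the raw frequencies, are what keep the areas proportional to the class frequencies, which is why the merged class is drawn no taller than the two classes it replaces.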
6.5 Measures of Location

Another step in gaining understanding of our data is to find the “center” of our data. What is the center?

6.6 Mean / Average

We calculate the sample mean or average as follows:

$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

$x_i$: the ith observation in the sample.
$n$: sample size.

Example: Calculate the average concentration of solids.

$\bar{x} = \frac{1}{n} \sum_{i=1}^{50} x_i = \frac{1}{50} \times 2927 = 58.54$

6.7 Median

Median: The middle observation of the sorted data set. Sample median = $\tilde{x}$.

We calculate the median:

n odd: $\tilde{x} = x_{((n+1)/2)}$
n even: $\tilde{x} = (x_{(n/2)} + x_{((n+2)/2)})/2$

Example: Calculate the median of the solid concentrations. The sorted data, reading down each column in turn, are:

27.1 42.3 56.3 64.9 71.7
30.0 44.6 58.2 65.0 74.5
31.7 45.9 59.1 65.2 75.3
33.8 47.2 60.6 65.8 76.0
35.5 47.3 60.7 66.3 77.1
36.7 48.0 60.9 68.2 78.8
37.0 49.5 61.2 69.0 83.2
39.1 52.6 61.5 69.3 87.1
39.8 55.8 61.8 69.8 91.3
40.0 56.0 62.3 71.4 94.6

$n/2 = 25$, $\tilde{x} = (x_{(25)} + x_{(26)})/2 = (60.7 + 60.9)/2 = 60.8$

Discussion: How do outliers affect the mean and median?

6.8 Other Measures of Location

6.8.1 Trimmed Mean

A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$ in that outliers will have some effect on the trimmed mean but not as much as they have on the mean. It is calculated by eliminating a certain percentage of the observations from both ends and calculating the average of the remaining data. For example, a 10% trimmed mean would eliminate 10% of the observations from each end of the data (20% total) and average the remaining 80% of the observations.

Example: Calculate the 10% trimmed mean for the solid concentrations. We have n = 50 observations. 10% of this is 50 × 0.10 = 5.
Therefore we eliminate 5 observations from each end, for a total of 10 observations, and average the remaining 40 sorted observations:

10% trimmed mean $= \frac{1}{40} \sum_{i=6}^{45} x_{(i)} = \frac{1}{40} \times 2333.9 = 58.3475$

6.8.2 Percentiles and Quartiles

The 100pth percentile is the value in our data set such that 100p% of the observations are less than or equal to it. The median is the 50th percentile. The following is a general approach to calculate the 100pth percentile $x_{[p]}$:

1. Let $x_{(i)}$, $i = 1, \ldots, n$, refer to our data set in ascending order.
2. Let $i_p = np$.
3. Find the first index $i$ such that $i > i_p$.
4. The 100pth percentile is then:

$x_{[p]} = \begin{cases} \frac{x_{(i-1)} + x_{(i)}}{2} & \text{if } i - 1 = i_p \\ x_{(i)} & \text{otherwise} \end{cases}$

In short: If $i_p$ is an integer, we average the $i_p$th and $(i_p + 1)$th observations. Otherwise we round $i_p$ up and take the $\lceil i_p \rceil$th observation.

Q1: First Quartile = 25th percentile
Q2: Second Quartile = 50th percentile
Q3: Third Quartile = 75th percentile
IQR = Q3 − Q1 = “Interquartile Range”

We can calculate quartiles by using our rules for finding the median. We consider two cases:
• n even:
  – To obtain Q1, find the median of $x_{(1)}, \ldots, x_{(n/2)}$.
  – To obtain Q3, find the median of $x_{((n/2)+1)}, \ldots, x_{(n)}$.
• n odd:
  – To obtain Q1, find the median of $x_{(1)}, \ldots, x_{((n+1)/2)}$.
  – To obtain Q3, find the median of $x_{((n+1)/2)}, \ldots, x_{(n)}$.

Example: Calculate Q1 and Q3 for the solid concentrations, using the sorted data from the median example in Section 6.7.

Q1 = median of $\{x_{(1)}, \ldots, x_{(25)}\} = x_{(13)} = 45.9$
Q3 = median of $\{x_{(26)}, \ldots, x_{(50)}\} = x_{(38)} = 69.3$

Example: Calculate Q1 and Q3 for the values {2, 4, 9, 17, 22, 43, 65, 88, 103}.

n = 9
Q1 = median of $\{x_{(1)}, \ldots, x_{(5)}\} = x_{(3)} = 9$
Q3 = median of $\{x_{(5)}, \ldots, x_{(9)}\} = x_{(7)} = 65$

6.8.3 Boxplots

Box plots are useful in summarizing various aspects of the data. Side-by-side box plots provide useful comparisons of two or more sets of data.

1. Form an axis that includes all possible values of the data.
2. Draw a box extending from Q1 to Q3.
3. Draw a vertical bar at the median.
4. Draw whiskers (horizontal lines) to the most extreme observation within 1.5 IQR from each end of the box.
5. Indicate mild outliers with a “◦” (by the usual convention, observations more than 1.5 IQR but at most 3 IQR from the nearest end of the box).
6. Indicate extreme outliers with a “∗” (observations more than 3 IQR from the nearest end of the box).

Example: Calculate the summary statistics $\bar{x}$, $\tilde{x}$, Q1, Q3 for the water quality data set. Then construct a box plot.

$\bar{x}$ = 58.54
Min = 27.10
Q1 = 45.9
$\tilde{x}$ = 60.80
Q3 = 69.3
Max = 94.60
IQR = 69.3 − 45.9 = 23.4

[Figure: box plot titled “Particulate Matter”; axis: solids (ppm)]

6.9 Measures of Variability

The mean, median, etc. do not give us a complete overview (summary) of our data. For example, consider the following three data sets:

                            Measures of Spread
    Data                 Range   IQR   s^2      s
1:  20 30 40 50 60 70      50     30   350   18.71
2:  20 43 44 46 47 70      50      4   252   15.87
3:  40 43 44 46 47 50      10      4    12    3.46

– The mean and median are 45 for all three data sets.
– These data sets have very different spreads.

Ways to measure spread:

Range: range = maximum observation − minimum observation

Interquartile Range: IQR = Q3 − Q1

Average Deviation from the Mean: We define the ith deviation to be $x_i - \bar{x}$. We average the deviations:
Intuitive: $\frac{1}{n} \sum (x_i - \bar{x})$

Problem: this does not give us anything useful!

$\frac{1}{n} \sum (x_i - \bar{x}) = \frac{1}{n} \sum x_i - \frac{1}{n} \sum \bar{x} = \frac{1}{n} \sum x_i - \frac{1}{n} n\bar{x} = \bar{x} - \bar{x} = 0$

The result is always equal to zero!

Variance: We average the squared deviations from the mean, dividing by n − 1 instead of n, to get a measure of spread called the sample variance:

$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$

Calculation formula:

$s^2 = \frac{1}{n-1} \left( \sum x_i^2 - \frac{\left( \sum x_i \right)^2}{n} \right)$

Standard Deviation: The units of the variance are the units of the data squared. To make the units the same as those of the data set, we take the square root of the variance. This is called the sample standard deviation:

$s = \sqrt{s^2}$

Note:
s is translation invariant: $s(x_1, \ldots, x_n) = s(x_1 + a, \ldots, x_n + a)$ for all a.
s is scale equivariant: $s(ax_1, \ldots, ax_n) = |a| \, s(x_1, \ldots, x_n)$ for all a.

Example: Calculate the range, variance and standard deviation of the particulate solid data.

Range = Maximum − Minimum = 94.6 − 27.1 = 67.5
$s^2 = 270.8469$
$s = \sqrt{270.8469} = 16.4574$
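The numerical summaries of this chapter can be recomputed for the solids data with a short script. The sketch below follows the slide rules for the median, trimmed mean, and percentiles, and uses the calculation formula for the variance. All function names are hypothetical, the percentile helper assumes 0 < p < 1, and the trimmed mean assumes the number trimmed from each end is (close to) a whole number.

```python
import math

solids = [
    55.8, 45.9, 83.2, 75.3, 60.7, 60.9, 39.1, 40.0, 71.4, 77.1,
    37.0, 35.5, 31.7, 65.2, 59.1, 91.3, 56.0, 36.7, 52.6, 49.5,
    65.8, 44.6, 62.3, 58.2, 69.3, 42.3, 71.7, 47.3, 48.0, 69.8,
    33.8, 61.2, 94.6, 61.8, 64.9, 60.6, 61.5, 56.3, 78.8, 27.1,
    76.0, 47.2, 30.0, 39.8, 87.1, 69.0, 74.5, 68.2, 65.0, 66.3,
]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n, mid = len(s), len(s) // 2
    # n odd: the middle value; n even: the average of the two middle values.
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def trimmed_mean(xs, fraction):
    """Drop `fraction` of the observations from each end, then average."""
    s = sorted(xs)
    k = round(len(s) * fraction)  # observations removed per end
    return mean(s[k:len(s) - k])

def percentile(xs, p):
    """100p-th percentile by the rule of Section 6.8.2, for 0 < p < 1."""
    s = sorted(xs)
    ip = len(s) * p
    if ip == int(ip):             # np is an integer: average x_(ip), x_(ip+1)
        i = int(ip)
        return (s[i - 1] + s[i]) / 2
    return s[math.ceil(ip) - 1]   # otherwise take x_(ceil(np))

def sample_variance(xs):
    """Calculation formula: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

q1, q3 = percentile(solids, 0.25), percentile(solids, 0.75)
s2 = sample_variance(solids)
print("mean", round(mean(solids), 4), "median", round(median(solids), 4))
print("10% trimmed mean", round(trimmed_mean(solids, 0.10), 4))
print("Q1", q1, "Q3", q3, "IQR", round(q3 - q1, 4))
print("range", round(max(solids) - min(solids), 4))
print("s^2", round(s2, 4), "s", round(math.sqrt(s2), 4))
```

Adding a constant to every observation and rerunning the script is a simple way to see the translation invariance of s: the mean and quartiles shift by the constant, while the range, IQR, variance, and standard deviation stay the same.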
