Chapter 6 - Random Sampling and Data Description
Experience the joy of dealing with large quantities of data

Chapter 6A
This Week in Prob/Stat

Today's Discussion + Bonus Material
Descriptive Statistics
• Distributions
  • Histogram
  • Cumulative frequency distribution
  • Frequency distribution (continuous data)
• Measures of Central Tendency (location)
  • Mean
  • Median
  • Mode
• Measures of Variability (dispersion)
  • Variance (standard deviation)
  • Range
  • Quartiles
  • Coefficient of Variation
• Measures of Skewness
• Measures of Kurtosis

Histograms
• Data are placed into class intervals, cells, or bins (synonyms).
• Continuous data: number of bins ~ sqrt(nobs), or use Sturges' rule.
• The histogram shows the relative frequency of the sample observations in each class.
• Histogram ~ probability density (or mass) function.
• By summing the counts over successive bins you can construct a cumulative frequency plot.
• Cumulative frequency plot ~ empirical distribution function ~ cumulative distribution function.

A Discrete Example
Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.
week    1   2   3   4   5   6   7   8   9   10
Mon     4   0   5   1   3   2   1   7   3   2
Tues    3   3   2   1   4   0   4   2   3   1
Wed     1   0   0   0   2   0   4   2   1   1
Thur    1   0   2   1   3   1   5   3   4   2
Fri     4   1   0   1   3   2   2   6   1   2

Bin   Frequency   Cumulative %
0         8          16.00%
1        13          42.00%
2        11          64.00%
3         8          80.00%
4         6          92.00%
5         2          96.00%
6         1          98.00%
7         1         100.00%

[Figure: histogram of relative frequency (0 to 0.3) against number of claims (0 to 7)]

A Discrete Empirical Cumulative Frequency Distribution

Number of claims   Cumulative frequency
0 ≤ x < 1              16%
1 ≤ x < 2              42%
2 ≤ x < 3              64%
3 ≤ x < 4              80%
4 ≤ x < 5              92%
5 ≤ x < 6              96%
6 ≤ x < 7              98%
7 ≤ x < ∞             100%

A Discrete Empirical Cumulative Frequency Distribution Graph
[Figure: cumulative % (0% to 100%) against number of claims (0 to 8)]

A Continuous Data Example
Raw data: time, in hours, to repair or replace a failed transformer by the Dayton Power and Light Company. The industry standard is 2.5 hours.

2.2  1.7  2.4  2.5  2.9  4.4  5.0  1.8  3.2  2.2  1.9  4.0
5.0  5.6  2.5  4.5  3.7  4.3  4.3  2.4  3.6  2.5  1.9  2.7
4.5  3.3  3.9  2.7  3.0  3.9  3.3  2.0  1.6  1.6  2.9

Data collection: 35 repairs performed between 01/01/07 and 06/30/07.

Sturges' rule for grouping data
$k = 1 + 3.3 \log_{10} n$
where k = number of classes, n = sample size, and [x] = integer part of x.
For example (the tabulated values of k and sqrt(n) are rounded to the nearest integer):

n          35   50   500   5000
k           6    7    10     13
sqrt(n)     6    7    22     71

A Histogram
The data were generated from a lognormal distribution.

Bin           Relative frequency
x <= 1            0
1 < x <= 2        0.2
2 < x <= 3        0.34286
3 < x <= 4        0.22857
4 < x <= 5        0.17143
5 < x             0.05714

[Figure: histogram of relative frequency (0 to 0.4) by bin; x-axis: transformer repair times in hours]

Frequency Polygon
[Figure: frequency polygon, relative frequency (0 to 0.4) against repair time in hours (0 to 8)]

Cumulative Frequency Distribution - ogive
[Figure: cumulative frequency (0% to 100%) against repair times (0 to 7)]

Measures of Central Tendency, i.e. averages
Seeking the middle ground

Types of Data
• Nominal (also categorical or discrete), e.g. group employees by job type
  • the only comparisons are equality and inequality
  • no "less than" or "greater than" relations among the classifying names
  • no operations such as addition or subtraction
• Ordinal, e.g. rank colleges based on surveys and interviews
  • the numbers assigned to objects represent the rank order (1st, 2nd, 3rd, etc.) of the entities measured
  • comparisons of greater and less can be made, in addition to equality and inequality
• Interval, e.g. temperature, IQ measurements
  • have all the features of ordinal measurements
  • equal differences between measurements represent equivalent intervals
  • operations such as addition and subtraction are therefore meaningful
• Ratio, e.g. group travel times into intervals
  • have all the features of interval measurements
  • operations such as multiplication and division are therefore meaningful
  • the zero value on a ratio scale is non-arbitrary

6-1 Numerical Summaries
Definition: Sample Mean
If the n observations in a sample are denoted $x_1, x_2, \dots, x_n$, the sample mean is
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Characteristics of the mean
• most widely known and used average
• an artificial concept, since it may not coincide with any actual value
• affected by the value of every item, and therefore uses all the information available in the sample
• highly influenced by extreme values
• can be computed directly from the raw data, e.g. the data do not need to be sorted as they do for the median
• requires interval or ratio data
• lends itself better to algebraic analysis than other measures of central tendency
• has some desirable statistical properties
• answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same total?"

Example 6-1

Figure 6-1 The sample mean as a balance point for a system of weights.

Population Mean
For a finite population with N measurements, the mean is
$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
The sample mean is a reasonable estimate of the population mean.

Sample Median
The median is a measure of central tendency such that half of the values in a sample are below it and half are above it. If the number of observations is even, then average the two central values.
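The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the 35 transformer repair times listed earlier as input; the function names `sample_mean` and `sample_median` are illustrative and do not come from the slides.

```python
def sample_mean(xs):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(xs) / len(xs)


def sample_median(xs):
    """Middle value of the sorted data; the average of the two central
    values when the number of observations is even."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


# Transformer repair times in hours (the 35 raw observations from the slides).
repair_times = [
    2.2, 1.7, 2.4, 2.5, 2.9, 4.4, 5.0, 1.8, 3.2, 2.2, 1.9, 4.0,
    5.0, 5.6, 2.5, 4.5, 3.7, 4.3, 4.3, 2.4, 3.6, 2.5, 1.9, 2.7,
    4.5, 3.3, 3.9, 2.7, 3.0, 3.9, 3.3, 2.0, 1.6, 1.6, 2.9,
]

mean_hours = sample_mean(repair_times)
median_hours = sample_median(repair_times)
print(f"n = {len(repair_times)}, mean = {mean_hours:.2f}, median = {median_hours:.1f}")
```

For these data the mean comes out near 3.14 hours and the median at 2.9 hours, so the mean sits above the median, as one would expect when a few long repairs pull the average upward.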
The sample median is less influenced by outliers than the sample mean.
• Not affected by extreme values
  • affected by the number, but not the value, of the extremes
• Widely used in skewed distributions where the mean would be distorted by extreme values
  • e.g. economic data
• Can be used where the data are ranked but not measured quantitatively
• Unreliable if the data do not cluster at the center of the distribution

Order Statistics
Define
$X_{(1)} = \min\{X_1, X_2, \dots, X_n\}$
$X_{(2)}$ = 2nd smallest of $\{X_1, X_2, \dots, X_n\}$
$X_{(i)}$ = ith smallest of $\{X_1, X_2, \dots, X_n\}$
$X_{(n)} = \max\{X_1, X_2, \dots, X_n\}$
Therefore $X_{(1)} \le X_{(2)} \le X_{(3)} \le \dots \le X_{(n)}$ and
$X_{med} = \begin{cases} X_{(k)} & \text{if } n = 2k - 1 \text{ is odd} \\ \dfrac{X_{(k)} + X_{(k+1)}}{2} & \text{if } n = 2k \text{ is even} \end{cases}$

Median Repair Time
Raw data: time, in hours, to repair or replace a failed transformer by the Dayton Power and Light Company. The industry standard is 2.5 hours.

2.2  1.7  2.4  2.5  2.9  4.4  5.0  1.8  3.2  2.2  1.9  4.0
5.0  5.6  2.5  4.5  3.7  4.3  4.3  2.4  3.6  2.5  1.9  2.7
4.5  3.3  3.9  2.7  3.0  3.9  3.3  2.0  1.6  1.6  2.9

Sorted data:

Observation    1    2    3    4    5    6    7    8    9   10   11   12
Value        1.6  1.6  1.7  1.8  1.9  1.9  2.0  2.2  2.2  2.4  2.4  2.5

Observation   13   14   15   16   17   18   19   20   21   22   23   24
Value        2.5  2.5  2.7  2.7  2.9  2.9  3.0  3.2  3.3  3.3  3.6  3.7

Observation   25   26   27   28   29   30   31   32   33   34   35
Value        3.9  3.9  4.0  4.3  4.3  4.4  4.5  4.5  5.0  5.0  5.6

With n = 35 = 2(18) - 1, the median is the 18th ordered observation, $X_{(18)} = 2.9$ hours.

Median of an even number of observations

Raw data:
Observation       1       2       3       4       5       6       7       8
Value         27.40    9.08  165.29  214.85   98.70   76.07    9.87   77.96

Observation       9      10      11      12      13      14      15      16
Value         15.01   49.86    1.18  188.07  317.26   59.79  384.63   48.74

Sorted data:
Observation       1       2       3       4       5       6       7       8
Value          1.18    9.08    9.87   15.01   27.40   48.74   49.86   59.79

Observation       9      10      11      12      13      14      15      16
Value         76.07   77.96   98.70  165.29  188.07  214.85  317.26  384.63

Average the middle two observations: (59.79 + 76.07) / 2 = 67.93

Good use of the median
Constructively Yours is a small, privately owned and operated business that specializes in small residential construction and remodeling projects. In addition to the owner-president, the company employs 8 other workers.
Position         Annual salary
receptionist        $22,050
worker 1            $28,175
worker 2            $29,500
worker 3            $31,450
worker 4            $32,800
salesperson 1       $34,150
salesperson 2       $38,000
job foreman         $43,200
President          $230,000
mean                $54,369
median              $32,800

Warning: The above salary information is confidential and proprietary and should not be disclosed beyond its use in the classroom.

Is the Median Representative?
The Makit Company is a small job shop that primarily employs machine operators and engineers.

Position       Annual salary
clerk             $18,400
Machinist 1       $28,175
Machinist 2       $29,500
Machinist 3       $31,450
Machinist 4       $32,800
Machinist 5       $34,150
Machinist 6       $34,200
Machinist 7       $35,500
Machinist 8       $36,100
Engineer 1        $68,500
Engineer 2        $78,230
Engineer 3        $85,400
Engineer 4        $90,100
mean              $46,347
median            $34,200

Warning: The above salary information is confidential and proprietary and should not be disclosed beyond its use in the classroom.

Mode
• The most frequent value assumed by a random variable or occurring in a sample. The term is applied both to probability distributions and to collections of data.
• The mode is not necessarily unique, since the same maximum frequency may be attained at different values. The worst case is given by the uniform distributions, in which all values are equally likely.
• For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
• The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value, i.e. the value that is most likely to be sampled.
• The mode of a continuous probability distribution is the value x at which its probability density function attains its maximum value.
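A short Python sketch of the sample mode follows. It returns every value that attains the maximum frequency, which makes the non-uniqueness mentioned above visible. The function name `sample_modes` and the second sample are illustrative assumptions, not part of the slides.

```python
from collections import Counter


def sample_modes(xs):
    """Return every value that attains the maximum frequency, in sorted order."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


# Sample quoted in the text: 6 occurs four times, more than any other value.
print(sample_modes([1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]))  # [6]

# Hypothetical sample (not from the slides) in which the mode is not unique.
print(sample_modes([1, 1, 2, 2, 3]))  # [1, 2]
```

When every value occurs equally often, as in a sample with no repeats, the function returns all of them, which mirrors the uniform "worst case".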
• Not affected by extreme values
• Can be computed from nominal data

Example: Sample Mode
Raw data: the number of accident claims received per day over the last 50 days by the Nofrills Insurance Co.

week    1   2   3   4   5   6   7   8   9   10
Mon     4   0   5   1   3   2   1   7   3   2
Tues    3   3   2   1   4   0   4   2   3   1
Wed     1   0   0   0   2   0   4   2   1   1
Thur    1   0   2   1   3   1   5   3   4   2
Fri     4   1   0   1   3   2   2   6   1   2

Bin   Frequency
0         8
1        13
2        11
3         8
4         6
5         2
6         1
7         1

Mode = 1

Geometric mean
$\sqrt[n]{x_1 x_2 \cdots x_n}$
• Used to determine "average factors".
• The geometric mean is smaller than or equal to the arithmetic mean; the two means are equal if and only if all members of the data set are equal.
• Allows the definition of the arithmetic-geometric mean, a mixture of the two which always lies in between.
• If a stock rose 10% in the first year, 20% in the second year and fell 15% in the third year, then compute the geometric mean of the factors 1.10, 1.20 and 0.85 as $(1.10 \times 1.20 \times 0.85)^{1/3} = 1.0391\ldots$ and conclude that the stock rose 3.91 percent per year, on average.
• Answers the question, "if all the quantities had the same value, what would that value have to be in order to achieve the same product?"

Harmonic mean
$\dfrac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$
• Appropriate for situations in which the average of rates is desired.
• If for half the distance of a trip you travel at 40 mph and for the other half of the distance you travel at 60 mph, then your average speed for the trip is given by the harmonic mean of 40 and 60, which is 48; that is, the total amount of time for the trip is the same as if you had traveled the entire trip at 48 mph.
• If you had traveled for half the time at one speed and the other half at another, the arithmetic mean, in this case 50 mph, would provide the correct average.
• In finance, it is used to calculate the average cost of shares purchased over a period of time. Suppose an investor purchases $1000 worth of stock every month for three months.
If the spot prices at execution time are $8, $9, and $10, then the average price the investor paid is $8.926 per share. However, if the investor purchased 1000 shares per month, the arithmetic mean would be used.

Midrange and beyond
$\dfrac{x_{min} + x_{max}}{2}$
• The midrange is highly sensitive to outliers and ignores all but two data points; therefore it is rarely used in statistical analysis.
• While the mean of a set of values minimizes the sum of squares of deviations and the median minimizes the average absolute deviation, the midrange minimizes the maximum deviation.
• For a given data set of positive values, the harmonic mean is always the least of the three, the arithmetic mean is always the greatest of the three, and the geometric mean is always in between.

Measures of Dispersion
The search for variability

Definition: Sample Variance
If $x_1, x_2, \dots, x_n$ is a sample of n observations, the sample variance is
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
and the sample standard deviation, s, is the positive square root of the sample variance.

How Does the Sample Variance Measure Variability?
Figure 6-2 How the sample variance measures variability through the deviations $x_i - \bar{x}$.

Example 6-2
Table 6-1

Computational Form of $s^2$
$s^2 = \dfrac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}{n - 1}$

Population Variance
When the population is finite and consists of N values, we may define the population variance as
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The sample variance is a reasonable estimate of the population variance.

Homing in on the Sample Range
The sample range is the difference between the largest and smallest observations, $r = \max(x_i) - \min(x_i)$.

Example measures
Raw data: time, in hours, to repair or replace a failed transformer by the Dayton Power and Light Company.

2.2  1.7  2.4  2.5  2.9  4.4  5.0  1.8  3.2  2.2  1.9  4.0
5.0  5.6  2.5  4.5  3.7  4.3  4.3  2.4  3.6  2.5  1.9  2.7
4.5  3.3  3.9  2.7  3.0  3.9  3.3  2.0  1.6  1.6  2.9

min        1.6
max        5.6
mean       3.1
median     2.9
std dev    1.10
range      4.0 = 5.6 - 1.6

Quartiles
A quartile is any of the three values which divide the sorted data set into four equal parts, so that each part represents one-fourth of the sampled population.
Thus:
• first quartile (designated Q1) = lower quartile = cuts off the lowest 25% of the data = 25th percentile
• second quartile (designated Q2) = median = cuts the data set in half = 50th percentile
• third quartile (designated Q3) = upper quartile = cuts off the highest 25% of the data, or the lowest 75% = 75th percentile
The difference between the upper and lower quartiles is called the interquartile range.

More Data Features
When an ordered set of data is divided into four equal parts, the division points are called quartiles.
• The first or lower quartile, q1, is a value that has approximately one-fourth (25%) of the observations below it and approximately 75% of the observations above.
• The second quartile, q2, has approximately one-half (50%) of the observations below its value. The second quartile is exactly equal to the median.
• The third or upper quartile, q3, has approximately three-fourths (75%) of the observations below its value.
As in the case of the median, the quartiles may not be unique.

6-2 Example of Data Features
The compressive strength data in Table 6-2 contain n = 80 observations. Minitab software calculates the first and third quartiles as the (n + 1)/4 and 3(n + 1)/4 ordered observations and interpolates as needed. For example, (80 + 1)/4 = 20.25 and 3(80 + 1)/4 = 60.75. Therefore, Minitab interpolates between the 20th and 21st ordered observations to obtain q1 = 143.50 and between the 60th and 61st observations to obtain q3 = 181.00.

Data Features
• The interquartile range is the difference between the upper and lower quartiles, and it is sometimes used as a measure of variability.
• In general, the 100kth percentile is a data value such that approximately 100k% of the observations are at or below this value and approximately 100(1 - k)% of them are above it.

Examples in Variability
Professor Higgins has experienced considerable variability in his driving time from home to the University. Help the good professor find a measure of his variability.
Driving times in minutes (raw values):
44.0  41.7  34.6  44.8  21.8  26.0  28.9  27.9
38.9  37.0  23.3  32.5  45.9  38.9  42.4  31.4

Sorted driving times in minutes:
Observation      1     2     3     4     5     6     7     8
Value         21.8  23.3  26.0  27.9  28.9  31.4  32.5  34.6

Observation      9    10    11    12    13    14    15    16
Value         37.0  38.9  38.9  41.7  42.4  44.0  44.8  45.9

Quartiles: Q1, Q2, Q3

variance               62.1
std dev                7.88
range                  24.1
interquartile range    13.8
true median            35.8
mean                   35.0

Coefficient of Variation
$CV = 100 \, \dfrac{s}{\bar{X}}$
Where should the Vary A. Schun Company direct its efforts to reduce the variability in its production lead-time?

Data source          mean    std dev    CV
Gear assembly
  Jeff               1.65    0.088      5.33
  Jerry              1.73    0.075      4.34
Housing assembly
  Judy               4.23    1.02      24.11
  Jared              5.67    0.99      17.46
  Julie              4.78    0.85      17.78
Final assembly
  Jane              34.56    2.45       7.09
  Jim               37.58    2.05       5.46
  John              32.1     2.11       6.57

Unit production times in minutes.

Real Bonus Material
Descriptive Statistics for the Overachieving Student

Skewness and Kurtosis
Moments about the mean:
$M_j = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^j}{n}, \quad j = 1, 2, 3, 4$
$\hat{\beta}_1 = \dfrac{M_3}{M_2^{3/2}} = \dfrac{M_3}{\hat{\sigma}^3}$
$\hat{\beta}_2 = \dfrac{M_4}{M_2^{2}} = \dfrac{M_4}{\hat{\sigma}^4}$
where $\hat{\sigma}^2 = M_2$.
• For example, the variance is the second moment about the mean.
• $\beta_1$ is skewness, based on the third moment about the mean.
• $\beta_2$ is kurtosis, based on the fourth moment about the mean.
• Note how a power of the sample variance is used to "standardize" the $\beta_1$ and $\beta_2$ estimates.

Skewness
• Measures the direction and degree of departure from symmetry.
• If the distribution is perfectly symmetrical, the measure of skewness will be zero, e.g. the normal distribution and the uniform (rectangular) distribution.
• If the distribution is asymmetrical (i.e. skewed), the tail of the distribution will extend in the direction of the positive (negative) numbers if the measure of skewness is positive (negative).
[Figure: two skewed distributions] Both distributions have the same expectation and variance. The one on the left is positively skewed. The one on the right is negatively skewed.

Kurtosis
The extent of peakedness in the distribution
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution (kurtosis = 3, mesokurtic).
• Data with high kurtosis (above 3, i.e. positive excess kurtosis) tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails.
• Data with low kurtosis (below 3, i.e. negative excess kurtosis) tend to have a flat top near the mean. A uniform distribution would be the extreme case.
• Higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly sized deviations.
• If a random variable's kurtosis is greater than 3, it is said to be leptokurtic. If its kurtosis is less than 3, it is said to be platykurtic.
[Figure: two distributions] The distribution on the right has higher kurtosis than the one on the left. It is more peaked at the center, and it has fatter tails.
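The moment-based skewness and kurtosis estimates can be sketched in Python as below. This is a minimal illustration that assumes the moments about the mean use a divisor of n, as in the moment formula of this bonus section; statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and both samples are hypothetical and not taken from the slides.

```python
def central_moment(xs, j):
    """j-th sample moment about the mean: (1/n) * sum((x_i - xbar)**j)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** j for x in xs) / n


def skewness(xs):
    """M_3 / M_2**1.5: zero for a symmetric sample, positive for a long right tail."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """M_4 / M_2**2: approximately 3 for normally distributed data."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


symmetric = [1, 2, 3, 4, 5]              # hypothetical symmetric, flat sample
right_tailed = [1, 1, 1, 2, 2, 3, 10]    # hypothetical sample with one large value

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tailed), kurtosis(right_tailed))
```

The symmetric sample gives a skewness of zero and a kurtosis below 3 (platykurtic), while the sample with one large value gives a positive skewness and a larger kurtosis, since a single extreme deviation accounts for most of its variance.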