Download Note

Statistics for Business (Env) Chapter 3 Descriptive Statistics: Numerical Methods 1 Descriptive Statistics 3.1 Describing Central Tendency 3.2 Measures of Variation 3.3 Percentiles, Quartiles and Box-andWhiskers Displays 3.4 Covariance, Correlation, and the Least Square Line 3.5 Weighted Means and Grouped Data (Optional) 3.6 The Geometric Mean (Optional) Describing Central Tendency • In addition to describing the shape of a distribution, want to describe the data set’s central tendency – A measure of central tendency represents the center or middle of the data – It is most typical or most representative of the entire data Parameters and Statistics • A population parameter is a number calculated from all the population measurements that describes some aspect of the population • A sample statistic is a number calculated using the sample measurements that describes some aspect of the sample Measures of Central Tendency Mean,  Median, Md Mode, Mo The average or expected value The value of the middle point of the ordered measurements The most frequent value The Mean Population X1, X2, …, XN  Sample x1, x2, …, xn x Population Mean Sample Mean n N   Xi i=1 N x x i i=1 n The Sample Mean For a sample of size n, the sample mean is defined as n x x i 1 n i x1  x2  ...  xn  n and is a point estimate of the population mean  • It is the value to expect, on average and in the long run • And the amount each member gets when the total is distributed equally within the sample Mean as the balance point for a distribution Data: 2, 2, 6, 10 mean=(2+2+6+10)/4=5 8 Data: 3, 6, 6, 9, 11 mean=(3+6+6+9+11)/5=7 What will happen to the mean if we add one more number to the data? 9 The Median The median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it 1. If the number of measurements is odd, the median is the middlemost measurement in the ordering 2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordering Data: 3, 5, 8, 10, 11 median=8 11 Data: 3, 3, 4, 5, 7, 8 median=(4+5)/2=4.5 12 Data: 1, 2, 2, 3, 4, 4, 4, 4, 4, 5 median=4?? 13 14 Example for non-integer data • Example 3.1: First five observations from Table 3.1: 30.8, 31.7, 30.1, 31.6, 32.1 • In order: 30.1, 30.8, 31.6, 31.7, 32.1 • There is an odd so median is one in middle, or 31.6 Data: 2, 2, 2, 3, 3, 12 mean=4 median=(2+3)/2=2.5 16 The Mode The mode Mo of a population or sample of measurements is the measurement that occurs most frequently – Modes are the values that are observed “most typically” – Sometimes higher frequencies at two or more values • If there are two modes, the data is bimodal • If more than two modes, the data is multimodal – When data are in classes, the class with the highest frequency is the modal class • The tallest box in the histogram Histogram Describing the 50 Mileages 19 Selecting a measure of Central Tendency • Usually the mean is a good measure, because it uses every score in the distribution. • There are some extreme cases in which the mean is not representative (or calculable). Then the mode and the median are used. 20 Mean=(10+11*4+12*3+13+100)/10=20.3 Mode=11 Median=(11+12)/2=11.5 21 Mean – not computable Median=(12+13)/2=12.5 Mode – not meaningful Open-ended distributions A distribution is said to be open-ended when there is no upper limit (or lower limit) for one of the categories 22 Measures of Variation/variability • Knowing the measures of central tendency is not enough • Both of the distributions below have identical measures of central tendency 24 Measures of Variation Range Largest minus the smallest measurement Variance The average of the squared deviations of all the population measurements from the population mean Standard Deviation The square root of the variance They provide quantitative measures of the degree to which data in a distribution are spread out or clustered together. Range for discrete & continuous data • The range is the distance between the largest score (Xmax) and the smallest score (Xmin) in the distribution for discrete data. • For continuous data, you must also take into account the real limits of the maximum and minimum X values. • range = URL Xmax - LRL Xmin 26 Population Variance and Standard Deviation • The population variance (σ2) is the average of the squared deviations of the individual population measurements from the population mean (µ) • The population standard deviation (σ) is the positive square root of the population variance Variance • For a population of size N, the population variance σ2 is: N 2  2   x    i i 1 N 2 2 2  x1     x2       xN     N • For a sample of size n, the sample variance s2 is: n 2 s2   x  x  i 1 i n 1 2 2 2  x1  x   x2  x     xn  x   n 1 Sample variability tends to underestimate the population value 29 Standard Deviation • Population standard deviation (σ):    2 • Sample standard deviation (s): s s 2 Example: Sample Variance and Standard Deviation • Data points are: 60, 41, 15, 30, 34 • Mean is 36 • Variance is:  2 2 2 2 2 2  60  36  41  36  15  36  30  36  34  36  5 576  25  441  36  4 1082    216.4 5 5 Standard deviation is:   216.4  14.71 X 32 Percentiles & Quartiles • • • • For a set of measurements arranged in increasing order, the pth percentile is a value such that p percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above the value The first quartile Q1 is the 25th percentile The second quartile (or median) is the 50th percentile The third quartile Q3 is the 75th percentile The interquartile range IQR is Q3 - Q1 Cumulative percentages & PERCENTILES Q3 Q1 X=2 means that the measurement was somewhere between the real limits of 1.5 and 2.5. 30% of the individuals have been accumulated by the time you reach the top of the interval for X=2. 34 35 What is the 95th percentile? (Answer: X = 4.5.) What is the percentile rank for X = 3.5? (Answer: 70%.) What is the 50th percentile? What is the percentile rank for X = 4? estimates of these values by a standard procedure known as interpolation 36 37 38 Using the following distribution of scores, we will use interpolation to find the 50th percentile: 39 For the scores, the width of the interval is 5 points. For the percentages, the width is 50 points. The value of 50% is located 10 points from the top of the percentage interval. As a fraction of the whole interval, this is 10 out of 50, or 1/5 of the total interval. The 50th percentile is X = 8.5. 40 USING INTERPOLATION TO FIND THE MEDIAN Answer: X = 3.70 is the median Notice that this is exactly the same answer we obtained using the graphic method of interpolation in Figure 3.7 41 42 43 Example: Quartiles A slightly different way to find the quartiles (without using interpolation). 20 customer satisfaction ratings: 1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10 Md = (8+8)/2 = 8 Q1 = (7+8)/2 = 7.5 Q3 = (9+9)/2 = 9 IQR = Q3  Q1 = 9  7.5 = 1.5 Five Number Summary in descriptive statistic 1. 2. 3. 4. 5. The smallest measurement The first quartile, Q1 The median, Md The third quartile, Q3 The largest measurement • Displayed visually using a box-and-whiskers plot Box-and-whisker plots A box and whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem and leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set. 46 Outliers • Outliers are measurements that are very different from other measurements – They are either much larger or much smaller than most of the other measurements • Outliers lie beyond the fences of the box-andwhiskers plot – Measurements between the inner and outer fences are mild outliers – Measurements beyond the outer fences are severe outliers Box-and-Whiskers Plots • The box plots the: – first quartile, Q1 – median, Md – third quartile, Q3 – inner fences – outer fences From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree, Box-and-Whiskers Plots Continued • Inner fences – Located 1.5IQR away from the quartiles: • Q1 – (1.5  IQR) • Q3 + (1.5  IQR) • Outer fences – Located 3IQR away from the quartiles: • Q1 – (3  IQR) • Q3 + (3  IQR) From: Business Statistics in Practice, 5th Edition, Bowerman O’Connell Murphree, Box-and-Whiskers Plots Continued • The “whiskers” are dashed lines that plot the range of the data – A dashed line drawn from the box below Q1 down to the smallest measurement between the inner fences – Another dashed line drawn from the box above Q3 up to the largest measurement between the inner fences Box-and-Whiskers Plots Continued Symmetric distribution: A distribution having the same shape on either side of the center Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution. positive negative Can be positively or negatively skewed, or bimodal Symmetric distribution: mean = median = mode Mean Median Mode The Relative Positions of the Mean, Median, and Mode: Symmetric Distribution 54 Relationships Among Mean, Median and Mode Relative Positions of the Mean, Median, and Mode Mean<Median<Mode The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is also said to be left-skewed. Mode<Median<Mean The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is also said to be right-skewed. Skewness is the measurement of the lack of symmetry of the distribution. coefficient of skewness can range from -3.00 The up to 3.00 when using the following formula: A value of 0 indicates a symmetric distribution. Negatively Skewed Distribution has negative coefficient of skewness. Positively Skewed Distribution has positive coefficient of skewness.  3 X  Median Sk  s Some software packages use a different formula which results in a wider range for the coefficient.  Normal distribution Normal distribution (Gaussian distribution) is a symmetric distribution that often gives a good description of data that cluster around the mean. The graph of the distribution is bell-shaped, with a peak at the mean, and is known as the Gaussian function or bell curve. Normal distribution in nature The bean machine is a device invented by Sir Francis Galton to demonstrate how the normal distribution appears in nature. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve. Normal distribution in nature Height (in.) Distribution of the heights of 1052 women fits the normal distribution, with a goodness of fit p value of 0.75 Histogram of daily percentage changes in the S&P 500 index The Empirical Rule for a Normal/Gaussian distribution • If a population has mean µ and standard deviation σ and is described by a normal distribution, then – 68.26% of the population measurements lie within one standard deviation of the mean: [µ-σ, µ+σ] – 95.44% of the population measurements lie within two standard deviations of the mean: [µ-2σ, µ+2σ] – 99.73% of the population measurements lie within three standard deviations of the mean: [µ-3σ, µ+3σ] Origin and meaning of “Six Sigma Process" Six Sigma – quality control of process outputs If one has six standard deviations between the process mean and the nearest specification limit, as shown in the graph, practically no items will fail to meet specifications. If the upper and lower specification limits (USL, LSL) are at a distance of 6σ from the mean, values lying that far away from the mean are extremely unlikely. Even if the mean were to move right or left by 1.5σ at some point in the future (1.5 sigma shift), there is still a good safety cushion. This is why if the mean is at least 6σ away from the nearest specification limit, the process will be under good quality control. Chebyshev’s Theorem • Let µ and σ be a population’s mean and standard deviation, then for any value k> 1 • At least 100(1 - 1/k2 )% of the population measurements lie in the interval [µ-kσ, µ+kσ] • Or, at most 100/k2 % of the population measurements lie outside the interval [µ-kσ, µ+kσ] Chebyshev’s Theorem: what is it? • Chebyshev’s Theorem tells us roughly the percentage of data with values that will fall within a certain number of standard deviations of the mean. • Chebyshev's Theorem applies to all distributions regardless of their shape and can therefore be used for non normal distributions and distributions where the shape is unknown. Chebyshev’s Theorem: comparision Minimum percentage of the population lie within the interval Chebyshev’s Theorem: Example One study of heights in the U.S. concluded that women have a mean height of 65 inches with a standard deviation of 2.5 inches. If this is correct, what is the percentage of women that have heights between 60 and 70 inches according to Chebyshev’s Theorem? If we assume the distribution is a normal distribution, what is the percentage of women with heights between 60 and 70 inches? Since the interval between 60 & 70 inches is 2 σ away from the mean, according to Chebyshev’s Theorem, there are at least 100(1 - 1/22 )% or 75% of women with heights between 60 and 70 inches. But if we know the distribution is a normal distribution, then there is 95.44% of the population measurements lie within two standard deviations from the mean. z Scores • For any x in a population or sample, the associated z score is defined as x  mean z standard deviation Z-score is: the exact location of a score within a distribution The z-score transforms each X value into a signed number so that 1. The sign tells whether the score is located above (+) or below (-) the mean, and 2. The number tells the distance between the score and the mean in terms of the number of standard deviations. 69 Z=(76-70)/3=2 Z=(76-70)/12=0.5 Two distributions of exam scores. For both distributions,  = 70, but for one distribution, σ = 3, and for the other, σ = 12. The position of X = 76 is very different for these two distributions. 70 z Scores: application What if you score a 65 on a test? On the surface it might look bad. But what if that was the mean is 45 and the standard deviation is 10. Then your z-Score is Z = (65 – 45)/10 = 2. That means you are 2 σ above the mean value. From Chebyshev’s Theorem, you are at worse in the top 12.5% of the class if the distribution is symmetric. If the distribution is normal you are in the top 2-3% of the class. z Scores: assignment In order to get an “A”, you should be in the top 10% of the class. What should be your z-Score to get an “A”, according to Chebyshev’s Theorem and assuming the distribution is symmetrical? Example: z Score • Population of profit margins for five American companies: 8%, 10%, 15%, 12%, 5% • µ = 10%, σ = 3.406% 74 TRANSFORMING z-SCORES TO A DISTRIBUTION WITH A PREDETERMINED mean and standard deviation An instructor gives an exam to a psychology class. For this exam, the distribution of raw scores has a mean of 57 with sd= 14. The instructor would like to simplify the distribution by transforming all scores into a new, standardized distribution with mean= 50 and sd= 10. To demonstrate this process, we will consider what happens to two specific students: Joe, who has a raw score of X= 64 in the original distribution; and Maria, whose original raw score is X= 43. Step 1: Transform each of the original raw scores into z-scores. For Joe, X= 64, so his z-score is 0.5 For Maria, X= 43, and her z-score is -1.0 Step 2: Change the z-scores to the new, standardized scores. Joe’s z-score, z=0.5, indicates that he is above the mean by exactly 1/2 standard deviation, so his standardized score would be 55. Maria’s score is located 1 standard deviation below the mean. So her new score would be X= 40. 75 Example: A psychologist has developed a new intelligence test. For years, the test has been given to a large number of people; for this population, mean= 65 and sd= 10. The psychologist would like to make the scores of this test comparable to scores on other IQ tests, which have mean= 100 and sd= 15. If the test is standardized so that it is comparable (has the same mean and sd) to other tests, what would be the standardized scores for the following individuals? For A B C z=(75-65)/10=1 z=(45-65)/10=-2 z=(67-65)/10=0.2 standardized score=100+1x15=115 100-2x15=70 100+0.2x15=103 76 Coefficient of Variation • Measures the size of the standard deviation relative to the size of the mean • Coefficient of variation =standard deviation/mean × 100% • Used to: – Compare the relative variabilities of values about the mean – Compare the relative variability of populations or samples with different means and different standard deviations – Measure risk Covariance, Correlation, and the Least Squares Line • When points on a scatter plot seem to fluctuate around a straight line, there is a linear relationship between x and y • A measure of the strength of a linear relationship is the covariance sxy  x  x y  y  n s xy  i 1 i i n 1 Covariance • A positive covariance indicates a positive linear relationship between x and y – As x increases, y increases • A negative covariance indicates a negative linear relationship between x and y – As x increases, y decreases Interpretation of the Sample Covariance Interpretation of the Sample Covariance Continued • Points in quadrant I correspond to xi and yi both greater than their averages – (x- x ) and (y-y̅) both positive so covariance positive • Points in quadrant III correspond to xi and yi both less than their averages – (x- x ) and (y-y̅) both negative so covariance positive • If sxy is positive, the points having the greatest influence are in quadrants I and III • Therefore, a positive sxy indicates a positive linear relationship Interpretation of the Sample Covariance Continued • Points in quadrant II correspond to xi less than x and yi greater than y – (x- x) is negative and (y-y̅) is positive so covariance negative • Points in quadrant IV correspond to xi greater than x and yi less than y̅ – (x- x) is positive and (y- y̅) is negative so covariance negative • If sxy is negative, the points having the greatest influence are in quadrants II and IV • Therefore, a negative sxy indicates a negative linear relationship Correlation Coefficient • Magnitude of covariance does not indicate the strength of the relationship – Magnitude depends on the unit of measurement used for the data • Sample correlation coefficient (r) is a measure of the strength of the relationship that does not depend on the magnitude of the data r s xy sx s y Correlation Coefficient Continued • Sample correlation coefficient r is always between -1 and +1 – Values near -1 show strong negative correlation – Values near 0 show no correlation – Values near +1 show strong positive correlation • Sample correlation coefficient is the point estimate for the population correlation coefficient ρ Least Squares Line • If there is a linear relationship between x and y, might wish to predict y on the basis of x • This requires the equation of a line describing the linear relationship • Line is calculated based on least squares line – Discussed in detail in Chapter 13 Least Squares Line Continued • Need to calculate slope (b1) and y-intercept (b0) s xy b1  2 • sx • b0  y  b1 x Least Squares Line for the Sales Volume Data Weighted Means • Sometimes, some measurements are more important than others – Assign numerical “weights” to the data • Weights measure relative importance of the value • Calculate weighted mean as w x w i i i where wi is the weight assigned to the ith measurement xi Example: Weighted Mean • June 2001 unemployment rates by census region – Northeast, 26.9 million in civilian labor force, 4.1% unemployment rate – South, 50.6 million, 4.7% unemployment – Midwest, 34.7 million, 4.4% unemployment – West, 32.5 million, 5.0 unemployment • Want the mean unemployment rate for the US Example: Weighted Mean Continued • Want the mean unemployment rate for the U.S. • Calculate it as a weighted mean – So that the bigger the region, the more heavily it counts in the mean • The data values are the regional unemployment rates • The weights are the sizes of the regional labor forces Example: Weighted Mean Continued  26 .9  4.1  50 .6  4.7   34 .7  4.4   32 .5  5.0   26 .9  50 .6  34 .7  25 .5  32 .5 663 .29   4.58 % 144 .7 • Note that the unweigthed mean is 4.55%, which underestimates the true rate by 0.03% – That is, 0.0003  144.7 million = 43,410 workers Descriptive Statistics for Grouped Data • Data already categorized into a frequency distribution or a histogram is called grouped data • Can calculate the mean and variance even when the raw data is not available • Calculations are slightly different for data from a sample and data from a population Descriptive Statistics for Grouped Data (Sample) • Sample mean for grouped data:  fi M i  fi M i x  n  fi • Sample variance for2grouped data:  f i M i  x  2 s  n 1 fi is the frequency for class i Mi is the midpoint of class i n = Σfi = sample size Descriptive Statistics for Grouped Data (Population) • Population mean for grouped data:  fi M i  fi M i   N  fi • Population variance for grouped data: 2   f M  x  i i 2  N fi is the frequency for class i Mi is the midpoint of class i N = Σfi = population size The Geometric Mean (Optional) • For rates of return of an investment, use the geometric mean to give the correct wealth at the end of the investment • Suppose the rates of return (expressed as decimal fractions) are R1, R2, …, Rn for periods 1, 2, …, n • The mean of all these returns is the calculated as the geometric mean: Rg  n 1  R1  1  R2  1  Rn  1 3- 96 Mean Deviation The arithmetic mean of the absolute values of the deviations from the arithmetic mean. The main features:  All values are used in the calculation.  It is sensitive to unusual data.  The absolute values are difficult to manipulate. So, it is not used in inferential stat. MD = X-X n Mean Deviation Chapter Three: Summary ONE Calculate the arithmetic mean, median, mode, weighted mean, and the geometric mean. TWO Explain the characteristics of each measure of location. THREE Identify the position of the arithmetic mean, median, and mode for both a symmetrical and a skewed distribution. Chapter Three: Summary FOUR Compute and interpret the range, the mean deviation, the variance, and the standard deviation of ungrouped data. FIVE Explain the characteristics, uses, advantages, and disadvantages of each measure of dispersion. SIX Understand Chebyshev’s theorem and the Empirical Rule as they relate to a set of observations.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Note