Download i Q - York University

Business Statistics, Can. Ed.
By Black, Chakrapani & Castillo
Chapter 3: Descriptive Statistics
Prepared by Dr. Clarence S. Bayne, JMSB, Concordia University
Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd.

Learning Objectives
• Describe data and transform data to provide information.
• Define and quantify concepts of central tendency, variability, shape, and association.
• Understand and interpret the mean, median, mode, percentiles (including quartiles), the range, variance, and standard deviation.
• Compute the mean, median, mode, percentiles (and quartiles), the range, mean absolute deviation, variance, and standard deviation using ungrouped data.
• Differentiate between sample and population variance and standard deviation.

Learning Objectives (Continued)
• Use grouped data to compute the mean, mode, standard deviation, and variance.
• Use the empirical rule and Chebyshev's theorem to understand probability distributions.
• Understand the meaning of the standard deviation in the context of the empirical rule and Chebyshev's theorem.
• Understand skewness as a feature of the shape of a distribution.
• Use the box and whisker plot to describe data.

Measures of Central Tendency: Ungrouped Data
• Ungrouped data are any array of numbers that has not been summarized by statistical techniques.
• Measures of central tendency reveal information about the values at the center, or middle part, of a group of things (or ordered array).
• Common measures of central tendency are the:
  – Mode
  – Median
  – Mean
  – Percentiles
  – Quartiles

The Mode
• The mode is the value that occurs most frequently in the data or array.
• This conceptualization of the mode applies to all levels of data measurement.
• Unimodal: describes data sets with a single mode
• Bimodal: describes data sets that have two modes
• Multimodal: describes data sets that contain more than two modes

Example of the Mode
• The arrangement of the numbers in the frame below is nonspecific and represents an array.
• 44 is the data value that occurs most frequently (5 times).
• The mode is 44.

  35  41  44  45
  37  41  44  46
  37  43  44  46
  39  43  44  46
  40  43  44  46
  40  43  45  48

The Median
• The median is the middle value in an ordered array of numbers.
• The median is unaffected by extremely large and extremely small values in the data set (array).

Computing the Median
• First Procedure
  – Arrange the observations in an ordered array.
  – If there is an odd number of observations, the median is the observation located at the middle of the ordered array.
• Second Procedure
  – If there is an even number of observations, the median is the average of the two middle observations, that is, the midpoint of the interval between them.
• General Procedure
  – The median's position in an ordered array is given by (n + 1)/2.

Median: Example with an Odd Number of Terms
• Let X be an ordered array with the following values: {3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22}
• There are 17 elements in the ordered array.
• Position of median = (n + 1)/2 = (17 + 1)/2 = 9th position
• Counting from left to right, the median is 15.
• Extreme values do not distort the median value.
• If 22 (the maximum) is replaced by 100, the median is still 15.
• If 3 (the minimum) is replaced by -103, the median is still 15.
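The mode and median procedures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names `modes` and `median` are chosen here for the example.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, in ascending order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

def median(values):
    """Median of a data set, following the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a single middle observation.
        return ordered[mid]
    # Even n: position (n + 1) / 2 falls halfway between two observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

mode_data = [35, 41, 44, 45, 37, 41, 44, 46, 37, 43, 44, 46,
             39, 43, 44, 46, 40, 43, 44, 46, 40, 43, 45, 48]
odd_array = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]

print(modes(mode_data))                  # [44]
print(median(odd_array))                 # 15
print(median(odd_array[:-1] + [100]))    # 15, maximum replaced by 100
print(median([-103] + odd_array[1:]))    # 15, minimum replaced by -103
```

Returning a list from `modes` keeps the bimodal and multimodal cases visible instead of silently picking one value.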
Median: Example with an Even Number of Terms
• Let X be an ordered array with the following values: {3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21}
• There are 16 terms in the ordered array.
• Position of median = (n + 1)/2 = (16 + 1)/2 = 8.5
• That is, the median is a value between the observations in the 8th and 9th positions of the ordered array. The median is 14 + 0.5(15 - 14) = 14.5 or, simply, (14 + 15)/2 = 14.5.
• If 21 is replaced by 100, the median is still 14.5.
• If 3 is replaced by -88, the median is still 14.5.

The Arithmetic Mean
• The arithmetic mean is commonly called "the mean".
• It is the average of a group of numbers.
• The mean is computed by summing all values in the data set and dividing the sum by the number of values in the data set.
• Thus, its value is affected by each value in the data set, including extreme values.

Application of Arithmetic Mean in Statistics
• The arithmetic mean is used as a summary statistic of central tendency in data produced by business and economic processes.
• In these settings it is important to distinguish between:
  – the population mean, $\mu$
  – the sample mean, $\bar{X}$
• The population mean is based on measurements of all the possible outcomes of a process.
• The sample mean is based on some of the observed outcomes making up the population.

Population Mean
\[ \mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \cdots + X_N}{N} = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6 \]

Sample Mean
\[ \bar{X} = \frac{\sum X}{n} = \frac{X_1 + X_2 + X_3 + \cdots + X_n}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167 \]
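As a quick check of the two calculations above, the same arithmetic can be written out in Python. This is only a sketch; the helper name `mean` is chosen here, and the point is that the population and sample means use the same computation and differ only in notation and in what the data represent.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

population = [24, 13, 19, 26, 11]    # all N = 5 outcomes of the process
sample = [57, 86, 42, 38, 90, 66]    # n = 6 observations from a larger population

mu = mean(population)
x_bar = mean(sample)

print(f"mu = {mu:.1f}")          # mu = 18.6
print(f"x_bar = {x_bar:.3f}")    # x_bar = 63.167
```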
Impact of Extreme Values on the Mean
• The mean is the most commonly used measure of central tendency because of its mathematical properties and because it uses all the data points in the data set.
• However, the mean is affected by extremely large or extremely small numbers.
• In the sample mean example, if the value 66 is replaced by 1,000, the mean becomes 218.833 as opposed to 63.167.
• If the value 57 is replaced by 5, the mean becomes 54.5 as opposed to 63.167.
• The distortions are significant in both cases.

Percentiles
• In general, percentiles are not influenced by extreme values in the data set.
• Percentiles are measures of central tendency that divide a group of data into 100 parts.
• The nth percentile: at least n% of the data lie below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile.
• For example, the 90th percentile is a value such that at least 90% of the data lie below it, and at most (no more than) 10% of the data lie above it.
• The median is de facto the 50th percentile and has the same value as the 50th percentile.
• Percentiles are stair-step values: the 88th and 89th percentiles have no values between them.

Percentiles are Stair Steps
• Percentiles are discrete values that serve to separate lower values from upper values in the data set.
• A percentile indicates the proportion of things that have values below it, and it is the lower bound of the remaining proportion of things with values above it.
• Thus, graphically, a percentile is a single point at which there is a step up from the proportion of things with values below it, or a step down from the proportion of things with values above it.
• Note that there are no values between the 87th and 88th percentiles.
[Figure: Stair Step Percentiles]
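The sensitivity of the mean to extreme values, described in the "Impact of Extreme Values on the Mean" slide, can be reproduced with Python's standard `statistics` module. This is an illustrative sketch using the six sample values from the sample mean example; the variable names are chosen here.

```python
from statistics import mean, median

sample = [57, 86, 42, 38, 90, 66]
with_large = [1000 if x == 66 else x for x in sample]   # 66 replaced by 1,000
with_small = [5 if x == 57 else x for x in sample]      # 57 replaced by 5

for label, data in [("original", sample),
                    ("66 -> 1000", with_large),
                    ("57 -> 5", with_small)]:
    print(f"{label:<12} mean = {mean(data):8.3f}   median = {median(data):5.1f}")
```

With only six observations the median also moves, because each replaced value sits at or next to the middle of the ordered array, but it moves far less than the mean does when 66 becomes 1,000.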
Percentiles: Computational Procedure
• Organize the data into an ascending ordered array.
• Calculate the percentile location index using
\[ i = \frac{P}{100}(n) \]
  where P = percentile, i = percentile location, and n = sample size.
• Search the ordered array, counting from left to right, to find where the percentile is located and determine its value.
• If i is a whole number, the percentile is the average of the values in the ith and (i + 1)th positions.
• If i is not a whole number, the percentile is at the whole-number part of (i + 1) in the ordered array.

Calculating Percentiles: An Example
• Raw data: 14, 12, 19, 23, 5, 13, 28, 17
• Ordered array: 5, 12, 13, 14, 17, 19, 23, 28
• Problem: find the 30th percentile.
• Number of observations: n = 8
• Location of the 30th percentile:
\[ i = \frac{30}{100}(8) = 2.4 \]
• The location index, i, is not a whole number.
• Therefore the location is the whole-number portion of (i + 1) = 2.4 + 1 = 3.4.
• The whole-number portion is 3. The 30th percentile is at the 3rd location of the array: 30th percentile = 13.

Quartiles
• Quartiles are measures of central tendency that divide a group of data into four subgroups.
• Quartile values are not necessarily members of the data set.
  – Q1: 25% of the data set is below the first quartile
  – Q2: 50% of the data set is below the second quartile
  – Q3: 75% of the data set is below the third quartile
• Relationship between quartiles and percentiles
  – Q1 is equal to the 25th percentile
  – Q2 is located at the 50th percentile and equals the median
  – Q3 is equal to the 75th percentile
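The location-index procedure above translates directly into code. The sketch below assumes an integer percentile P strictly between 0 and 100 so that the "whole number" test can be done with integer arithmetic; the function name `percentile` is chosen for this example. Statistical libraries often use other interpolation rules and may return different values.

```python
def percentile(values, p):
    """Percentile by the location-index rule i = (p / 100) * n.

    Assumes p is an integer with 0 < p < 100 (an assumption of this sketch).
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # i is a whole number: average the ith and (i + 1)th values (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the value at the whole-number part of i + 1 (1-based).
    return ordered[whole]

raw = [14, 12, 19, 23, 5, 13, 28, 17]

print(percentile(raw, 30))   # 13
# Quartiles are the 25th, 50th and 75th percentiles of the same data.
print([percentile(raw, p) for p in (25, 50, 75)])   # [12.5, 15.5, 21.0]
```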
Calculating Quartiles: An Example
• Let X be an ordered array: X = {106, 109, 114, 116, 121, 122, 125, 129}
• Q1:
\[ i = \frac{25}{100}(8) = 2, \qquad Q_1 = \frac{109 + 114}{2} = 111.5 \]
• Q2:
\[ i = \frac{50}{100}(8) = 4, \qquad Q_2 = \frac{116 + 121}{2} = 118.5 \]
• Q3:
\[ i = \frac{75}{100}(8) = 6, \qquad Q_3 = \frac{122 + 125}{2} = 123.5 \]
• Note that when i is a whole number, the quartile is the average of the ith and (i + 1)th values in the ordered set.

Dispersion and Convergence
• Things tend to be alike or dissimilar, or to be associated in some way.
• A company's sales effectiveness is better understood if one knows by how much sales staff are exceeding, falling short of, or just meeting the company's historical standards.
• The mean tells us about the staff in the middle, the average performers, but it does not tell us about the differences in performance.
• Measures of dispersion or spread provide the tools that answer these latter questions.
• When used with measures of central tendency, they make possible a more complete numerical description of the data.
• This variability is most frequently expressed in terms of deviation from the norm or mean. The images in the next slides express this visually.

Variability
[Figure: two cash-flow panels, each with the mean marked. Panel 1: No Variability in Cash Flow (same amounts). Panel 2: Variability in Cash Flow (different amounts).]

Variability
[Figure: two panels labelled Variability and No Variability.]

Quantitative Indicators of Variability for Ungrouped Data
• Measures of variability describe the spread or the dispersion of a set of data.
• Common measures of variability are:
  – Range
  – Interquartile Range
  – Mean Absolute Deviation
  – Variance
  – Standard Deviation
  – Z scores
  – Coefficient of Variation
Range The range is the difference between the largest and smallest values in the data set Usefulness: − Simple to compute Disadvantages; – Ignores all data points except extremes – Influenced by extreme values – Has no reference point – Has limited use by itself Example of range using data provided: Range  48  35  13 Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. 35 41 44 45 37 41 the two 44 46 37 43 44 46 39 43 44 46 40 43 44 46 40 43 45 48 Interquartile Range • The interquartile range contain all values in the interval between the first and third quartiles • The interquartile range account for the middle 50% of values in the ordered data set • The interquartile range is especially useful in situations where data users are more interested in values toward the middle and less interested in extremes. • The interquartile range is less influenced by extremes Interquartile Range  Q 3  Q1 Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. Deviation from the Mean • Data set: 5, 9, 16, 17, 18 • µ = 13 • An examination of deviations from the mean can reveal information about the variability of data. • However, the individual deviations are used mostly as a tool to compute other measures of variability • (x ‐ ) show distances around the mean or individual deviation from the mean: ‐8, ‐4, 3, 4, 5 Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. Mean Absolute Deviation • Show the average of the absolute deviations or the tendency for observations to differ on the average from the norm for the process or situation. • Easy to calculate but not as statistically good and unbiased estimate as the variance and standard deviation measures. Observations X X- µ |X-µ| 1 5 -8 +8 2 9 -4 +4 3 16 +3 +3 4 17 +4 +4 5 18 +5 +5 Totals 65 0 24   M . A. D. Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd.  24 5  4. 
8 X  N Population Variance • Average of the squared deviations from the arithmetic mean • Statistics measured in squared units are problematic to interpret. Customary to use standard deviation X 5 9 16 17 18 X   X -8 -4 +3 +4 +5 0    64 16 9 16 25 130 2  X    2  Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. 2  130  5  26 .0 N Population Standard Deviation  X    2 • Square root of the variance  • Easier to interpret in practice 2  N 130  5  2 6 .0     2 6 .0  5 .1 Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. 2 Sample Variance • Average of the squared deviations from the arithmetic mean for a set of data X 2,398 1,844 1,539 1,311 7,092 X  X X 625 71 -234 -462 0  X  2 390,625 5,041 54,756 213,444 663,866 Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. X  X  2 S 2  n 1 663,866  3  221,288.67 Sample Standard Deviation  X  X  2 • Square root of the sample variance S • Easier to interpret in practice than square units. 2  n1 6 6 3 ,8 6 6  3  2 2 1, 2 8 8 .6 7 S   S 2 2 2 1, 2 8 8 .6 7  4 7 0 .4 1 Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. Uses of Standard Deviation • Indicator of financial risk • Quality Control – construction of quality control charts – process capability studies • Comparing two or more populations – household incomes in two cities: – employee absenteeism at two plants – used as a percentage of the mean, the coefficient of variation (CV). Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. Standard Deviation as an Indicator of Financial Risk Annualized Rate of Return Financial Security   A 15% 3% B 15% 7% Business Statistics, Can. Ed. © 2010 John Wiley & Sons Canada, Ltd. 
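The mean absolute deviation, variance, and standard deviation examples in this section can be verified with a short Python sketch. The function names are chosen here for illustration; the only design choice is the `sample` flag, which switches the divisor from N (population) to n - 1 (sample).

```python
import math

def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

def variance(values, sample=False):
    """Sum of squared deviations divided by N (population) or n - 1 (sample)."""
    m = sum(values) / len(values)
    squared = sum((x - m) ** 2 for x in values)
    divisor = len(values) - 1 if sample else len(values)
    return squared / divisor

population = [5, 9, 16, 17, 18]
sample_data = [2398, 1844, 1539, 1311]

print(mean_absolute_deviation(population))                         # 4.8
print(variance(population))                                        # 26.0
print(round(math.sqrt(variance(population)), 1))                   # 5.1
print(round(variance(sample_data, sample=True), 2))                # 221288.67
print(round(math.sqrt(variance(sample_data, sample=True)), 2))     # 470.41
```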
Symmetric and Asymmetric Distributions
• Data are either symmetric or non-symmetric with respect to some measure of central tendency.
• Statisticians have observed that distributions describing many types of business and economic data tend to be symmetric or have a normal shape.
• They found that, in practical terms, the processes that generate symmetric data have special and exact properties (the empirical rule) with respect to data concentration.
• Non-symmetric distributions, in practice and theory, obey at a minimum specified rules with respect to the concentration of data values in a population (Chebyshev's theorem).

Empirical Rule
When data are normally distributed or approximately normal:

  Distance from the Mean    Percentage of Values Falling Within Distance
  µ ± 1σ                    68
  µ ± 2σ                    95
  µ ± 3σ                    99.7

Chebyshev's Theorem: When Data Are Normally Distributed or Nonsymmetric
• Chebyshev's theorem applies to all distributions.
• It gives the minimum mass or concentration of data that lies within a specified number of standard deviations around the mean.
\[ P(\mu - k\sigma < X < \mu + k\sigma) \ge 1 - \frac{1}{k^2} \quad \text{for } k > 1 \]

Coefficient of Variation
• Ratio of the standard deviation to the mean, expressed as a percentage.
• Measurement of relative dispersion.
\[ CV = \frac{\sigma}{\mu}(100) \]

Coefficient of Variation: Example
\[ \mu_1 = 29, \quad \sigma_1 = 4.6, \quad CV_1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86 \]
\[ \mu_2 = 84, \quad \sigma_2 = 10, \quad CV_2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90 \]

3.2 MEASURES OF VARIABILITY: UNGROUPED DATA

Solution
The researcher computes the mean absolute deviation, the variance, and the standard deviation for these data in the following manner.
  x            |x - x̄|             (x - x̄)²
   55           41                  1,681
  100            4                     16
  125           29                    841
  140           44                  1,936
   60           36                  1,296
  Σx = 480     Σ|x - x̄| = 154      Σ(x - x̄)² = 5,770

\[ \bar{x} = \frac{\sum x}{n} = \frac{480}{5} = 96 \]
\[ MAD = \frac{154}{5} = 30.8 \]
\[ s^2 = \frac{5{,}770}{4} = 1{,}442.5 \]
\[ s = \sqrt{s^2} = 37.98 \]

She then uses computational formulas to solve for s² and s and compares the results.

  x            x²
   55           3,025
  100          10,000
  125          15,625
  140          19,600
   60           3,600
  Σx = 480     Σx² = 51,850

\[ s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{51{,}850 - \frac{480^2}{5}}{4} = \frac{51{,}850 - 46{,}080}{4} = \frac{5{,}770}{4} = 1{,}442.5 \]
\[ s = \sqrt{1{,}442.5} = 37.98 \]

The results are the same. The sample standard deviation obtained by both methods is 37.98, or 38, years.

z SCORES
A z score represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed. Using z scores allows a value's raw distance from the mean to be translated into units of standard deviations.

z Score
\[ z = \frac{x - \mu}{\sigma} \]
For samples,
\[ z = \frac{x - \bar{x}}{s} \]

If a z score is negative, the raw value (x) is below the mean. If the z score is positive, the raw value (x) is above the mean.

For example, for a data set that is normally distributed with a mean of 50 and a standard deviation of 10, suppose a statistician wants to determine the z score for a value of 70. This value (x = 70) is 20 units above the mean, so the z value is
\[ z = \frac{70 - 50}{10} = +2.00 \]
This z score signifies that the raw score of 70 is two standard deviations above the mean. How is this z score interpreted? The empirical rule states that 95% of all values are within two standard deviations of the mean if the data are approximately normally distributed. Figure 3.7 shows that because the value of 70 is two standard deviations above the mean (z = +2.00), 95% of the values are between 70 and the value (x = 30) that is two standard deviations below the mean, or z = (30 - 50)/10 = -2.00. Because 5% of the values are outside the range of two standard deviations from the mean and the normal distribution is symmetrical, 2½% (½ of the 5%) are below the value of 30.
Thus 97½% of the values are below the value of 70. Because a z score is the number of standard deviations an individual data value is from the mean, the empirical rule can be restated in terms of z scores.

Between z = -1.00 and z = +1.00 are approximately 68% of the values.
Between z = -2.00 and z = +2.00 are approximately 95% of the values.
Between z = -3.00 and z = +3.00 are approximately 99.7% of the values.

The topic of z scores is discussed more extensively in Chapter 6.

COEFFICIENT OF VARIATION
The coefficient of variation is a statistic that is the ratio of the standard deviation to the mean expressed in percentage and is denoted CV.

Coefficient of Variation
\[ CV = \frac{\sigma}{\mu}(100) \]

The coefficient of variation is essentially a relative comparison of a standard deviation to its mean. The coefficient of variation can be useful in comparing standard deviations that have been computed from data with different means.

Measures of Central Tendency and Variability: Grouped Data
• Measures of Central Tendency
  – Mean
  – Median
  – Mode
• Measures of Variability
  – Variance
  – Standard Deviation

Mean of Grouped Data
• Weighted average of class midpoints.
• Relative class frequencies are the weights.
• The weights are $f_i / N$ for $i = 1, 2, 3, \ldots, k$.
\[ \mu_{grouped} = \frac{\sum_{i=1}^{k} f_i M_i}{\sum_{i=1}^{k} f_i} = \frac{\sum fM}{N} = \frac{f_1 M_1 + f_2 M_2 + f_3 M_3 + \cdots + f_k M_k}{f_1 + f_2 + f_3 + \cdots + f_k} \]

Calculation of Grouped Mean

  Class Interval    Frequency (f)    Class Midpoint (M)    fM
  20-under 30        6               25                     150
  30-under 40       18               35                     630
  40-under 50       11               45                     495
  50-under 60       11               55                     605
  60-under 70        3               65                     195
  70-under 80        1               75                      75
  Totals            50                                     2150

\[ \mu = \frac{\sum fM}{\sum f} = \frac{2150}{50} = 43.0 \]

Variance and Standard Deviation from Grouped Data
Population:
\[ \sigma^2 = \frac{\sum f (M - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2} \]
Sample:
\[ S^2 = \frac{\sum f (M - \bar{X})^2}{n - 1}, \qquad S = \sqrt{S^2} \]

Population Variance and Standard Deviation of Grouped Data

  Class Interval    f     M     fM      M - µ    (M - µ)²    f(M - µ)²
  20-under 30        6    25     150    -18       324         1944
  30-under 40       18    35     630     -8        64         1152
  40-under 50       11    45     495      2         4           44
  50-under 60       11    55     605     12       144         1584
  60-under 70        3    65     195     22       484         1452
  70-under 80        1    75      75     32      1024         1024
  Totals            50          2150                          7200

\[ \sigma^2 = \frac{\sum f (M - \mu)^2}{N} = \frac{7200}{50} = 144, \qquad \sigma = \sqrt{\sigma^2} = \sqrt{144} = 12 \]

Descriptions and Measures of Shape
• Skewness
  – Absence of symmetry
  – Presence of extreme values on one side or the other of a distribution
• Box and Whisker Plots
  – Graphic display of a distribution using five summary statistics
  – Reveals skewness and data location or clustering

Probability Distributions Showing Symmetry and Skewness
[Figure: three panels: Symmetrical; Right or Positively Skewed; Left or Negatively Skewed.]

[Figure: Symmetrical Shape Frequency Histogram Showing Relationship of Mean, Median and Mode]

[Figure: Relationship of Mean, Median and Mode When Data Are Negatively Skewed (to the Left)]

[Figure: Relationship of Mean, Median and Mode When Data Are Positively Skewed (to the Right)]
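The grouped-data mean and variance calculations earlier in this section can be checked with a short Python sketch. The tuple layout `(lower, upper, frequency)` and the function names are assumptions made for this example; each class is represented by its midpoint M, as in the slides.

```python
import math

# (lower bound, upper bound, frequency) for each class interval
classes = [(20, 30, 6), (30, 40, 18), (40, 50, 11),
           (50, 60, 11), (60, 70, 3), (70, 80, 1)]

def grouped_mean(classes):
    """Frequency-weighted average of the class midpoints."""
    total_f = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / total_f

def grouped_variance(classes, sample=False):
    """Sum of f * (M - mean)^2 divided by N (population) or n - 1 (sample)."""
    total_f = sum(f for _, _, f in classes)
    m = grouped_mean(classes)
    weighted_sq = sum(f * ((lo + hi) / 2 - m) ** 2 for lo, hi, f in classes)
    return weighted_sq / (total_f - 1 if sample else total_f)

mu = grouped_mean(classes)
sigma_sq = grouped_variance(classes)

print(mu)                     # 43.0
print(sigma_sq)               # 144.0
print(math.sqrt(sigma_sq))    # 12.0
```

Because every observation in a class is treated as if it sat at the class midpoint, these results approximate the values that would be obtained from the underlying ungrouped data.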
Requirements for a Box and Whisker Plot
• Five specific values are used:
  – Median, Q2
  – First quartile, Q1
  – Third quartile, Q3
  – Minimum value in the data set
  – Maximum value in the data set
• Inner fences: first indicators of extreme values
  – IQR = Q3 - Q1
  – Lower inner fence = Q1 - 1.5 IQR
  – Upper inner fence = Q3 + 1.5 IQR
• Outer fences: strong indicators of extreme values
  – Lower outer fence = Q1 - 3.0 IQR
  – Upper outer fence = Q3 + 3.0 IQR

Box and Whisker Plot
[Figure: box and whisker plot with Minimum, Q1, Q2, Q3, and Maximum marked.]

BOX AND WHISKER PLOTS
Another way to describe a distribution of data is by using a box and whisker plot. A box and whisker plot, sometimes called a box plot, is a diagram that utilizes the upper and lower quartiles along with the median and the two most extreme values to depict a distribution graphically. The plot is constructed by using a box to enclose the median. This box is extended outward from the median along a continuum to the lower and upper quartiles, enclosing not only the median but also the middle 50% of the data. From the lower and upper quartiles, lines referred to as whiskers are extended out from the box toward the outermost data values. The box and whisker plot is determined from five specific numbers.

1. The median (Q2)
2. The lower quartile (Q1)
3. The upper quartile (Q3)
4. The smallest value in the distribution
5. The largest value in the distribution

The box of the plot is determined by locating the median and the lower and upper quartiles on a continuum. A box is drawn around the median with the lower and upper quartiles (Q1 and Q3) as the box endpoints. These box endpoints (Q1 and Q3) are referred to as the hinges of the box. Next, the value of the interquartile range (IQR) is computed by Q3 - Q1. The interquartile range includes the middle 50% of the data and should equal the length of the box.
However, here the interquartile range is used outside the box also. At a distance of 1.5 · IQR outward from the lower and upper quartiles are what are referred to as inner fences. A whisker, a line segment, is drawn from the lower hinge of the box outward to the smallest data value. A second whisker is drawn from the upper hinge of the box outward to the largest data value. The inner fences are established as follows:

  Q1 - 1.5 · IQR
  Q3 + 1.5 · IQR

If data fall beyond the inner fences, then outer fences can be constructed:

  Q1 - 3.0 · IQR
  Q3 + 3.0 · IQR

Figure 3.13 shows the features of a box and whisker plot.

Data values outside the mainstream of values in a distribution are viewed as outliers. Outliers can be merely the more extreme values of a data set. However, sometimes outliers occur due to measurement or recording errors. Other times they are values so unlike the other values that they should not be considered in the same analysis as the rest of the distribution. Values in the data distribution that are outside the inner fences but within the outer fences are referred to as mild outliers. Values that are outside the outer fences are called extreme outliers. Thus, one of the main uses of a box and whisker plot is to identify outliers. In some computer-produced box and whisker plots, the whiskers are drawn to the largest and smallest data values within the inner fences. An asterisk is then printed for each data value located between the inner and outer fences to indicate a mild outlier. Values outside the outer fences are indicated by a zero on the graph. These values are extreme outliers.
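The fence rules and the mild/extreme outlier distinction described above can be sketched as a small Python function. The quartile values used in the usage lines (Q1 = 20, Q3 = 30) are hypothetical and chosen only to make the arithmetic easy; treating a value that lies exactly on a fence as "inside" is also an assumption of this sketch, since the text does not address that case.

```python
def fences(q1, q3):
    """Return the inner and outer fences for given first and third quartiles."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(value, q1, q3):
    """Label a value as an extreme outlier, a mild outlier, or not an outlier."""
    f = fences(q1, q3)
    inner_low, inner_high = f["inner"]
    outer_low, outer_high = f["outer"]
    if value < outer_low or value > outer_high:
        return "extreme outlier"      # beyond the outer fences
    if value < inner_low or value > inner_high:
        return "mild outlier"         # between the inner and outer fences
    return "not an outlier"

# Hypothetical quartiles: IQR = 10, inner fences (5, 45), outer fences (-10, 60)
print(fences(20, 30))
print(classify(25, 20, 30))   # not an outlier
print(classify(50, 20, 30))   # mild outlier
print(classify(70, 20, 30))   # extreme outlier
```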
[Figure 3.13: Box and Whisker Plot, showing the two hinges and the distances 1.5 · IQR and 3.0 · IQR measured outward from each hinge.]

Table 3.10: Data for Box and Whisker Plot

  71  76  70  82  74  87  82  79  79  65
  63  74  74  73  62  64  68  64  68  62
  72  75  80  81  84  73  73  84  72  82
  81  85  77  81  69  69  71  73  65  71

Another use of box and whisker plots is to determine whether a distribution is skewed. The location of the median in the box can relate information about the skewness of the middle 50% of the data. If the median is located on the right side of the box, then the middle 50% are skewed to the left. If the median is located on the left side of the box, then the middle 50% are skewed to the right. By examining the length of the whiskers on each side of the box, a business researcher can make a judgement about the skewness of the outer values. If the longest whisker is to the right of the box, then the outer data are skewed to the right, and vice versa.

We shall use the data given in Table 3.10 to construct a box and whisker plot. After organizing the data into an ordered array, as shown in Table 3.11, it is relatively easy to determine the values of the lower quartile (Q1), the median, and the upper quartile (Q3). From these, the value of the interquartile range can be computed. The hinges of the box are located at the lower and upper quartiles, 69 and 80.5. The median is located within the box at distances of 4 from the lower quartile and 7.5 from the upper quartile. The distribution of the middle 50% of the data is skewed right, because the median is nearer to the lower or left hinge.
The inner fence is constructed by:

  Q1 - 1.5 · IQR = 69 - 1.5(11.5) = 69 - 17.25 = 51.75

and

  Q3 + 1.5 · IQR = 80.5 + 1.5(11.5) = 80.5 + 17.25 = 97.75

The whiskers are constructed by drawing a line segment from the lower hinge outward to the smallest data value and a line segment from the upper hinge outward to the largest data value. An examination of the data reveals that no data values in this set of numbers are outside the inner fence. The whiskers are constructed outward to the lowest value, which is 62, and to the highest value, which is 87.

To construct an outer fence, we calculate Q1 - 3 · IQR and Q3 + 3 · IQR, as follows:

  Q1 - 3 · IQR = 69 - 3(11.5) = 69 - 34.5 = 34.5
  Q3 + 3 · IQR = 80.5 + 3(11.5) = 80.5 + 34.5 = 115.0

Figure 3.14 is the computer printout for this box and whisker plot.

[Figure 3.14: Box and Whisker Plot of the table data; horizontal axis from 60 to 90.]

Table 3.11: Data in Ordered Array with Quartiles and Median (N = 40)

  87  80  73  69
  85  79  73  68
  84  79  73  68
  84  77  72  65
  82  76  72  65
  82  75  71  64
  82  74  71  64
  81  74  71  63
  81  74  70  62
  81  73  69  62

  Q1 = 69
  Q2 = median = 73
  Q3 = 80.5
  IQR = Q3 - Q1 = 80.5 - 69 = 11.5

Check: 69 - 1.5 × 11.5 = 51.75, so the minimum (62) is not an outlier; 80.5 + 1.5 × 11.5 = 97.75, so the maximum (87) is not an outlier.

R note: a small box plot example with two vectors.

  # x contains one value (16) well above the rest; y is 1 through 10
  x <- c(1:9, 16)
  y <- 1:10
  # draw the two box plots side by side, one colour each
  boxplot(x, y, col = rainbow(2))
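The worked example above can be reproduced end to end in Python. This sketch applies the percentile location rule from earlier in the chapter to the 40 values of Table 3.10; the helper name `percentile` and the variable names are chosen here, and the helper assumes an integer percentile strictly between 0 and 100.

```python
def percentile(ordered, p):
    """Percentile of an already sorted list, by the rule i = (p / 100) * n."""
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # i is whole: average the ith and (i + 1)th values (1-based positions).
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]   # otherwise the value at the whole part of i + 1

data = [71, 76, 70, 82, 74, 87, 82, 79, 79, 65,
        63, 74, 74, 73, 62, 64, 68, 64, 68, 62,
        72, 75, 80, 81, 84, 73, 73, 84, 72, 82,
        81, 85, 77, 81, 69, 69, 71, 73, 65, 71]

ordered = sorted(data)
q1, q2, q3 = (percentile(ordered, p) for p in (25, 50, 75))
iqr = q3 - q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
outliers = [x for x in ordered if x < inner[0] or x > inner[1]]

print(q1, q2, q3, iqr)             # 69.0 73.0 80.5 11.5
print(inner)                       # (51.75, 97.75)
print(outer)                       # (34.5, 115.0)
print(outliers)                    # []
print(ordered[0], ordered[-1])     # 62 87, the whisker endpoints
```

Plotting libraries commonly compute quartiles with a different interpolation rule, so a computer-drawn box for these data may place Q3 slightly away from 80.5.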
