2.5 – Using the Mean and Standard Deviation to Describe Data

The standard deviation is a measure of the variability of sample data. The smaller the standard deviation, the larger the percentage of measurements that lie close to the mean.

Chebyshev's Rule: For any number k > 1, at least $(1 - 1/k^2)$ of the measurements are within k standard deviations of the mean, that is, at least $(1 - 1/k^2)$ of the measurements are in the interval $(\bar{x} - ks,\ \bar{x} + ks)$.

Consequently:
a) If k = 2, then at least 3/4 = 75% of the measurements are within 2s of the mean.
b) If k = 3, then at least 8/9 ≈ 89% of the measurements are within 3s of the mean.

[Figure: intervals within 2s and 3s of the mean, containing at least 3/4 = 75% and at least 8/9 ≈ 89% of the measurements.]

Chebyshev's rule applies to a sample from any distribution.

Empirical Rule: If a sample is taken from a mound-shaped distribution (symmetric and bell-shaped), then
a) Approximately 68% of the measurements will fall within 1 standard deviation of the mean, that is, within the interval $(\bar{x} - s,\ \bar{x} + s)$.
b) Approximately 95% of the measurements will fall within 2 standard deviations of the mean, that is, within the interval $(\bar{x} - 2s,\ \bar{x} + 2s)$.
c) Approximately 99.7% (essentially all) of the measurements will fall within 3 standard deviations of the mean, that is, within the interval $(\bar{x} - 3s,\ \bar{x} + 3s)$.

Range approximation for s: since almost all measurements lie within 2 to 3 standard deviations of the mean, the range (max − min) spans roughly 4 to 6 standard deviations:

$4s \le \text{range} \le 6s \;\Rightarrow\; \dfrac{\text{range}}{6} \le s \le \dfrac{\text{range}}{4}$

Exercise: (Ex. 2.11, p. 80) The 50 companies' percentages of revenues spent on R&D are listed below (already sorted):

 5.2  5.6  5.9  6.0  6.5  6.5  6.5  6.6  6.8  6.9
 6.9  6.9  7.1  7.1  7.2  7.2  7.4  7.5  7.5  7.7
 7.7  7.8  7.9  8.0  8.0  8.1  8.2  8.2  8.2  8.4
 8.5  8.8  9.0  9.2  9.4  9.5  9.5  9.6  9.7  9.9
10.1 10.5 10.5 10.6 11.1 11.3 11.7 13.2 13.5 13.5

1. Calculate the range and use it to obtain a rough approximation of s. Ans: …… ≤ s ≤ ………
2. Compute $\bar{x}$ = …………… and s = ……………….
3. Calculate the intervals $\bar{x} \pm ks$ for k = 1, 2, 3, and for each interval give
   a. The percentage estimated by Chebyshev's Rule
   b. The percentage estimated by the Empirical Rule
   c. The actual percentage of observations in the interval. Compare it with a. and b.
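The three steps of the exercise can be checked numerically. The following is a minimal sketch, not the textbook's solution: it assumes Python's standard `statistics` module, uses the sample standard deviation (divisor n − 1), and counts observations strictly inside each interval. The names `rd`, `chebyshev_bound`, `EMPIRICAL`, and `actual_count` are illustrative choices, not part of the original notes.

```python
import statistics

# Percentages of revenue spent on R&D by the 50 companies (data from the exercise).
rd = [
    5.2, 5.6, 5.9, 6.0, 6.5, 6.5, 6.5, 6.6, 6.8, 6.9,
    6.9, 6.9, 7.1, 7.1, 7.2, 7.2, 7.4, 7.5, 7.5, 7.7,
    7.7, 7.8, 7.9, 8.0, 8.0, 8.1, 8.2, 8.2, 8.2, 8.4,
    8.5, 8.8, 9.0, 9.2, 9.4, 9.5, 9.5, 9.6, 9.7, 9.9,
    10.1, 10.5, 10.5, 10.6, 11.1, 11.3, 11.7, 13.2, 13.5, 13.5,
]

mean = statistics.mean(rd)
s = statistics.stdev(rd)  # sample standard deviation (divides by n - 1)

# Step 1: rough bounds on s from the range, range/6 <= s <= range/4.
data_range = max(rd) - min(rd)
s_low, s_high = data_range / 6, data_range / 4


def chebyshev_bound(k):
    """Chebyshev lower bound on the fraction within k standard deviations.

    For k = 1 the bound is 0, i.e. the rule says nothing useful.
    """
    return 1 - 1 / k**2


# Approximate fractions stated by the Empirical Rule.
EMPIRICAL = {1: 0.68, 2: 0.95, 3: 0.997}


def actual_count(data, k):
    """Number of observations strictly inside (mean - k*s, mean + k*s)."""
    lo, hi = mean - k * s, mean + k * s
    return sum(lo < x < hi for x in data)


print(f"range/6 = {s_low:.3f}, range/4 = {s_high:.3f}")
print(f"mean = {mean:.3f}, s = {s:.3f}")
for k in (1, 2, 3):
    actual = actual_count(rd, k) / len(rd)
    print(f"k={k}: Chebyshev >= {chebyshev_bound(k):.3f}, "
          f"Empirical ~ {EMPIRICAL[k]:.3f}, actual = {actual:.3f}")
```

Comparing the three printed columns for each k shows how conservative Chebyshev's bound is for data that are roughly mound-shaped.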
Answer table for item 3 of the exercise in 2.5:

k   Chebyshev's Rule   Empirical Rule   Actual
1
2
3

2.6 – Measures of Relative Standing

Key concepts: percentile, z-score

z-score = the number of standard deviations that x is above (if positive) or below (if negative) the mean.

[Figure: number line showing a measurement x and the mean, with the distance between them measured in units of s.]

The (sample) z-score for a measurement x is the number $z = \dfrac{x - \bar{x}}{s}$.

The mean of the z-scores is always 0: $\bar{z} = 0$ (see Exercise in Lecture 2.5).
The standard deviation of the z-scores is always 1: $s_z = 1$.

For a mound-shaped distribution:
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% (almost all) of the measurements will have a z-score between -3 and 3.

Exercise. Tom's SAT z-score is 2.0. Assuming a mound-shaped distribution, this means that
a) approximately 2.5% of the students who took the SAT scored better than Tom
b) approximately 97.5% of the students who took the SAT scored worse than Tom

The pth percentile (of a data set). Intuitively, the pth percentile is a number such that approximately p% of the measurements (arranged in ascending order) fall below it and approximately (100 - p)% fall above it.

[Figure: distribution split at the pth percentile, with p% of the measurements below and (100 - p)% above.]

Methods of computing percentiles vary; different software or calculators may give different answers.
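Two of the claims above can be illustrated numerically: z-scores always have mean 0 and standard deviation 1, and different percentile conventions can give different quartiles for the same data. The sketch below assumes Python's standard `statistics` module. The data set `sample` is hypothetical and chosen only for illustration, and the two `method` options of `statistics.quantiles` stand in for "different software", which is an assumption rather than a statement about any particular calculator.

```python
import statistics


def z_scores(data):
    """Return the sample z-score (x - mean) / s for every measurement."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
    return [(x - mean) / s for x in data]


# Hypothetical sample, for illustration only.
sample = [2, 4, 4, 5, 7, 9, 11]

z = z_scores(sample)
z_mean = statistics.mean(z)   # 0 up to floating-point rounding
z_sd = statistics.stdev(z)    # 1 up to floating-point rounding
print(f"mean of z-scores = {z_mean:.6f}, sd of z-scores = {z_sd:.6f}")

# Two common interpolation conventions for quartiles (25th, 50th, 75th percentiles).
q_exclusive = statistics.quantiles(sample, n=4, method="exclusive")
q_inclusive = statistics.quantiles(sample, n=4, method="inclusive")
print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
```

For this sample the two conventions agree on the lower quartile and the median but not on the upper quartile, which is the kind of discrepancy the note about software and calculators refers to.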
Important percentiles:
The median = 50th percentile
The lower quartile QL = 25th percentile
The upper quartile QU = 75th percentile

[Figure: number line marking Min, QL, median, QU, Max.]

2.7 – Methods for Detecting Outliers: Box Plot and z-Scores

Key concepts: interquartile range, inner and outer fences, outlier, box plot

The interquartile range (IQR) is the distance between the lower and upper quartiles: IQR = QU – QL

Inner fences:
Lower inner fence (LIF) = QL – 1.5 IQR
Upper inner fence (UIF) = QU + 1.5 IQR

Outer fences:
Lower outer fence (LOF) = QL – 3 IQR
Upper outer fence (UOF) = QU + 3 IQR

An outlier is an observation (or measurement) that is unusually large or small relative to the other values in a data set.

Rules of Thumb for Detecting Outliers

Box Plot Method:
o Observations falling beyond the inner fences are called outliers.
o Observations falling between the inner fences and the outer fences are called suspect outliers.
o Observations falling beyond the outer fences are deemed highly suspect outliers.

z-scores: Observations with z-scores greater than 3 in absolute value are considered outliers.

The numbers minimum, QL, median, QU, maximum are called the five-number summary.

Box Plot: Graphical visualization of the five-number summary (minimum, QL, median, QU, maximum) and the outliers. A box plot can be drawn horizontally or vertically.

[Figure: box plot with min, QL, med, QU, max and outliers marked.]

Example.
Data: 45 46 49 35 76 80 89 109 37 61 62 64 68 56 57 57 59 71 72 (n = 19)
Sorted Data: 35 37 45 46 49 56 57 57 59 61 62 64 68 71 72 76 80 89 109

Mean = 62.79, s = 18.00, Min = 35, Max = 109, Median = 61

TI-83: QL = 49, QU = 72, IQR = 72 – 49 = 23
LIF = 49 – 1.5×23 = 14.5    UIF = 72 + 1.5×23 = 106.5
LOF = 49 – 3×23 = -20       UOF = 72 + 3×23 = 141

Excel: QL = 52.5, QU = 71.5, IQR = 71.5 – 52.5 = 19
LIF = 52.5 – 1.5×19 = 24    UIF = 71.5 + 1.5×19 = 100
LOF = 52.5 – 3×19 = -4.5    UOF = 71.5 + 3×19 = 128.5

(Note that Excel gives different quartiles and fences, and hence possibly different outliers.)

1. Determine if there are any outliers (using the Box Plot method).
   Answer: 109 is a suspect outlier; there are no highly suspect outliers.
2. Determine if there are any outliers (using the z-scores method).
   Answer: $\bar{x} - 3s$ = 8.79, $\bar{x} + 3s$ = 116.79; no outliers in the z-score sense.
3. Make a box plot.
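Items 1 and 2 of the example can be reproduced in code. This is a sketch under stated assumptions: `quartiles_halves` takes the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd), which is assumed here to be the convention behind the TI-83 values, and `quartiles_interpolated` uses the "inclusive" interpolation of Python's `statistics.quantiles`, assumed to correspond to the Excel values. All function names are illustrative.

```python
import statistics

data = sorted([45, 46, 49, 35, 76, 80, 89, 109, 37, 61,
               62, 64, 68, 56, 57, 57, 59, 71, 72])


def quartiles_halves(xs):
    """QL and QU as medians of the lower and upper halves of sorted xs.

    The middle value is left out of both halves when n is odd.
    """
    half = len(xs) // 2
    return statistics.median(xs[:half]), statistics.median(xs[len(xs) - half:])


def quartiles_interpolated(xs):
    """QL and QU by linear interpolation ('inclusive' method)."""
    ql, _, qu = statistics.quantiles(xs, n=4, method="inclusive")
    return ql, qu


def fences(ql, qu):
    """Inner and outer fences computed from the quartiles."""
    iqr = qu - ql
    return {"LIF": ql - 1.5 * iqr, "UIF": qu + 1.5 * iqr,
            "LOF": ql - 3 * iqr, "UOF": qu + 3 * iqr}


def box_plot_outliers(xs, ql, qu):
    """Split observations into suspect and highly suspect outliers."""
    f = fences(ql, qu)
    suspect = [x for x in xs
               if f["LOF"] <= x < f["LIF"] or f["UIF"] < x <= f["UOF"]]
    highly_suspect = [x for x in xs if x < f["LOF"] or x > f["UOF"]]
    return suspect, highly_suspect


def z_score_outliers(xs, threshold=3.0):
    """Observations whose z-score exceeds the threshold in absolute value."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample standard deviation
    return [x for x in xs if abs((x - mean) / s) > threshold]


for label, (ql, qu) in [("halves", quartiles_halves(data)),
                        ("interpolated", quartiles_interpolated(data))]:
    print(label, "QL =", ql, "QU =", qu, fences(ql, qu),
          box_plot_outliers(data, ql, qu))
print("z-score outliers:", z_score_outliers(data))
```

Running both quartile conventions side by side makes it easy to see whether the choice of method changes which observations are flagged.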