Chapter 3: Numerical Descriptive Measures
Yandell – GSBA 502

Chapter Goals
After completing this chapter, you should be able to:
* Compute and interpret the mean, median, and mode for a set of data
* Find the range, variance, and standard deviation and know what these values mean
* Construct and interpret a box and whiskers plot
* Compute and explain the coefficient of variation
* Use numerical measures along with graphs, charts, and tables to describe data

Chapter Topics
* Measures of Center and Location: mean, median, mode, geometric mean, midrange
* Other Measures of Location: weighted mean, percentiles, quartiles
* Measures of Variation: range, interquartile range, variance and standard deviation, coefficient of variation
* Skewness (shape)
* Linear correlation

Summary Measures
Describing Data Numerically:
* Center and Location: mean, median, mode, weighted mean
* Other Measures of Location: percentiles, quartiles
* Variation: range, interquartile range, variance, standard deviation, coefficient of variation
* Skewness

Notation Conventions
Population parameters are denoted with a letter from the Greek alphabet:
* μ (mu) represents the population mean
* σ (sigma) represents the population standard deviation
Sample statistics are commonly denoted with letters from the Roman alphabet:
* \(\bar{X}\) (X-bar) represents the sample mean
* S represents the sample standard deviation

Measures of Center and Location: Overview
* Mean: \(\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}\) (sample), \(\mu = \frac{\sum_{i=1}^{N} X_i}{N}\) (population)
* Median: midpoint of ranked values
* Mode: most frequently observed value
* Weighted Mean: \(\bar{X}_W = \frac{\sum w_i X_i}{\sum w_i}\) (sample), \(\mu_W = \frac{\sum w_i X_i}{\sum w_i}\) (population)

Mean (Arithmetic Average)
* The mean is the arithmetic average of data values
* The most common measure of central tendency
* Mean = sum of values divided by the number of values
Sample mean, where n = sample size:
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} \]
Population mean, where N = population size:
\[ \mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N} \]

Mean (Arithmetic Average) (continued)
The mean is affected by extreme values (outliers):
* Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
* Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
[Figure: two dot plots on a 0 to 10 axis, with Mean = 3 and Mean = 4]

Median
* Not affected by extreme values
* In an ordered array, the median is the "middle" number (50% above, 50% below)
[Figure: two dot plots on a 0 to 10 axis, both with Median = 3]

Median (continued)
* To find the median, rank the n values in order of magnitude
* Find the value in the (n+1)/2 position
* If n or N is odd, the median is the middle number
* If n or N is even, the median is the average of the two middle numbers

Mode
* A measure of central tendency
* Value that occurs most often
* Not affected by extreme values
* Used for either numerical or categorical data
* There may be no mode
* There may be several modes
[Figure: two dot plots, one on a 0 to 14 axis with Mode = 5 and one on a 0 to 6 axis with no mode]

Weighted Mean
Used when values are grouped by frequency or relative importance.
Example: sample of 26 repair projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted mean days to complete:
\[ \bar{X}_W = \frac{\sum w_i X_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days} \]

Review Example
Five houses on a hill by the beach.
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
[Figure: five houses on a hill, priced $2,000 K, $500 K, $300 K, $100 K, $100 K]

Summary Statistics
* Sum: $3,000,000
* Mean: $3,000,000/5 = $600,000
* Median: middle value of ranked data = $300,000
* Mode: most frequent value = $100,000
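The review example and the repair-project weighted mean can be checked with a few lines of Python. This is only a sketch using the standard `statistics` module; the helper `weighted_mean` and the variable names are illustrative and do not come from the slides.

```python
from statistics import mean, median, multimode

# House prices from the review example (five houses on a hill by the beach).
house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

house_mean = mean(house_prices)        # sum of values / number of values
house_median = median(house_prices)    # middle value of the ranked data
house_modes = multimode(house_prices)  # list of the most frequent value(s)


def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i). Hypothetical helper, not from the slides."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Repair-project example: days to complete, weighted by frequency.
days = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]
repair_mean = weighted_mean(days, frequency)

print(f"mean = {house_mean}, median = {house_median}, modes = {house_modes}")
print(f"weighted mean = {repair_mean:.2f} days")
```

The single $2,000,000 house pulls the mean up to twice the median, which is the outlier sensitivity described for the mean.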
Which measure of location is the "best"?
* The mean is generally used, unless extreme values (outliers) exist
* Then the median is often used, since the median is not sensitive to extreme values
* Example: median home prices may be reported for a region because they are less sensitive to outliers

Note
The mean and median values do not have to be values that are part of the data set.
Example (four observations): 2, 3, 4, 11
* Mean = (2 + 3 + 4 + 11)/4 = 5
* Median = 3.5 (the median position is (N+1)/2 = 2.5, so use the midpoint of the two middle-most values)

Shape of a Distribution
* Describes how data is distributed
* Symmetric or skewed
* Left-skewed: Mean < Median < Mode (longer tail extends to left)
* Symmetric: Mean = Median = Mode
* Right-skewed: Mode < Median < Mean (longer tail extends to right)

Measuring Skewness
A number called the coefficient of skewness (SK) is commonly used to measure skewness:
\[ SK = \frac{3(\bar{X} - \text{Median})}{S} \]
where S is the sample standard deviation, \(\bar{X}\) is the sample mean, and Median is the sample median.

Measuring Skewness (continued)
The magnitude of SK indicates the degree of skewness, where \(-3 \le SK \le +3\):
* SK < 0: skewed left
* SK = 0: symmetric (not skewed)
* SK > 0: skewed right
The coefficient of skewness is calculated and reported by many computer statistical software packages.

Other Location Measures
Percentiles. The pth percentile in a data array (where 0 ≤ p ≤ 100):
* p% are less than or equal to this value
* (100 – p)% are greater than or equal to this value
Quartiles:
* 1st quartile = 25th percentile
* 2nd quartile = 50th percentile = median
* 3rd quartile = 75th percentile

Percentiles
The pth percentile in an ordered array of n values is the value in the ith position, where
\[ i = \frac{p}{100}(n + 1) \]
Example: the 60th percentile in an ordered array of 19 values is the value in the 12th position:
\[ i = \frac{60}{100}(19 + 1) = 12 \]

Quartiles
Quartiles split the ranked data into 4 equal groups (25% each), separated by Q1, Q2, and Q3.
Example: find the first quartile.
Sample data in ordered array (n = 9): 11 12 13 16 16 17 18 21 22
Q1 = 25th percentile, so find the \(\frac{25}{100}(9 + 1) = 2.5\) position. Use the value halfway between the 2nd and 3rd values, so Q1 = 12.5.

Quartile Formulas
Find a quartile by determining the value in the appropriate position of the ranked data, where n is the number of observed values:
* First quartile position: Q1 = (n+1)/4
* Second quartile position: Q2 = (n+1)/2 (the median position)
* Third quartile position: Q3 = 3(n+1)/4

Box and Whisker Plot
A graphical display of data using the 5-number summary:
Minimum -- Q1 -- Median -- Q3 -- Maximum
[Figure: example box and whisker plot; each of the four segments between Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum holds 25% of the data]

Shape of Box and Whisker Plots
* The box and central line are centered between the endpoints if data is symmetric around the median
* A box and whisker plot can be shown in either vertical or horizontal format

Distribution Shape and Box and Whisker Plot
[Figure: left-skewed, symmetric, and right-skewed distributions, each with its box and whisker plot marked Q1, Q2, Q3]

Box-and-Whisker Plot Example
Below is a box-and-whisker plot for the following data:
0 2 2 2 3 3 4 5 5 10 27
Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
[Figure: box-and-whisker plot with axis marks at 0, 2, 3, 5, and 27]
This data is very right skewed, as the plot depicts.

Using PHStat to construct a Box-and-Whisker Plot
* The PHStat add-in can be used to easily create a box-and-whisker plot
* If you have several data sets and wish to make comparisons, PHStat can create multiple plots in the same display window

Measures of Variation
* Range
* Interquartile Range
* Variance (population variance, sample variance)
* Standard Deviation (population standard deviation, sample standard deviation)
* Coefficient of Variation
Measures of variation give information on the spread or variability of the data values.
[Figure: two distributions with the same center and different variation]

Range
* Simplest measure of variation
* Difference between the largest and the smallest observations:
\[ \text{Range} = x_{\text{maximum}} - x_{\text{minimum}} \]
Example: for data plotted on a 0 to 14 axis with smallest value 1 and largest value 14, Range = 14 - 1 = 13.

Disadvantages of the Range
* Ignores the way in which data are distributed: two differently distributed data sets on a 7 to 12 axis both have Range = 12 - 7 = 5
* Sensitive to outliers:
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5 gives Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120 gives Range = 120 - 1 = 119

Interquartile Range
* Can eliminate some outlier problems by using the interquartile range
* Eliminate some high- and low-valued observations and calculate the range from the remaining values
* Interquartile range = 3rd quartile – 1st quartile
Example: X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70 (25% of the data in each segment)
Interquartile range = 57 – 30 = 27

Variance
Average of squared deviations of values from the mean.
Sample variance:
\[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \]

Standard Deviation
* Most commonly used measure of variation
* Shows variation about the mean
* Has the same units as the original data
Sample standard deviation:
\[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} \]
Population standard deviation:
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \]
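The position rule i = (p/100)(n + 1) for percentiles and quartiles, the five-number summary, and the interquartile range can be sketched in Python as follows. The functions `percentile` and `five_number_summary` are hypothetical helpers written for this illustration. The slides only show the halfway case (position 2.5), so the linear interpolation used for other fractional positions is an assumption.

```python
def percentile(data, p):
    """Value at 1-based position i = p * (n + 1) / 100 of the ranked data.

    Fractional positions are interpolated linearly between the two
    neighbouring ranked values (an assumption that generalises the
    'halfway between' rule), and positions are clamped to 1..n.
    """
    ranked = sorted(data)
    n = len(ranked)
    i = p * (n + 1) / 100
    i = min(max(i, 1), n)      # keep the position inside the array
    lower = int(i)             # whole part of the position
    frac = i - lower           # fractional part of the position
    if frac == 0:
        return ranked[lower - 1]
    return ranked[lower - 1] + frac * (ranked[lower] - ranked[lower - 1])


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))


quartile_data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # first-quartile example
box_data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]         # box-and-whisker example

q1 = percentile(quartile_data, 25)
summary = five_number_summary(box_data)
iqr = summary[3] - summary[1]                          # Q3 - Q1

print("Q1 of quartile example:", q1)
print("five-number summary:", summary)
print("interquartile range:", iqr)
```

Statistical packages use several slightly different quartile conventions, so results from other software may differ from this (n + 1) rule on small data sets. Python's own `statistics.quantiles` with its default `method='exclusive'` should agree with the positions used here.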
Calculation Example: Sample Standard Deviation
Sample Data (Xi): 10 12 14 15 17 18 18 24
n = 8, Mean = \(\bar{X}\) = 16
The squared deviations from the mean are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130:
\[ s = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}} = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} \approx 4.3095 \]

Measuring variation
[Figure: a narrow distribution labeled "Small standard deviation" and a wide distribution labeled "Large standard deviation"]

Comparing Standard Deviations
* Data A: Mean = 15.5, s = 3.338
* Data B: Mean = 15.5, s = .9258
* Data C: Mean = 15.5, s = 4.57
[Figure: dot plots of Data A, B, and C on an 11 to 21 axis]

Advantages of Variance and Standard Deviation
* Each value in the data set is used in the calculation
* Values far from the mean are given extra weight (because deviations from the mean are squared)

The Empirical Rule
If the data distribution is bell-shaped, then the interval:
* μ ± 1σ contains about 68% of the values in the population or the sample
* μ ± 2σ contains about 95% of the values in the population or the sample
* μ ± 3σ contains about 99.7% of the values in the population or the sample
[Figure: bell-shaped curves with the 68%, 95%, and 99.7% intervals marked around μ]

Coefficient of Variation
* Measures relative variation
* Always in percentage (%)
* Shows variation relative to mean
* Is used to compare two or more sets of data measured in different units
Population:
\[ CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\% \]
Sample:
\[ CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\% \]

Comparing Coefficient of Variation
Stock A: average price last year = $50, standard deviation = $5
\[ CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\% \]
Stock B: average price last year = $100, standard deviation = $5
\[ CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\% \]
Both stocks have the same standard deviation, but stock B is less variable relative to its price.

Using Microsoft Excel
Descriptive statistics are easy to obtain from Microsoft Excel. Using the house price worksheet:
* Use menu choice: tools / data analysis / descriptive statistics
* Enter details in dialog box
* Check box for summary statistics
* Click OK

Excel output
Microsoft Excel descriptive statistics output, using the house price data:
House Prices: $2,000,000; $500,000; $300,000; $100,000; $100,000
[Figure: Excel descriptive statistics output]

Scatter Plots and Correlation
* A scatter plot (or scatter diagram) is used to show the relationship between two variables
* Correlation analysis is used to measure the strength of the linear association (linear relationship) between two variables
* Only concerned with strength of the relationship
* No causal effect is implied

Scatter Plot Examples
[Figure: scatter plots of Y against X showing strong relationships, weak relationships, and no relationship]

Correlation Coefficient
* The population correlation coefficient ρ (rho) measures the strength of the association between the variables
* The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations

Features of ρ and r
* Unit free
* Range between -1 and 1
* The closer to -1, the stronger the negative linear relationship
* The closer to 1, the stronger the positive linear relationship
* The closer to 0, the weaker the linear relationship

Examples of Approximate r Values
[Figure: five scatter plots of Y against X with r = -1, r = -.6, r = 0, r = +.3, and r = +1]
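Returning to the variation measures from earlier in this chapter, the sketch below recomputes the sample standard deviation for the eight values 10, 12, ..., 24, the coefficient of variation for the two stocks, and the coefficient of skewness SK = 3(mean - median)/S for the house prices. The function names are illustrative. The SK value for the house prices is computed here and is not quoted from the slides.

```python
from math import sqrt
from statistics import mean, median, stdev

# Sample standard deviation, following S = sqrt(sum((Xi - Xbar)^2) / (n - 1)).
sample = [10, 12, 14, 15, 17, 18, 18, 24]
x_bar = mean(sample)
sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)   # squared deviations
s = sqrt(sum_sq_dev / (len(sample) - 1))


def coefficient_of_variation(std_dev, average):
    """Return the standard deviation as a percentage of the mean."""
    return 100 * std_dev / average


cv_a = coefficient_of_variation(5, 50)     # Stock A: mean $50, s = $5
cv_b = coefficient_of_variation(5, 100)    # Stock B: mean $100, s = $5


def skewness_coefficient(data):
    """Coefficient of skewness SK = 3 * (mean - median) / S."""
    return 3 * (mean(data) - median(data)) / stdev(data)


house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]
sk = skewness_coefficient(house_prices)

print(f"sum of squared deviations = {sum_sq_dev}, s = {s:.4f}")
print(f"CV_A = {cv_a}%, CV_B = {cv_b}%")
print(f"SK for house prices = {sk:.3f}")
```

A positive SK for the house prices is consistent with the right skew caused by the single $2,000,000 house.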
Calculating the Correlation Coefficient
Sample correlation coefficient:
\[ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\left[\sum (X - \bar{X})^2\right]\left[\sum (Y - \bar{Y})^2\right]}} \]
or the algebraic equivalent:
\[ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\left(\sum X^2\right) - \left(\sum X\right)^2\right]\left[n\left(\sum Y^2\right) - \left(\sum Y\right)^2\right]}} \]
where:
* r = sample correlation coefficient
* n = sample size
* X = value of the independent variable
* Y = value of the dependent variable

Calculation Example

Tree Height (Y)   Trunk Diameter (X)   XY       Y^2       X^2
35                8                    280      1225      64
49                9                    441      2401      81
27                7                    189      729       49
33                6                    198      1089      36
60                13                   780      3600      169
21                7                    147      441       49
45                11                   495      2025      121
51                12                   612      2601      144
Σ = 321           Σ = 73               Σ = 3142 Σ = 14111 Σ = 713

Calculation Example (continued)
\[ r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\left(\sum X^2\right) - \left(\sum X\right)^2\right]\left[n\left(\sum Y^2\right) - \left(\sum Y\right)^2\right]}} = \frac{8(3142) - (73)(321)}{\sqrt{\left[8(713) - (73)^2\right]\left[8(14111) - (321)^2\right]}} = 0.886 \]
[Figure: scatter plot of Tree Height, Y (0 to 70) against Trunk Diameter, X (0 to 14)]
r = 0.886 → relatively strong positive linear association between X and Y

Excel Output
Excel Correlation Output (Tools / data analysis / correlation…):

                  Tree Height   Trunk Diameter
Tree Height       1
Trunk Diameter    0.886231      1

The value 0.886231 is the correlation between Tree Height and Trunk Diameter.

Chapter Summary
* Described measures of center and location: mean, median, mode
* Discussed percentiles and quartiles
* Described measures of variation: range, interquartile range, variance, standard deviation, coefficient of variation
* Created box-and-whisker plots
* Illustrated distribution shapes (symmetric, skewed)
* Discussed linear correlation
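As a check on the tree height and trunk diameter example, both forms of the sample correlation coefficient can be computed directly. This is a minimal sketch in plain Python. The function names `r_from_deviations` and `r_from_sums` are made up for this illustration.

```python
from math import sqrt

height = [35, 49, 27, 33, 60, 21, 45, 51]    # Y, tree height
diameter = [8, 9, 7, 6, 13, 7, 11, 12]       # X, trunk diameter


def r_from_deviations(x, y):
    """r = sum((X - Xbar)(Y - Ybar)) / sqrt(sum((X - Xbar)^2) * sum((Y - Ybar)^2))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def r_from_sums(x, y):
    """Algebraically equivalent form built from the column totals of the table."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r_dev = r_from_deviations(diameter, height)
r_sum = r_from_sums(diameter, height)

print(f"r (deviation form) = {r_dev:.6f}")
print(f"r (sums form)      = {r_sum:.6f}")
```

Both forms give the same r up to floating-point rounding. The sums form mirrors the table columns (ΣX, ΣY, ΣXY, ΣX², ΣY²), which is why it is convenient for hand calculation.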