Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Descriptive Statistics Prerequisite Material MGS 8110 Regression & Forecasting Descriptive Analysis of Data Given a bunch of data (simple data for only one variable), how is the best way to summarize the data? L00D MGS 8110 - Descriptive Statistics 2 Measures of Central Tendency • Mean • Median x if n is odd x i n M x n 1 M xn 2 1 2 2 if n is even • M (x n x n ) / 2 2 2 1 • Mode • Mid Range M R (x min x max ) / 2 • Mid InterQuartile M Q ( x .25 x .75 ) / 2 the xi value that occurs most frequently L00D MGS 8110 - Descriptive Statistics 3 Measures of Variability Av Deviate (x x) • Av Absolute Deviate x x • Variance • i n i n s 2 (x x) 2 i n 1 (x x) 2 • Standard Deviation • Range R ( x max x min ) • Inter Quartile Range (IQR) RQ ( x .75 x .25 ) s i n 1 L00D MGS 8110 - Descriptive Statistics 4 Lets Review - Statistical Precepts Use the Mean for quantitative, symmetric data. Use the Medium for quantitative non-symmetric data. Use the mode for categorical data. Use the Variance when doing calculations. Use the Standard Deviation when presenting the results of the calculations. Major Teaching Points are frequently shown in green boxes L00D MGS 8110 - Descriptive Statistics 5 Appropriate Statistics discussed in chapter 2 Quantitative - Bell Shaped Quantitative - NOT Bell Shaped Ordinal Nominal Mean Median Median Mode St Dev Quartile Range Quartile Range NA Histogram Histogram Tables or Bar charts Tables or Bar charts L00D MGS 8110 - Descriptive Statistics 6 More Review s2 (x x) s 2 – Need to memorize formules Used for calculation, but not for presentations. Units are squared (e.g., inches squared). i n 1 ( xi x )2 Used for presentations. Common units (e.g., inches). n 1 Divide by n-1 instead of n in order to get an unbiased estimate. L00D MGS 8110 - Descriptive Statistics 7 Interpretation of Standard Deviation (1 of 2) If the data is normally distributed ( x s , x s ) contains 68.27% of thepopulation ( x 2s , x 2s ) contains 95.45% of thepopulation ( x 3s , x 3s ) contains 99.73 of thepopulation Statistical Precepts Two-thirds of the data is contained in one sigma. 95% of the data is contained in two sigma. Almost all of the data is contained in three sigma. L00D MGS 8110 - Descriptive Statistics 8 Interpretation of Standard Deviation (2 of 2) If ever asked to explain what the Standard Deviation means, say “two-thirds of the data will be within plus or minus one Standard Deviation from the mean”. If ever asked for the “worst case” or “best case” outcome calculate “mean – (2)sigma” and/or “mean + (2)sigma”. Statistical Precepts Definition of Standard Deviation - two-thirds of the data is contained in a range of values that are two sigma wide. Worst case outcome is m – 2s. Best case outcome is m + 2s. L00D MGS 8110 - Descriptive Statistics 9 Other Measures • Percentiles xp is the xi such that p% of the xi < xp (measure of tails) • Quartiles are percentile where p = 25, 50 or 75 the lower, middle or upper quartile. (measure of tails) • C.25 C1 x 1 or x 1 4 n 4 C.50 C2 x n 1 n 3 4 or or x n 1 4 (x n x n ) / 2 2 2 C.75 C3 x 3 x 3 4 n 1 n 4 4 x 3n 1 2 1 C0 x minimum C4 x maximum 4 L00D MGS 8110 - Descriptive Statistics 10 Other Measures • Coefficient of Variation (percentage measure of variability) • Correlation Coefficient (measure of linear association) s Cv x ( xi x )( y i y ) / sx s y n 1 1,1 L00D MGS 8110 - Descriptive Statistics 11 4 Interpretation of Correlation Coefficient n=25 1,1 3 2 Y variable 1 0 -4 -3 -2 -1 0 1 2 3 4 -1 -2 =0 -3 -4 4 4 3 3 2 2 1 1 0 -4 -3 -2 -1 0 1 2 3 4 Y variable Y variable X variable 0 -4 -3 -2 -1 0 -1 -1 -2 -2 -3 -3 -4 X variable =.8 -4 1 2 3 4 =1 X variable L00D MGS 8110 - Descriptive Statistics 12 4 Interpretation of Correlation Coefficient n=25 1,1 3 2 Y variable 1 0 -4 -3 -2 -1 0 1 2 3 4 -1 -2 =0 -3 -4 4 4 3 3 2 2 1 1 0 -4 -3 -2 -1 0 1 2 3 4 Y variable Y variable X variable 0 -4 -3 -2 -1 0 -1 -1 -2 -2 -3 -3 -4 X variable =-.8 -4 1 2 3 4 =-1 X variable L00D MGS 8110 - Descriptive Statistics 13 Interpretation of Correlation Coefficient 1 ( xi x )( yi y ) n 1 sx s y 0.9 Percentage of Variability Explained 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Correlation Coefficient L00D MGS 8110 - Descriptive Statistics 14 Interpretation of Correlation Coefficient (1 of 2) Statistical tests of correlation coefficients are relatively meaningless. These tests are based on the hypothesis that “ = 0”. Based on the previous graph, knowing that a correlation coefficient is greater than zero is not necessarily a valuable piece of information. In terms of “Practical Significance” (compared to “Statistical Significance”), the correlation coefficient has to be at least greater than .5. From the previous graph it can be seen that = .5 only explains 25% of the variability in the data. Statistical Precept must be greater than .5 to be of Practical Significance L00D MGS 8110 - Descriptive Statistics 15 Other Measures • Skew (measure of symmetry) (page 3 of 3) (x x) 3 i 3 n • Kurtosis (measure of peakedness) 3 2 ( xi x ) n 3 , more typicall y 3,3 2 (x x) 4 i 4 n 2 ( xi x ) 2 n 3 3 , more typicall y 3,3 L00D MGS 8110 - Descriptive Statistics 16 Verifying Bell Shape (Normal Distribution) • Negative Skew if the distribution has a ‘long’ tail to the left, measured as skewness -1 • Positive Skew if the distribution has a ‘long’ tail to the right (more common situation), measured as skewness +1 • Symmetric if -1 skewness +1 • Peaked Distribution if Kurtosis is a large positive number ( +1). • Flat Distribution if Kurtosis is a large negative number ( -1). • Normal shape (proportionally S-shaped sides) if Kurtosis near zero. L00D MGS 8110 - Descriptive Statistics 17 Verifying Bell Shape (Normal Distribution) Statistical Precept Bell shaped (Normally distributed) if -1 skewness +1 and if -1 kurtosis +1 L00D MGS 8110 - Descriptive Statistics 18 How is the best way to summarize data? (our original question) • Central Tendency (Mean, Median & Mode) • Variability (Variance & Standard Deviation) • Shape (Percentiles, Skewness & Kurtosis) • Association (correlation) L00D MGS 8110 - Descriptive Statistics 19 Notations Parameter Population Sample Value True Value Mean (average) m X Standard Deviation s s Variance s2 s2 Correlatio n p L00D MGS 8110 - Descriptive Statistics 20 Standard Deviation of Sample Mean sX sX n Called the “Standard Error” of the Mean L00D MGS 8110 - Descriptive Statistics 21 Insert / Function examples (1 of 3) • Mean Average(A1:A10) • Median Median(A1:A10) • Mode Mode(A1:A10) • Mid Range ( MAX(A1:A10) + MIN(A1:A10) ) / 2 • InterQuartile ( Quartile(A1:A10,1) + Quartile(A1:A10,3) ) / 2 L00D MGS 8110 - Descriptive Statistics 22 Insert / Function examples (2 of 3) • Av Deviate NA • Av Absolute Deviate AveDev(A1:A10) • Variance Var(A1:A10) • Standard Deviation StDev(A1:A10) • Range MAX(A1:A10) - MIN(A1:A10) • Inter Quartile Range (IQR) Quartile(A1:A10,3) - Quartile(A1:A10,1) L00D MGS 8110 - Descriptive Statistics 23 Insert / Function examples (3 of 3) • Percentiles Percentile(A1:A10,.05) • Quartiles Quartile(A1:A10,q) where q=0,1,2,3,4 • Coef. of Variation StDev(A1:A10)/Average(A1:A10) • Correlation Correl(A1:A10,B1:B10) • Skew Skew(A1:A10) • Kurtosis Kurt(A1:A10) L00D MGS 8110 - Descriptive Statistics 24 Example Calculations Q. Should I use the Mean or the Median to state the Central value of this data? -0.31 -1.27 =SKEW(Height) =KURT(Height) 0.17 -1.03 =SKEW(Weight) =KURT(Weight) Answer – Both variables have a somewhat peaked distributions (Kurtosis greater than 1), but both variables have very symmetric distributions (non-skewed distribution); hence, use Mean. L00D MGS 8110 - Descriptive Statistics 25 Example Calculations Q. The Standard Deviation for Height is almost 2 inches, what is the practical interpretation of this value? 67.1 71.0 3.87 =B14-B15 =B14+B15 =F10-F9 Answer – The height of 2/3 of the population will vary by less than 4 inches (3.87”). L00D MGS 8110 - Descriptive Statistics 26 Example Calculations Q. What is the height of the shortest person and the tallest person that I may meet today (worst case and best case)? 65.2 72.9 =AvHt-2*StDevHt =AvHt+2*StDevHt Answer – The shortest person will be 5’-5” (65.2”) and the tallest person will be 6’-1” (72.9”). L00D MGS 8110 - Descriptive Statistics 27 Example Calculations Q. What is the height of the shortest person that I may meet over the next year? 66.0 =PERCENTILE(Height,0.01) Answer – The shortest person that I am likely to meet in the foreseeable future will be 5’-6” (66.0”). L00D MGS 8110 - Descriptive Statistics 28 Example Calculations Q. The answers to the two previous questions are not consistent. The 5% values calculated as Mean – 2(Sigma) was 5’-5” where as the 1% value calculated as a Percentile was 5’-6”. Answer – These types of inconsistencies (i.e., errors) will occur with small samples. The procedure used by the PERCENTILE function is based on an interpolated calculation with the two smallest values in the sample. L00D MGS 8110 - Descriptive Statistics 29 Example Calculations Q. Which variable, Height or Weight, has the greatest relative variability? Coef of Variation 3% 11% =StDevHt/AvHt =StDevWt/AvWt Answer – In agreement with our intuition, Weight is 3 to 4 times more variable than height (11/3 = 3.67). L00D MGS 8110 - Descriptive Statistics 30 Example Calculations Q. Is there a relationship between Height and Weight and if so how large is the relationship? 0.783 0.61 =CORREL(Height,Weight) =G23^2 1 0.9 Percentage of Variability Explained 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Answer – The correlation between Height and Weight is .78 which means that about 60% (.61) of the variability in weight is due to differences in Height. Correlation Coefficient L00D MGS 8110 - Descriptive Statistics 31 Example Calculations Q. Given that there is a relationship between Height and Weight, is the relationship linear or nonlinear? 0.783 0.61 =CORREL(Height,Weight) =G23^2 Answer – Simple statistics cannot be used to determine linear versus non-linear, would need to plot the data. The correlation indicates that there is a relatively strong linear relationship, but a plot of the data (Weight vs. Height) may indicate that there is an even stronger non-linear relationship L00D MGS 8110 - Descriptive Statistics 32 Example Calculations Q. Are Height and Weight Normally distributed? -0.31 -1.27 =SKEW(Height) =KURT(Height) 0.17 -1.03 =SKEW(Weight) =KURT(Weight) Answer – Based on out Rule-of-Thumb test (-1 < Skew < +1 and -1 < Kurt < +1), neither of these variables are normally distributed. L00D MGS 8110 - Descriptive Statistics 33 Example Calculations Q. Given that the variables are NOT Normally distributed, why do I care? Answer – You previous interpretation of the Standard Deviation maybe somewhat inaccurate (“The height of 2/3 of the population will vary by less than 4 inches “). Also, you previous interpretation of Worst Case and Best Case maybe somewhat inaccurate (“The shortest person will be 5’-5” and the tallest person will be 6’-1”). L00D MGS 8110 - Descriptive Statistics 34 Example Calculations Q. The average Height is estimated to be 69.1”, how good is that estimate? 67.8 70.3 =AvHt-2*StErrorHt =AvHt+2*StErrorHt Answer – The true average height could be anywhere between 67.8 inches to 70.3 inches. A better estimate could be obtained if a large sample was available. L00D MGS 8110 - Descriptive Statistics 35 More about Variability s (x x) s ( xi x )2 s (x m) 2 Use StDev (or Var) in Excel i n 1 N 2 i n Alternative formulation 1) if every item in the Universe is included in the Sample or 2) The Mean is know with certainty. Use StDevP (or VarP) in Excel L00D MGS 8110 - Descriptive Statistics 36 0.18 0.16 0.14 Probability Density Normal Calculations Probability Density Function for Male Height 0.12 0.10 0.08 0.06 .95 0.04 0.02 0.00 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 76 77 78 79 Height NORMINV(0.95,68.8,2.6)=73.08 Cumulative Distribution Function for Male Height 1.00 .95 0.90 Cumulative Distribution 0.80 0.70 0.60 0.50 0.40 0.30 0.20 0.10 0.00 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 Height L00D MGS 8110 - Descriptive Statistics 37 Probability Density Function for Male Height 0.16 0.14 Probability Density Normal Calculations 0.18 0.12 0.10 0.08 0.06 0.04 0.02 .244 0.00 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 76 77 78 79 Height NORMDIST(67,68.8,2.6,TRUE)=.244 Cumulative Distribution Function for Male Height 1.00 0.90 Cumulative Distribution 0.80 0.70 0.60 0.50 0.40 0.30 .244 0.20 0.10 0.00 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 Height L00D MGS 8110 - Descriptive Statistics 38 x x z i s 0.45 0.40 0.35 Probability Density Standardized Normal Calculations Probability Density Function for Male Height 0.30 0.25 0.20 0.15 0.10 .95 0.05 X variable has mean and StDev of m and s which are estimated by x bar and s. 0.00 -5 -4 -3 -2 -1 0 1 2 3 4 5 4 5 Height Cumulative Distribution Function for Male Height 1.00 NORMSINV(0.95)=1.645 0.80 Cumulative Distribution Z variable has mean=0 and StDev=1. Z is a “standardized normal”. .95 0.90 0.70 0.60 0.50 0.40 0.30 0.20 0.10 NORMSINV(.05)=-1.645 0.00 -5 -4 -3 -2 -1 0 1 2 3 Height L00D MGS 8110 - Descriptive Statistics 39 0.45 0.40 0.35 Probability Density Standardized Normal Calculations Probability Density Function for Male Height 0.30 0.25 0.20 0.15 0.10 0.05 .159 0.00 -5 -4 -3 -2 -1 0 1 2 3 4 5 4 5 Height NORMSDIST(-1)=.159 Cumulative Distribution Function for Male Height NORMSDIST(+1)=.841 1.00 0.90 Cumulative Distribution 0.80 0.70 0.60 0.50 0.40 0.30 0.20 .159 0.10 0.00 -5 -4 -3 -2 -1 0 1 2 3 Height L00D MGS 8110 - Descriptive Statistics 40 t-distribution Probability Density Function for Male Height 0.45 t-distribution is needed if s is not know and estimated by s and n<30. 0.40 Probability Density 0.35 0.30 0.25 0.20 0.15 0.10 .95 0.05 0.00 -5 -4 -3 -2 -1 0 1 2 3 4 5 Height t-Dist df=4 t-Dist df=16 z Normal L00D MGS 8110 - Descriptive Statistics 41 Probability Density Function for Male Height 0.35 0.30 Probability Density t-distribution Calculations 0.40 0.25 0.20 0.15 one-tail 0.10 0.05 .058 0.00 -5 -4 -3 -2 -1 0 1 2 3 4 5 Height TDIST(2,4,1)=.058 Cumulative Distribution Function for Male Height 1.0 TDIST(X, d.f., # tails) 0.9 “t” with tails=1 sums from + infinity. “Z” and “Normal” sums from – infinity. Cumulative Distribution 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 .058 0.0 -5 -4 -3 -2 -1 0 1 2 3 4 Height L00D MGS 8110 - Descriptive Statistics 42 5 two-tail 0.40 0.35 0.30 Probability Density t-distribution Calculations Probability Density Function for Male Height 0.25 0.20 0.15 0.10 0.05 .058 -5 -4 -3 .058 0.00 -2 -1 0 1 2 3 4 5 Height TDIST(2,4,2)=.116 Cumulative Distribution Function for Male Height 1.0 TDIST(X, d.f., # tails) 0.9 “t (2 tail)” sums simultaneously from both – infinity and + infinity. Undefined for negative values of t. Cumulative Distribution 0.8 0.7 0.6 0.5 0.4 0.3 0.2 .116 0.1 0.0 -5 -4 -3 -2 -1 0 1 2 3 4 5 Height L00D MGS 8110 - Descriptive Statistics 43 Loading “Data Analysis” in Office 2003 Both NO / Tools / Add-Ins / Will need to have original MS Office CD. L00D MGS 8110 - Descriptive Statistics 44 Example of Tools / Data Analysis / Descriptive Statistics L00D MGS 8110 - Descriptive Statistics 45 Example Output of Tools / Data Analysis / Descriptive Statistics Mean Standard Error Median Mode Standard Deviation Sample Variance Kurtosis Skewness Range Minimum Maximum Sum Count 47.84 2.63 48.00 41.00 11.46 131.25 -0.73 0.29 41 30 71 909 19 L00D MGS 8110 - Descriptive Statistics 46 Loading “Data Analysis” in Office 2007 1) Click the Office button in the upper left hand corner of the Excel. Both NO NO 2) Click the “Excel Options” tab in the bottom righthand cornet of the dropdown menu gotten from step #1. L00D MGS 8110 - Descriptive Statistics 47 Loading “Data Analysis” in Office 2007 Both Both 3) Click Add-Ins in the left banner of the Excel Options menu. NO NO 4) Click “Analysis ToolPak” in the Add-ins menu. Then, select BOTH “Analysis ToolPak” and “Analysisi ToolPak – VBA” 5) Click the Go button at the bottom right hand corner of the Excel Options menu. Don’t click the “OK” button”. L00D MGS 8110 - Descriptive Statistics 48 Loading “Data Analysis” in Office 2007 L00D MGS 8110 - Descriptive Statistics 49 Example of Tools / Data Analysis / Descriptive Statistics L00D MGS 8110 - Descriptive Statistics 50 Precision of numerical results – state “3 Significant Digits” Too Many Digits Too Few Digits Correct 3+ significant digits Wt Wt Wt Mean Mean 76.5 Mean 76.5 Standard Error 4.8 Standard Error 4.78 76.506 Standard Error 4.780314 Median 73.58 Median 73.6 Median 73.6 Mode #N/A Mode #N/A Mode #N/A Standard Deviation 15.11668 Standard Deviation 15.1 Standard Deviation 15.1 Sample Variance 228.514 Sample Variance 228.5 Sample Variance 228.5 Kurtosis -0.67472 Kurtosis -0.7 Kurtosis -0.675 Skewness 0.4 Skewness 0.362 Range 48.0 Range 48.0 Minimum 54.3 Minimum 54.3 Maximum 102.3 Maximum 102.3 Sum 765.1 Sum 765.1 Count 10.0 Count Skewness 0.362102 too few digits Range 47.98 Minimum 54.28 Maximum 102.26 Sum meaningless 765.06 extra digits Count 10 (.651 to .744) OK 10 L00D MGS 8110 - Descriptive Statistics 51 Precision of numerical results – state “3 Significant Digits” (continued) CORRECT 3 significant digits INCORRECT 3 decimal places Wt Wt Mean 76.5 Mean 76.506 Standard Error 4.78 Standard Error 4.780 Median 73.6 Median 73.580 Mode #N/A Mode #N/A Standard Deviation 15.1 Standard Deviation 15.117 Sample Variance 229 Sample Variance 228.514 Kurtosis -0.675 Kurtosis -0.675 Skewness 0.362 Skewness 0.362 Range 48.0 Range 47.980 Minimum 54.3 Minimum 54.280 Maximum 102 Maximum 102.260 Sum 765 Sum 765.060 Count 10 Count 10.000 L00D MGS 8110 - Descriptive Statistics 52 Some Great Rules of Thumb Data is Normally distributed (Bell shaped) if -1 skewness +1 if -1 kurtosis +1 and Data is a potential outlier Symmetric distribution xi < mean – 3s or Skewed distribution xi < Q1 – (1.5)RQ xi > mean + 3 s or xi > Q3 + (1.5)RQ L00D MGS 8110 - Descriptive Statistics 53 Prerequisite Spreadsheet Skills • • • • • • • • • • • • • • • • Cut, Copy, Paste & Paste Special Cell corner Copy Add or delete Rows or Columns Change width/height of row/column Font, alignment, boarder & number of cell Referencing and calculations with cells. Data / Sort Naming cell or range of cells Insert / Function / Average Sum, Max, Min, Count, Small and Large (Tools / Add-ins / Data analysis) Tools / Data Analysis / Descriptive statistics Single quote for equation statement. REPLACE command. DATA / Group & FORMAT: Column, Hide. Grab an entire column of data (CTRL+SHIFT, down arrow). L00D MGS 8110 - Descriptive Statistics 54