BA 555 Practical Business Analysis

Agenda
- Review of statistics: confidence interval estimation, hypothesis testing
- Linear regression analysis: introduction
- Case study: Cost of Manufacturing Computers
- Simple linear regression

The Empirical Rule (p.5)

For a mound-shaped distribution:
1. Approximately 68% of the observations fall within 1 standard deviation of the mean (between x̄ − s and x̄ + s).
2. Approximately 95% of the observations fall within 2 standard deviations of the mean (between x̄ − 2s and x̄ + 2s).
3. Approximately 99.7% of the observations fall within 3 standard deviations of the mean (between x̄ − 3s and x̄ + 3s).

[Figure: a normal curve with bands of 34%, 13.5%, 2.35%, and 0.15% on each side of the mean, from x̄ − 3s to x̄ + 3s.]

Review Example

Suppose the average hourly earnings of production workers over the past three years were reported to be $12.27, $12.85, and $13.39, with standard deviations of $0.15, $0.18, and $0.23, respectively. The average hourly earnings of production workers in your company also continued to rise over the past three years, from $12.72 in 2002 and $13.35 in 2003 to $13.95 in 2004. Assume that the distribution of hourly earnings for all production workers is mound-shaped. Have the earnings in your company become less and less competitive? Why or why not?

Year  Industry average  Industry std.  % increase  Company average  % increase  Z-score
2002  12.27             0.15           --          12.72            --          3.00
2003  12.85             0.18           4.73%       13.35            4.95%       2.77
2004  13.39             0.23           4.20%       13.95            4.50%       2.43

The Empirical Rule (continued)

The z-score generalizes the results of the empirical rule; the mound-shaped distribution justifies its use.

Sampling Distribution (p.6)

The sampling distribution of a statistic is the probability distribution of all possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.
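The sampling distribution just defined can be made concrete by simulation: draw many random samples from a decidedly non-normal population and look at how the sample means are distributed. A minimal Python sketch (the uniform population and all names here are our own illustration, not from the course materials):

```python
import random
import statistics

random.seed(42)  # reproducible illustration

# Draw many random samples of size n from a uniform(0, 1) population
# and record each sample's mean.
n, reps = 30, 2000
sample_means = [statistics.mean(random.uniform(0, 1) for _ in range(n))
                for _ in range(reps)]

# The population has mean 0.5 and sd sqrt(1/12) ~ 0.2887; the sample
# means should center at 0.5 with sd ~ 0.2887 / sqrt(30) ~ 0.053.
print(round(statistics.mean(sample_means), 3),
      round(statistics.stdev(sample_means), 3))
```

Even though the population is flat rather than mound-shaped, a histogram of `sample_means` looks normal, which is the point of the next slides.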
When the sample size is large, what is the sampling distribution of the sample mean, the sample proportion, the difference of two sample means, or the difference of two sample proportions? NORMAL!!!

Central Limit Theorem (CLT) (p.6)

If X ~ N(μ, σ²), then X̄ ~ N(μ_X̄ = μ, σ²_X̄ = σ²/n) for any sample size n. If X follows any distribution with mean μ and variance σ², then X̄ ~ N(μ, σ²/n) for large n.

[Figure: repeated samples of size n — Sample 1: X11, X12, …, X1n with mean X̄1; Sample 2: X21, X22, …, X2n with mean X̄2; and so on.]

Summary: Sampling Distributions
- The sampling distribution of a sample mean
- The sampling distribution of a sample proportion
- The sampling distribution of the difference between two sample means
- The sampling distribution of the difference between two sample proportions

Standard Deviations
- Population standard deviation: σ_X, or simply σ.
- Sample standard deviation: s_X, or simply s.
- Standard deviation of sample means (a.k.a. standard error): σ_X̄ = σ_X / √n, estimated by s_X / √n.
- Standard deviation of sample proportions (a.k.a. standard error): σ_p̂ = √(p(1 − p)/n), estimated by √(p̂(1 − p̂)/n).

Statistical Inference: Estimation

Research question: what is the parameter value? Draw a sample of size n from the population and apply the tools (i.e., formulas): a point estimator and an interval estimator.

Confidence Interval Estimation (p.7)

Example 1: Estimation for the population mean. A random sample of a company's weekly operating expenses for 48 weeks produced a sample mean of $5474 and a standard deviation of $764. Construct a 95% confidence interval for the company's mean weekly expenses.

Example 2: Estimation for the population proportion.

Statistical Inference: Hypothesis Testing

Research question: is the claim supported?
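Example 1 above works out numerically with the large-sample interval x̄ ± z · s/√n. A minimal sketch (the function name is ours, not from the course materials):

```python
import math

def mean_ci(xbar, s, n, z=1.96):
    """Large-sample 95% confidence interval for a population mean."""
    se = s / math.sqrt(n)   # estimated standard error of the sample mean
    margin = z * se         # margin of error
    return xbar - margin, xbar + margin

lo, hi = mean_ci(5474, 764, 48)
print(f"95% CI for mean weekly expenses: (${lo:.0f}, ${hi:.0f})")
```

The interval is roughly ($5258, $5690): with 95% confidence, the company's mean weekly operating expenses fall between about $5,258 and $5,690.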
Tools (i.e., formulas): the z or t statistic, computed from a sample of size n drawn from the population.

Hypothesis Testing (p.9)

Example: A bank has set a customer service goal that the mean waiting time for its customers will be less than 2 minutes. The bank randomly samples 30 customers and finds that the sample mean is 100 seconds. Assuming that the sample is from a normal distribution and the standard deviation is 28 seconds, can the bank safely conclude that the population mean waiting time is less than 2 minutes?

Setting Up the Rejection Region

Type I error: rejecting H0 (accepting Ha) when in fact H0 is true. A false alarm.

The P-Value of a Test (p.11)

The p-value, or observed significance level, is the smallest value of α for which the test results are statistically significant, i.e., for which "the conclusion of rejecting H0 can be reached."

Regression Analysis

A technique to examine the relationship between an outcome variable (the dependent variable, Y) and a group of explanatory variables (the independent variables, X1, X2, …, Xk). The model allows us to understand (quantify) the effect of each X on Y. It also allows us to predict Y based on X1, X2, …, Xk.

Types of Relationship

Linear relationships:
- Simple: Y = b0 + b1 X + e
- Multiple: Y = b0 + b1 X1 + b2 X2 + … + bk Xk + e
Nonlinear relationships, e.g.:
- Y = a0 exp(b1 X + e)
- Y = b0 + b1 X1 + b2 X1² + e
We will focus only on linear relationships.

Simple Linear Regression Model

Population: Y = b0 + b1 X + e, where b1 is the true effect of X on Y. Sample: Ŷ = b̂0 + b̂1 X, where b̂1 is the estimated effect of X on Y. Key questions:
1. Does X have any effect on Y?
2. If yes, how large is the effect?
3. Given X, what is the estimated Y?

Least Squares Method

The least squares line is Ŷ = b̂0 + b̂1 X. Least squares is a statistical procedure for finding the "best-fitting" straight line: it minimizes the sum of squared deviations of the observed values of Y from the predicted values Ŷ. [Figure: a good fit, with deviations minimized, contrasted with a bad fit.]
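The least squares computation itself is short. A minimal sketch for simple regression, using hypothetical toy data rather than the case data (the function name is ours):

```python
def least_squares(x, y):
    """Intercept and slope of the line minimizing the sum of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)                   # spread of X about its mean
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation of X and Y
    b1 = sxy / sxx                                        # estimated slope
    b0 = my - b1 * mx          # the fitted line passes through (x-bar, y-bar)
    return b0, b1

# Toy data: Y rises roughly linearly with X.
b0, b1 = least_squares([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(round(b0, 2), round(b1, 2))  # 0.15 0.94
```

For perfectly linear data the deviations are all zero; for real data the line is the one with the smallest possible sum of squared deviations.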
Case: Cost of Manufacturing Computers (pp. 13 – 45)

A manufacturer produces computers. The goal is to quantify the cost drivers and to understand the variation in production costs from week to week. The following production variables were recorded:
- COST: the total weekly production cost (in $ millions)
- UNITS: the total number of units (in 000s) produced during the week
- LABOR: the total weekly direct labor cost (in $10K)
- SWITCH: the total number of times that the production process was re-configured for different types of computers
- FACTA: = 1 if the observation is from factory A; = 0 if from factory B

Raw Data (p. 14)

[Table: 52 weekly observations of Case, FactA, Units (1000), Switch, Labor (10,000), and Cost (million); for example, Case 1: FactA = 1, Units = 1.104, Switch = 8, Labor = 5.591181, Cost = 1.155456.]

How many possible regression models can we build? (With four candidate explanatory variables there are 2⁴ − 1 = 15 subsets to choose from, before even considering transformations or interactions.)

Simple Linear Regression Model (pp. 17 – 26)

Question 1: Is Labor a significant cost driver? This question leads us to the model Cost = f(Labor) + e; specifically, Cost = b0 + b1 Labor + e.

Question 2: How well does this model perform? (How accurately can Labor predict Cost?) This question leads us to try other regression models and compare them.

Initial Analysis (pp. 15 – 16)

Summary statistics + plots (e.g., histograms and scatter plots) + correlations. Things to look for:
- From summary statistics and graphs: features of the data (e.g., data range, outliers). We do not want to extrapolate outside the data range, because the relationship there is unknown (or un-established).
- From scatter plots and correlations: Is the assumption of linearity appropriate? Is there inter-dependence among the variables? Any potential problems?

Correlation (p. 15)

ρ (rho): the population correlation (its value is most likely unknown).
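A correlation is computed from paired data. A minimal sketch of the sample correlation (Pearson's r), using hypothetical toy data rather than the case data (the function name is ours):

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r; always falls between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-variation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data showing a strong positive linear relationship.
labor = [4.5, 5.0, 5.5, 6.0, 6.5]
cost = [1.11, 1.12, 1.13, 1.13, 1.15]
print(round(pearson_r(labor, cost), 2))  # roughly 0.96
```

Values near ±1 indicate a strong linear relationship; values near 0 indicate no linear relationship (though, as the slides note, a nonlinear relationship may still exist).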
r: the sample correlation (its value can be calculated from the sample). Correlation is a measure of the strength of the linear relationship between two variables and always falls between −1 and 1. There is no linear relationship when the correlation is close to 0 — but a correlation near 0 does not rule out a strong nonlinear relationship. [Figure: scatter plots illustrating r = −1, −1 < r < 0, r = 0, 0 < r < 1, and r = 1.]

Correlation (p. 15)

Each cell below shows the sample correlation, the sample size, and the p-value for H0: ρ = 0 vs. Ha: ρ ≠ 0. (Is 0.9297 a ρ or an r? It is an r: it was calculated from the sample.)

         Cost                   Units                  Switch
Units    0.9297 (52) p=0.0000
Switch  −0.0232 (52) p=0.8706  −0.1658 (52) p=0.2402
Labor    0.4520 (52) p=0.0008   0.4603 (52) p=0.0006   0.1554 (52) p=0.2714

Fitted Model (Least Squares Line) (p.18)

Regression Analysis – Linear model: Y = a + b*X
Dependent variable: Cost. Independent variable: Labor.

Parameter  Estimate    Standard Error  T Statistic  P-Value
Intercept  1.08673     0.0127489       85.2409      0.0000
Slope      0.00810182  0.00226123      3.58293      0.0008

Analysis of Variance
Source         Sum of Squares  Df  Mean Square  F-Ratio  P-Value
Model          0.00231465      1   0.00231465   12.84    0.0008
Residual       0.00901526      50  0.000180305
Total (Corr.)  0.0113299       51

The fitted line is Ŷ = 1.08673 + 0.0081 X. Each T statistic is (Estimate − 0) / Standard Error, and the slope row tests H0: b1 = 0 vs. Ha: b1 ≠ 0. Divide the p-value by 2 for a one-sided test — but make sure there is at least weak evidence in the claimed direction before doing so. Degrees of freedom = n − k − 1, where n = sample size and k = number of Xs.

Hypothesis Testing and Confidence Interval Estimation for b1 (pp. 19 – 20)

Q1: Does Labor have any impact on Cost? → Hypothesis testing.
Q2: If so, how large is the impact? → Confidence interval estimation.

(These slides reuse the regression output above: slope estimate b̂1 = 0.00810182 with standard error 0.00226123, and degrees of freedom n − k − 1 = 50, where k = number of independent variables.)
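The slope test and interval can be re-derived from the printed estimate and standard error. A short sketch (the critical value 2.009 is t(.025) with 50 degrees of freedom, taken from the t table at the end of these notes; the variable names are ours):

```python
# Slope estimate and standard error from the Labor model output.
estimate, se = 0.00810182, 0.00226123

t_stat = (estimate - 0) / se      # tests H0: b1 = 0 vs Ha: b1 != 0
t_crit = 2.009                    # t(.025), n - k - 1 = 50 degrees of freedom

reject_h0 = abs(t_stat) > t_crit  # True: Labor is a significant cost driver
ci = (estimate - t_crit * se, estimate + t_crit * se)  # 95% CI for b1
print(round(t_stat, 3), reject_h0, tuple(round(v, 5) for v in ci))
```

The t statistic reproduces the printed 3.58293, and the 95% interval for b1 excludes 0, which is the interval-estimation answer to Q2.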
Analysis of Variance (p. 21)

Source         Sum of Squares  Df  Mean Square  F-Ratio  P-Value
Model          0.00231465      1   0.00231465   12.84    0.0008
Residual       0.00901526      50  0.000180305
Total (Corr.)  0.0113299       51

The F test is not very useful in simple regression, but it is useful in multiple regression.
- Syy = SS Total = Σ (yi − ȳ)² = 0.0113299
- SSR = SS of the regression model = Σ (ŷi − ȳ)² = 0.00231465
- SSE = SS of error = Σ (yi − ŷi)² = 0.00901526
- SS Total = SS Model + SS Error

Sum of Squares (p. 22)

- Syy = the total variation in Y.
- SSE = the remaining variation that cannot be explained by the model.
- SSR = Syy − SSE = the variation in Y that has been explained by the model.

Fit Statistics (pp. 23 – 24)

Correlation Coefficient = 0.45199
R-squared = 0.45199 × 0.45199 = 0.204295, i.e., 20.4295 percent
R-squared (adjusted for d.f.) = 18.8381 percent
Standard Error of Est. = 0.0134278

R-squared = SS Model / SS Total = 1 − SS Residual / SS Total. Standard Error of Est. = √(MS Residual) = √(SS Residual / (n − k − 1)).

Prediction (pp. 25 – 26)

What is the predicted production cost of a given week, say Week 21 of the year, in which Labor = 5 (i.e., $50,000)? Point estimate: predicted cost = b0 + b1(5) = 1.0867 + 0.0081(5) = 1.12724 (million dollars). Margin of error? → Prediction interval.

What is the average production cost of a typical week in which Labor = 5? Point estimate: estimated cost = b0 + b1(5) = 1.0867 + 0.0081(5) = 1.12724 (million dollars). Margin of error? → Confidence interval.

100(1 − α)% prediction interval:
ŷ ± t(α/2) × (Standard Error of Est.) × √(1 + 1/n + (xg − x̄)² / ((n − 1) × variance of X))

100(1 − α)% confidence interval:
ŷ ± t(α/2) × (Standard Error of Est.) × √(1/n + (xg − x̄)² / ((n − 1) × variance of X))

Prediction vs. Confidence Intervals (pp. 25 – 26)

X    Predicted Y  95% Prediction Limits  95% Confidence Limits
3.0  1.11103      1.08139 – 1.14067      1.09874 – 1.12332
4.0  1.11913      1.09098 – 1.14729      1.11105 – 1.12722
5.0  1.12724      1.09988 – 1.15459      1.12267 – 1.13180
6.0  1.13534      1.10804 – 1.16263      1.13113 – 1.13954

(The X = 5.0 row answers the Week 21 question above.) [Figure: 95% prediction and confidence bands for Cost ($ million) against Labor ($10,000).] The variation (margin of error) at both ends of the X range seems larger. Implication?

Another Simple Regression Model: Cost = b0 + b1 Units + e (p. 27)

Regression Analysis – Linear model: Y = a + b*X

Parameter  Estimate  Standard Error  T Statistic  P-Value
Intercept  0.849536  0.0158346       53.6506      0.0000
Slope      0.281984  0.0157938       17.8541      0.0000

Analysis of Variance
Source         Sum of Squares  Df  Mean Square   F-Ratio  P-Value
Model          0.00979373      1   0.00979373    318.77   0.0000
Residual       0.00153618      50  0.0000307235
Total (Corr.)  0.0113299       51

Correlation Coefficient = 0.929739; R-squared = 86.4414 percent; R-squared (adjusted for d.f.) = 86.1702 percent; Standard Error of Est. = 0.00554288. [Figure: 95% prediction and confidence bands for Cost ($ million) against Units (1000).] A better model? Why?

Statgraphics

Simple regression analysis: Relate / Simple Regression; X = independent variable, Y = dependent variable. For prediction, click on the Tabular option icon and check Forecasts; right-click to change the X values.

Multiple regression analysis: Relate / Multiple Regression. For prediction, enter the values of the Xs in the Data Window and leave the corresponding Y blank, then click on the Tabular option icon and check Reports.
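The fit statistics for the Labor model can be re-derived from its ANOVA sums of squares. A minimal sketch using the printed values (variable names are ours):

```python
import math

# Sums of squares from the Labor model's ANOVA table.
ss_model, ss_resid, ss_total = 0.00231465, 0.00901526, 0.0113299
n, k = 52, 1   # 52 weekly observations, one independent variable

r_squared = ss_model / ss_total              # share of variation explained
adj_r_squared = 1 - (ss_resid / (n - k - 1)) / (ss_total / (n - 1))
se_est = math.sqrt(ss_resid / (n - k - 1))   # Standard Error of Est.

print(f"R-squared = {100 * r_squared:.4f}%")               # ~20.4295%
print(f"Adjusted R-squared = {100 * adj_r_squared:.4f}%")  # ~18.8381%
print(f"Standard Error of Est. = {se_est:.7f}")            # ~0.0134278
```

All three values match the Statgraphics output on the slides, and running the same arithmetic on the Units model's sums of squares reproduces its much higher R-squared of about 86%.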
Appendix: Normal Probabilities

[Table: standard normal probabilities P(0 ≤ Z ≤ z) for z = 0.00 to 3.09 in steps of 0.01. For example, the entry for z = 1.96 is .4750 and the entry for z = 3.00 is .4987.]

Appendix: Critical Values of t

[Table: critical values t(α) for α = .100, .050, .025, .010, and .005, for degrees of freedom 1–30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 120, 140, 160, 180, 200, and ∞. For example, t(.025) = 2.009 with 50 degrees of freedom and t(.025) = 1.960 with ∞ degrees of freedom.]
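The normal-table entries give P(0 ≤ Z ≤ z), the area between 0 and z under the standard normal curve, and can be reproduced with Python's standard library (a sketch; the helper name is ours):

```python
from statistics import NormalDist

def table_entry(z):
    """Area under the standard normal curve between 0 and z."""
    return NormalDist().cdf(z) - 0.5

for z in (1.00, 1.96, 3.00):
    print(z, round(table_entry(z), 4))
```

The printed values match the table entries .3413, .4750, and .4987 — the same areas behind the empirical rule's 68%, 95%, and 99.7% statements.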