An Integrated Approach to Improving Quality and Efficiency

Chapter 7: Using Data and Statistical Tools for Operations Improvement

Copyright 2012 Health Administration Press

Using Data and Statistical Tools for Operations Management
• Data collection
• Graphical tools
• Mathematical descriptions
• Probability and probability distributions
• Confidence intervals, hypothesis tests
• ANOVA/MANOVA/MANCOVA
• Regression

Data Collection
• Validity: A valid study has no logic, sampling, or measurement errors.
  - Logic
  - Selection or sampling
  - Measurement

Data Collection
[Figure: data collection diagram. Diagram created in Inspiration by Inspiration Software, Inc.]

Data Collection: Logic
• Why are the data needed?
• What will the data be used for?
• What questions are going to be asked of the data?
• Are the patterns of the past going to be repeated in the future?

Data Collection: Selection or Sampling
• Census versus sample
• Nonrandom methods
• Simple random sampling
• Stratified sampling
• Systematic or sequential sampling
• Cluster or area sampling
• Sample size

Data Collection: Measurement
• Accuracy
  - Does the measurement measure what we want it to measure (i.e., say = do)?
• Precision
  - How precise should the measurements be?
• Reliability
  - Would the measurement be the same if we repeated it?
[Figure: three targets illustrating measurement quality, labeled "Reliable, but not accurate," "Reliable and accurate," and "Not reliable, but accurate"]

Graphical Tools
• Mapping
• Visual representations of data
• Histograms and Pareto charts
• Stem plots, dot plots
• Box (and whisker) plots
• Normal probability plots

Graphical Tools: Histograms and Pareto Charts
[Figure: histogram of Length of Hospital Stay (x-axis: length of hospital stay in days, in two-day bins from 1-2 to 17-18; y-axis: frequency) and Pareto chart of Diagnosis Category (x-axis: diagnosis, including heart disease, delivery, pneumonia, malignant neoplasms, psychoses, and fractures; y-axis: frequency). Microsoft Excel screen shots reprinted with permission from Microsoft Corporation.]

Graphical Tools: Dot Plots
[Figure: dot plot of Length of Hospital Stay (x-axis: days, 3 to 18). Produced with Minitab statistical software.]

Graphical Tools: Turnip Graph
[Figure: turnip graph of the percentage of diabetic Medicare enrollees receiving eye exams among 306 hospital referral regions (2001). Source: Wennberg, J. E. 2005. Data from the Dartmouth Atlas Project. Figure copyrighted by the Trustees of Dartmouth College. Used with permission.]

Graphical Tools: Normal Probability Plots
[Figure: normal probability plot of Length of Hospital Stay (x-axis: observed cumulative probability, 0.00 to 1.00). Produced with SPSS for Windows.]

Graphical Tools: Scatter Plots
[Figure: four scatter plots of Y against X. Strong positive correlation, r = 0.91; strong negative correlation, r = -0.86; positive correlation, r = 0.70; no correlation, r = 0.06. Microsoft Excel screen shots reprinted with permission from Microsoft Corporation.]

Mathematical Descriptions: Mean
• The mean is the arithmetic average of the population:
\[ \text{Population mean} = \mu = \frac{\sum x}{N} \]
where $x$ = individual values and $N$ = number of values in the population.
• The population mean can be estimated from a sample:
\[ \text{Sample mean} = \bar{x} = \frac{\sum x}{n} \]
where $n$ = number of values in the sample.
For our simple data set,
\[ \bar{x} = \frac{3 + 6 + 8 + 5 + 3}{5} = 5 \]

Mathematical Descriptions: Median and Mode
• The median is the middle value of the sample or population. If the data are arranged into an array (an ordered data set), 3, 3, 5, 6, 8, then 5 is the middle value, or median.
• The mode is the most frequently occurring value. In the above example, the value 3 occurs more often (two times) than any other value, so 3 is the mode.

Mathematical Descriptions: Range and Mean Absolute Deviation
• The range is the difference between the high and low values in a data set.
\[ \text{Range} = x_{\text{high}} - x_{\text{low}} = 8 - 3 = 5 \]
• The mean absolute deviation (MAD) is the average of the absolute values of the differences from the mean.
\[ \text{MAD} = \frac{\sum |x - \bar{x}|}{n} = \frac{2 + 2 + 0 + 1 + 3}{5} = \frac{8}{5} = 1.6 \]

Mathematical Descriptions: Variance, Standard Deviation
• The variance is the average squared difference from the mean.
\[ \text{Population variance} = \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{4 + 4 + 0 + 1 + 9}{5} = \frac{18}{5} = 3.6 \]
\[ \text{Sample variance} = s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{4 + 4 + 0 + 1 + 9}{5 - 1} = \frac{18}{4} = 4.5 \]
• The standard deviation is the square root of the variance.
\[ \text{Population standard deviation} = \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = \sqrt{\frac{18}{5}} = \sqrt{3.6} \approx 1.9 \]
\[ \text{Sample standard deviation} = s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{18}{4}} = \sqrt{4.5} \approx 2.1 \]

Mathematical Descriptions: Coefficient of Variation
The coefficient of variation (CV) is a measure of the relative variation in the data. It is the standard deviation divided by the mean.
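The descriptive statistics defined so far can be checked numerically for the sample 3, 6, 8, 5, 3. The following is a minimal sketch using Python's standard `statistics` module; the variable names (`data`, `x_bar`, `mad`, `cv`) are illustrative choices, not part of the slides, and the CV is computed in its population form (population standard deviation divided by the mean).

```python
from statistics import mean, median, mode, pstdev, pvariance, stdev, variance

data = [3, 6, 8, 5, 3]  # example data set used on the slides

x_bar = mean(data)                                   # arithmetic mean
data_range = max(data) - min(data)                   # high value minus low value
mad = sum(abs(x - x_bar) for x in data) / len(data)  # mean absolute deviation
cv = pstdev(data) / x_bar                            # coefficient of variation (population form)

print("mean:", x_bar)
print("median:", median(data))
print("mode:", mode(data))
print("range:", data_range)
print("MAD:", mad)
print("population variance:", pvariance(data))        # divides by N
print("sample variance:", variance(data))             # divides by n - 1
print("population std dev:", round(pstdev(data), 2))
print("sample std dev:", round(stdev(data), 2))
print("CV:", round(cv, 2))
```

Note that `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by n - 1, which mirrors the population and sample formulas above.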
\[ CV = \frac{\sigma}{\mu} \text{ or } \frac{s}{\bar{x}}; \quad \text{for our data, } CV = \frac{1.9}{5} \approx 0.4 \]

Probability and Probability Distributions
• Determination of probabilities
• Properties of probabilities
• Probability distributions
• Discrete probability distributions
• Continuous probability distributions

Determination of Probabilities: Observed Probability
Observed probability is the relative frequency of an event: the number of times the event occurred divided by the total number of trials.
\[ P(A) = \frac{\text{Number of times } A \text{ occurred}}{\text{Total number of observations, trials, or experiments}} = \frac{r}{n} \]
\[ P(\text{drug is effective}) = \frac{\text{Number of times patients are cured}}{\text{Total number of patients given the drug}} = \frac{r}{n} \]

Determination of Probabilities: Theoretical Probability
Theoretical probability is the theoretical relative frequency of an event: the theoretical number of times an event will occur divided by the total number of possible outcomes.
\[ P(A) = \frac{\text{Number of times } A \text{ could occur}}{\text{Total number of possible outcomes}} = \frac{r}{n} \]
\[ P(\text{card is a spade}) = \frac{\text{Number of spades in the deck}}{\text{Total number of cards in the deck}} = \frac{13}{52} = 0.25 \]

Determination of Probabilities: Opinion Probability
Opinion probability is a subjective determination of the number of times an event will occur divided by the imaginary total number of possible outcomes or trials.
\[ P(A) = \frac{\text{Opinion of number of times an event will occur}}{\text{Theoretical total}} = \frac{r}{n} \]
\[ P(\text{Secretariat winning the Belmont Stakes}) = \frac{\text{Opinion on the number of times Secretariat would win the Belmont}}{\text{Imaginary total number of times the Belmont would be run}} = \frac{r}{n} \]

Properties of Probabilities: Bounds on Probability
• Probabilities must always be ≥ 0, and an event that cannot occur has a probability of 0.
\[ P(A) = \frac{\text{Least number of times } A \text{ could occur}}{\text{Total number of possible outcomes}} = \frac{0}{\text{Any number}} = 0 \]
• Probabilities must always be ≤ 1.
\[ P(A) = \frac{\text{Greatest number of times } A \text{ could occur}}{\text{Total number of possible outcomes}} = \frac{n}{n} = 1 \]
\[ 0 \le P(A) \le 1 \]
• P(A) + P(A') = 1 and 1 − P(A') = P(A), where A' is not A.

Properties of Probabilities: Multiplicative Property
For two independent events, the probability of both A and B occurring, or the intersection (∩) of A and B, is the probability of A occurring times the probability of B occurring.
\[ P(A \text{ and } B \text{ occurring}) = P(A \cap B) = P(A) \times P(B) \]

Properties of Probabilities: Multiplicative Property
[Figure: probability tree. Start branches to a coin toss, H or T, with P(H) = 1/2; each branch then splits into die tosses 1 through 6, with P(3) = 1/6. Each of the 12 outcomes has probability 1/12.]
\[ P(H) \times P(3) = P(H \cap 3) = \frac{1}{2} \times \frac{1}{6} = \frac{1}{12} \]

Properties of Probabilities: Additive Property
• For two events, the probability of A or B occurring, or the union (∪) of A with B, is the probability of A occurring plus the probability of B occurring, minus the probability of both A and B occurring.
\[ P(A \text{ or } B \text{ occurring}) = P(A \cup B) = P(A) + P(B) - P(A \cap B) \]

Properties of Probabilities: Additive Property
[Figure: the same coin-and-die probability tree, with P(H) = 1/2, P(3) = 1/6, and each of the 12 outcomes at probability 1/12.]
\[ P(H \cup 3) = P(H) + P(3) - P(H \cap 3) = \frac{1}{2} + \frac{1}{6} - \frac{1}{12} = \frac{7}{12} \]

Properties of Probabilities: Conditional Probability
The probability of an event occurring if more information is obtained:
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

Contingency Table for ER Wait Times
                 ≤30-minute wait   >30-minute wait   Total
Friday night           20                30            50
Other times            40                10            50
Total                  60                40           100

Properties of Probabilities: Conditional Probability
• Note that:
\[ P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \]
and if one event has no effect on the other event (the events are independent), then
\[ P(A \mid B) = P(A) \quad \text{and} \quad P(A \cap B) = P(A)\,P(B) \]
• Bayes' theorem
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A')\,P(A')} \]

Confidence Intervals, Hypothesis Testing
• Central limit theorem
• Hypothesis testing
• Type I (α) and Type II (β) errors
• t-tests
• Proportions
• Practical significance versus statistical significance

Confidence Intervals, Hypothesis Testing: Central Limit Theorem
• As the sample size becomes large, the sampling distribution of the mean approaches normality, no matter what the distribution of the original variable, and
\[ \mu_{\bar{x}} = \mu \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
[Link on slide: Sampling Distribution Simulation]

Confidence Intervals
Confidence interval for the true value of the population mean:
\[ \bar{x} - z^*_{\alpha/2}\,\sigma_{\bar{x}} \le \mu \le \bar{x} + z^*_{\alpha/2}\,\sigma_{\bar{x}} \]
\[ \bar{x} - z^*_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z^*_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \]
[Figure: standard normal curve, P(X) against Z from -3 to 3, with the central 95% shaded and 2.5% in each tail]

Hypothesis Testing
• Belief or null hypothesis, $H_0$: $\mu = b$
• Alternate belief or hypothesis, $H_a$: $\mu \ne b$
• Decision rule: If $|z| > z^*$, reject the null hypothesis, where
\[ z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]
[Figure: standard normal curve, P(X) against Z from -3 to 3. The region -Z* < Z < Z* corresponds to 95% confidence; the tails Z < -Z* and Z > Z* are the rejection regions.]

Hypothesis Testing: Type I (α) and Type II (β) Errors
$H_0$: $\mu_1 = \mu_2$; $H_a$: $\mu_1 \ne \mu_2$

Type I and Type II Error: Clinic Wait Time Example
                                                   Reality: wait times at the two      Reality: wait times at the two
                                                   clinics are the same (μ1 = μ2)      clinics are NOT the same (μ1 ≠ μ2)
Assessment or guess: wait times are the same       No error                            Type II or β error
(μ1 = μ2)
Assessment or guess: wait times are NOT the same   Type I or α error                   No error
(μ1 ≠ μ2)

Equal Variance t-Test
• t-tests are used to test hypotheses about two means.
• $H_0$: $\mu_1 = \mu_2$; $H_a$: $\mu_1 \ne \mu_2$
• Decision rule: If $|t| > t^*$, reject $H_0$
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{where} \quad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
• Confidence interval
\[ (\bar{x}_1 - \bar{x}_2) - t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le \mu_1 - \mu_2 \le (\bar{x}_1 - \bar{x}_2) + t^* s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]

Proportions
$H_0$: $\pi_1 = \pi_2$; $H_a$: $\pi_1 \ne \pi_2$
Decision rule: If $|z| > z^*$, reject $H_0$
\[ z = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}}} \quad \text{where} \quad \bar{p} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} \]
Confidence interval
\[ (p_1 - p_2) - z^* \sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}} \le \pi_1 - \pi_2 \le (p_1 - p_2) + z^* \sqrt{\frac{\bar{p}(1 - \bar{p})}{n_1} + \frac{\bar{p}(1 - \bar{p})}{n_2}} \]

Practical Significance Versus Statistical Significance
• Basic confidence interval:
\[ \text{statistic} - [z^* \times \text{s.e.(statistic)}] \le \text{parameter} \le \text{statistic} + [z^* \times \text{s.e.(statistic)}] \]
• As n increases, the standard error (s.e.) decreases and the confidence interval gets narrower.
• Large samples may give statistically significant results that are not practically significant.

ANOVA/MANOVA/MANCOVA
• One-way ANalysis Of VAriance (ANOVA) is used to test hypotheses about three or more levels of treatment. A t-test will give the same information as an ANOVA when there are only two treatment levels of interest.
• Two-way and higher ANOVAs are used when there is more than one type of treatment variable of interest.
• MANOVA/MANCOVA are used when there is more than one outcome or dependent variable of interest.

Regression
• Simple linear regression: used to describe the relationship between two variables
• Multiple regression: used to describe the relationship between multiple predictor variables and a single dependent variable
• General linear model
• Artificial neural networks
• Design of experiments

What Is the Equation of a Line?
Algebra: $y = mx + b$
Statistics: $\hat{Y} = bX + a$
where
\[ b = \text{slope} = \frac{\text{rise}}{\text{run}} = \frac{\Delta y}{\Delta x}, \qquad a = y\text{-intercept} = y \text{ when } x = 0 \]

Problem
Student A owns a health insurance firm and wants us to determine the cost (price would be a more difficult problem) of providing healthcare to insured individuals.

Seeing the Future
• Data
• Judgment: To what degree are these experiences still relevant?
  - Experiences are relevant
  - Experiences are irrelevant
• Deductive reasoning versus inductive reasoning

What Is the Cost of Healthcare Related To?
Quantitative: ______________
Qualitative: ______________

Selection
• Define population
• Census or sample
• Type of sample
• Measurement: accurate, reliable, precise?
  X = number of dependents; Y = annual healthcare expense ($1,000)
• Is the study valid?
• How do we create knowledge from data?

Data
Number of Dependents (X)   Annual Healthcare Expense, $1,000 (Y)
0                          3
1                          2
2                          6
3                          7
4                          7

Scatterplot
[Figure: scatterplot of Y, annual healthcare cost ($1,000), against X, number of dependents, with four candidate lines: y = 1.3x + 2.4, y = x + 3, y = 5, and y = 1.2x + 2]

Scatterplot Questions
• Which is the “best” line on the scatterplot?
• How would you define “best” (e.g., must be quantifiable)?
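One possible way to make "best" quantifiable, offered here as an illustrative sketch rather than as the deck's own answer, is to compute each candidate line's errors (observed Y minus predicted Y) and compare their sum and their sum of squares. The function name `residuals` and the dictionary `candidates` are assumptions introduced for this example.

```python
# Data from the slides: X = number of dependents, Y = annual expense ($1,000)
X = [0, 1, 2, 3, 4]
Y = [3, 2, 6, 7, 7]

# Candidate lines drawn on the scatterplot, stored as (slope, intercept)
candidates = {
    "y = 1.3x + 2.4": (1.3, 2.4),
    "y = x + 3": (1.0, 3.0),
    "y = 5": (0.0, 5.0),
    "y = 1.2x + 2": (1.2, 2.0),
}


def residuals(slope, intercept):
    """Return the errors e = Y - Yhat for the line Yhat = slope * X + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(X, Y)]


for label, (slope, intercept) in candidates.items():
    e = residuals(slope, intercept)
    sum_e = sum(e)                   # zero means the line is unbiased
    sum_e2 = sum(r * r for r in e)   # smaller means a closer fit
    print(f"{label:15s} sum of e = {sum_e:5.2f}   sum of e^2 = {sum_e2:5.2f}")
```

Several of the lines have errors that sum to zero, so the sum of errors alone cannot separate them; the sum of squared errors can.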
Professor’s Model
\[ \hat{Y} = bX + a \]
• $\hat{Y}$ = cost estimate ($1,000)
• $a$ = Y-intercept = 3
• $b$ = slope = $\Delta Y / \Delta X$ = 1
\[ \hat{Y} = 1X + 3 \quad \text{(knowledge)} \]

Model Comparison
X       Y    Yhat = X + 3   Prof's e = Y − Yhat   Student 1 e (Yhat = 1.2X + 2)   Student 2 e (Yhat = 1.3X + 2.4)
0       3    3               0                     1.0                             0.6
1       2    4              −2                    −1.2                            −1.7
2       6    5               1                     1.6                             1.0
3       7    6               1                     1.4                             0.7
4       7    7               0                     0.2                            −0.6
(sum)                        0                     3.0                             0.0

Good Model
• A good model must be unbiased: $\sum e = 0$.
• Is that enough? What else? Does this remind you of $\sigma^2$?
• How do we get rid of signs?

Model Comparison
X       Y    Yhat = X + 3   e = Y − Yhat   e²   Student 1 e²
0       3    3               0             0    1.00
1       2    4              −2             4    1.44
2       6    5               1             1    2.56
3       7    6               1             1    1.96
4       7    7               0             0    0.04
(sum)   25   25              0             6    7.00

Least Squares Technique
Gauss proved that if you use
\[ b = \frac{\sum (Y - \bar{Y})(X - \bar{X})}{\sum (X - \bar{X})^2} \quad \text{and} \quad a = \bar{Y} - b\bar{X} \]
you are guaranteed that $\sum e = 0$ and $\sum e^2$ is a minimum.
For our data: Yhat = 1.3X + 2.4, $\sum e = 0$, and $\sum e^2 = 5.1$.

Coefficient of Determination
Are we better off making estimates by using information (X = number of dependents) and having created knowledge (Yhat = 1.3X + 2.4) than using no information or knowledge (i.e., is the model “better”)?
How would you estimate without using our knowledge (our model)?

Sum of Squares Total
X       Y    Yhat = Ybar   e = Y − Ybar   SSTO: (Y − Ybar)²
0       3    5             −2             4
1       2    5             −3             9
2       6    5              1             1
3       7    5              2             4
4       7    5              2             4
(sum)   25   25             0             22
Note that this method is unbiased.
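The least squares formulas on the Least Squares Technique slide can be applied directly to the five data points. The sketch below is a plain-Python illustration; the function name `least_squares` and the variable names are assumptions made for this example.

```python
# Data from the slides: X = number of dependents, Y = annual expense ($1,000)
X = [0, 1, 2, 3, 4]
Y = [3, 2, 6, 7, 7]


def least_squares(xs, ys):
    """Return (b, a) for the least squares line Yhat = b * X + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b = sum((Y - Ybar)(X - Xbar)) / sum((X - Xbar)^2)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar  # a = Ybar - b * Xbar
    return b, a


b, a = least_squares(X, Y)
errors = [y - (b * x + a) for x, y in zip(X, Y)]
sum_e = sum(errors)                  # zero, up to floating-point rounding
sse = sum(e * e for e in errors)     # minimized sum of squared errors

y_bar = sum(Y) / len(Y)
ssto = sum((y - y_bar) ** 2 for y in Y)  # squared error when every estimate is Ybar

print(f"Yhat = {b:.1f}X + {a:.1f}")
print(f"sum of e = {sum_e:.2f}, sum of e^2 = {sse:.1f}, SSTO = {ssto:.0f}")
```

Comparing `sse` with `ssto` shows how much squared error remains when the fitted line is used instead of estimating every case with the mean.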
Graph
[Figure: scatterplot of Y, annual healthcare cost ($1,000), against X, number of dependents, with the horizontal line y = 5]

Errors
[Figure: scatterplot of Y, annual healthcare costs ($1,000), against X, number of dependents, showing the errors]

Sum of Squares Error
X       Y    Yhat = 1.3X + 2.4   e = Y − Yhat   SSE: e² = (Y − Yhat)²   Ybar   Y − Ybar   SSTO: (Y − Ybar)²
0       3    2.4                  0.6           0.36                    5      −2         4
1       2    3.7                 −1.7           2.89                    5      −3         9
2       6    5.0                  1.0           1.00                    5       1         1
3       7    6.3                  0.7           0.49                    5       2         4
4       7    7.6                 −0.6           0.36                    5       2         4
(sum)   25   25                   0             5.1                     25      0         22

Coefficient of Determination
What is the percentage of improvement when we use knowledge gained from our model?
\[ \%\ \text{improvement} = \frac{\text{Old error level} - \text{New error level}}{\text{Old error level}} = \frac{22 - 5.1}{22} = \frac{16.9}{22} \times 100 = 77\% \]
$r^2$ = coefficient of determination = 77%, or $r^2 = 0.77$

Another Viewpoint
Variation in healthcare cost is either explained by knowledge (the model) or not explained.

Explained and Unexplained Error
[Figure: scatterplot of Y, annual healthcare costs ($1,000), against X, number of dependents. Dashed segments mark explained error; solid segments mark unexplained error.]

Sum of Squares Regression
X       Y    Yhat = 1.3X + 2.4   e = Y − Yhat   SSE: (Y − Yhat)²   Ybar   Y − Ybar   SSTO: (Y − Ybar)²   Yhat − Ybar   SSR: (Yhat − Ybar)²
0       3    2.4                  0.6           0.36               5      −2         4                   −2.6          6.76
1       2    3.7                 −1.7           2.89               5      −3         9                   −1.3          1.69
2       6    5.0                  1.0           1.00               5       1         1                    0            0
3       7    6.3                  0.7           0.49               5       2         4                    1.3          1.69
4       7    7.6                 −0.6           0.36               5       2         4                    2.6          6.76
(sum)   25   25                   0             5.1                25      0         22                   0            16.9

Coefficient of Determination
\[ r^2 = \frac{\text{Explained}}{\text{Total}} = \frac{SSR}{SSTO} = \frac{16.9}{22.0} = 0.77 \]
Note: $r^2$ is not based on statistics or probability; it is just a percentage.
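The sums of squares in the tables above can be reproduced with a few lines of code. This is a minimal sketch that takes the slope 1.3 and intercept 2.4 from the slides as given; the variable names (`sse`, `ssr`, `ssto`, `r2`) follow the slide abbreviations, and `r2_from_sse` is an extra name introduced here.

```python
# Data and fitted line from the slides
X = [0, 1, 2, 3, 4]
Y = [3, 2, 6, 7, 7]
b, a = 1.3, 2.4  # Yhat = 1.3X + 2.4

y_bar = sum(Y) / len(Y)
y_hat = [b * x + a for x in X]

sse = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))  # unexplained: (Y - Yhat)^2
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)         # explained: (Yhat - Ybar)^2
ssto = sum((y - y_bar) ** 2 for y in Y)              # total: (Y - Ybar)^2

r2 = ssr / ssto                # explained share of total variation
r2_from_sse = 1 - sse / ssto   # same value, seen as percentage improvement

print(f"SSE = {sse:.1f}, SSR = {ssr:.1f}, SSTO = {ssto:.1f}")
print(f"r^2 = {r2:.4f} (also {r2_from_sse:.4f})")
```

For a least squares line, SSR + SSE = SSTO, which is why the "percentage improvement" view and the "explained over total" view give the same number.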
Correlation Coefficient
\[ r = \pm\sqrt{r^2} \]
r = correlation coefficient = measure of the strength of the linear relationship between two variables
\[ -1 \le r \le 1 \]
[Figure: two scatter plots, one with r = −1 and one with r = +1]

Correlation Coefficient Examples
[Figure: three scatter plots with r = 0.0, r = 0.9, and r = −0.5]

Coefficient of Determination
Questions:
• If $r^2$ is low, does that mean there is no relationship between your variables?
• If $r^2$ is high (close to 1), does that mean you always get useful predictions from your model?
• If $r^2$ is high, does that mean your model has a “good” fit?

r² and Curves
[Figure: scatter plot of Y against X following a curved pattern]
• Can we fit a straight line to this?
• Yes, and we are guaranteed that the errors sum to zero and are a minimum.
• However, a curve would be better.

Excel Output
To get this sheet, go to Tools -> Data Analysis -> Regression. If you don't have Data Analysis listed in your tools, see Excel help, “Install and Use the Analysis ToolPak.”

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.8765
R Square             0.7682
Adjusted R Square    0.6909
Standard Error       0.8790
Observations         5

ANOVA
             df   SS       MS       F        Significance F
Regression   1    7.6818   7.6818   9.9412   0.0511
Residual     3    2.3182   0.7727
Total        4    10

                                        Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%   Lower 90.0%   Upper 90.0%
Intercept                               -0.9545        1.0162           -0.9393   0.4169    -4.1885     2.2794      -3.3460       1.4369
Y: $1,000 Annual Healthcare Expense      0.5909        0.1874            3.1530   0.0511    -0.0055     1.1873       0.1499       1.0320

RESIDUAL OUTPUT
Observation   Predicted X (Number of Dependents)   Residuals   Standard Residuals
1             0.8182                               -0.8182     -1.0747
2             0.2273                                0.7727      1.0150
3             2.5909                               -0.5909     -0.7762
4             3.1818                               -0.1818     -0.2388
5             3.1818                                0.8182      1.0747

PROBABILITY OUTPUT
Percentile   X (Number of Dependents)
10           0
30           1
50           2
70           3
90           4

[Figure: Residual Plot, residuals against Y, $1,000 annual healthcare expense]
[Excel output, continued: the ANOVA table (rows Regression, Residual, Total; column df) and two charts, a Line Fit Plot of X, number of dependents, against Y, $1,000 annual healthcare expense, and a Normal Probability Plot of X, number of dependents, against sample percentile]

F Test
\[ F^* = \frac{MSR}{MSE} = \frac{SSR / 1}{SSE / (n - 2)} \]
If $F^* > F(1 - \alpha;\ 1;\ n - 2)$, reject $H_0$: $\beta = 0$ (in this case).
• MSR/MSE ≈ 1 suggests $\beta = 0$
• MSR/MSE big suggests $\beta \ne 0$

Assumptions of Linear Regression
Linear regression is based on several assumptions. If these assumptions are violated, the resulting model will be misleading. The principal assumptions are:
• The dependent and independent variables are linearly related.
• The errors associated with the model are not serially correlated.
• The errors are normally distributed and have constant variance.

Transformations
If the variables are not linearly related or the assumptions of regression are violated, the variables can be transformed to produce a possibly better model.

Transform X -> X²
X     Y    X²
−3    9    9
−2    4    4
−1    1    1
0     0    0
1     1    1
2     4    4
3     9    9
[Figure: plot of Y against X², both axes from 0 to 10]

General Linear Model
• The most general of all linear models
• Multiple predictor variables:
  - Metric
  - Categorical
  - Both
• Multiple dependent variables:
  - Metric
  - Categorical
  - Both
• Can be used to build complex models

Outline for Analyses
1. Define the problem/question.
2. Determine what data will be needed to address the problem/question.
3. Collect the data.
4. Graph the data.
5. Analyze the data using the appropriate tool.
6. “Fix” the problem.
7. Evaluate the effectiveness of the “fix.”
8. Start again.
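The effect of the X -> X² transformation on the Transformations slide can be checked numerically. The sketch below fits a least squares line to the raw X values and then to X², and reports r² for each; the helper name `fit_and_r2` is an assumption introduced for this illustration.

```python
# Data from the Transformations slide: Y equals X squared
X = [-3, -2, -1, 0, 1, 2, 3]
Y = [9, 4, 1, 0, 1, 4, 9]


def fit_and_r2(xs, ys):
    """Fit Yhat = b * X + a by least squares and return (b, a, r2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    sse = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    ssto = sum((y - y_bar) ** 2 for y in ys)
    return b, a, 1 - sse / ssto


# Straight line on the raw predictor: the fitted line is flat and explains nothing
b_raw, a_raw, r2_raw = fit_and_r2(X, Y)

# Straight line on the transformed predictor X^2: a perfect fit
X_squared = [x ** 2 for x in X]
b_sq, a_sq, r2_sq = fit_and_r2(X_squared, Y)

print(f"Raw X:         Yhat = {b_raw:.1f}X + {a_raw:.1f},   r^2 = {r2_raw:.2f}")
print(f"Transformed X: Yhat = {b_sq:.1f}X^2 + {a_sq:.1f}, r^2 = {r2_sq:.2f}")
```

A low r² on the raw variables therefore does not show that the variables are unrelated; here the relationship is exact but not linear in X.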
Choice of Statistical Technique: Independent Variable Categorical
• One independent variable
  - Categorical dependent variable: one, χ²; many, χ² (layered)
  - Metric dependent variable: one, t-test; many, MANOVA
  - Graphical: histogram type, box plot
• Many independent variables
  - Categorical dependent variable: one, χ²; many, χ² (layered)
  - Metric dependent variable: one, ANOVA; many, MANOVA
  - Both: GLM
  - Graphical: box plots

Choice of Statistical Technique: Independent Variable Metric
• One independent variable
  - Categorical dependent variable: one, logit; many, GLM
  - Metric dependent variable: one, simple regression; many, GLM
  - Both: MANCOVA
• Many independent variables
  - Categorical dependent variable: one, logit; many, GLM
  - Metric dependent variable: one, multiple regression; many, GLM
  - Both: GLM; neural net
• Graphical: scatterplot

Choice of Statistical Technique: Independent Variable Both
• Categorical dependent variable: one, ANCOVA; many, MANCOVA
• Metric dependent variable: one, simple regression; many, multiple regression
• Both: GLM; neural net

End of Chapter 7