Statistical Design of Experiments
SECTION II
REVIEW OF STATISTICS
Dr. Gary Blau, Sean Han
Monday, Aug 13, 2007

INTRODUCTION
• Difference between statistics and probability
• Statistical Inference
  – Samples and populations
  – Intro to JMP software package
  – Central limit theorem
  – Confidence intervals
  – Hypothesis testing
• Regression and modeling fundamentals
  – Introduction to Model Building
  – Simple linear regression
  – Multiple linear regression
  – Model Building

PROBABILITY VS STATISTICS
• Problem: Dealing with sources of variability
  – Approach: Probability is the language used to characterize quantitative variability in random experiments
• Problem: Understanding the behavior of a process from random experiments on the process
  – Approach: Statistics allows us to infer process behavior from a small number of experiments or trials

POPULATION VS SAMPLE
Samples drawn from the population are used to infer things about the population.
[Figure: a population with Sample 1, Sample 2 and Sample 3 drawn from it]

BATCH REACTOR OPTIMIZATION EXAMPLE
A new small molecule API, designated simply C, is being produced in a batch reactor in a pilot plant. Two liquid raw materials A and B are added to the reactor and the reaction $A + B \xrightarrow{K_1} C$ takes place. ($K_1$ is the reaction rate constant.)
• There are various controllable factors for the reactor, some of which are:
  – Temperature
  – Agitation rate
  – A/B feed ratio
  – ...
• Adjusting the values, or levels, of these factors may change the yield of C
• We would like to find some combination of these levels that will maximize the yield of C

STATISTICAL INFERENCE
Suppose 10 different batches are run and the yield of C at the end of the reaction is measured. The properties of the population (i.e.
all future batches) can be estimated from the properties of this sample of 10 batch runs. Specifically, it is possible to estimate the parameters:
  – Central tendency: mean, median, mode
  – Scatter or variability: variance, standard deviation (also skewness, kurtosis)

RANDOM SAMPLE
Each member of the population has an equal chance of being selected for the sample. (In the example, this means that each batch of material is made under the same processing conditions and differs only in the time at which it was run.)

MEAN OF A SAMPLE
The average value of the n batches in the sample is called the sample mean $\bar{X}$:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
where $X_i$ is the yield of the ith batch and $n$ is the sample size. It can be used to estimate the central tendency of the population, the population mean $\mu$.

VARIANCE OF A SAMPLE
• The variance of a sample of size n is
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
• The population variance, $\sigma^2$, can be inferred from $s^2$

INTRODUCTION TO JMP
• Background
  – JMP is a statistical design and analysis package. JMP helps you explore data, fit models, and discover patterns
  – JMP is from the SAS Institute, a large private research institute specializing in data analysis software.
• Features
  – The emphasis in JMP is on working interactively with data.
  – Simple and informative graphics and plots are often shown automatically to facilitate discovery of behavioral patterns.
• Limitations of JMP
  – Large jobs: JMP is not suitable for problems with large data sets. JMP data tables must fit in the main memory of your PC. JMP graphs everything, and graphs can become expensive and cluttered when they have many thousands of points.
  – Specialized statistics: JMP does only conventional data analysis. Consider another package for more complicated analyses (e.g. SAS, R and S-Plus).
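As a cross-check outside JMP, the sample mean and sample variance formulas given above can be sketched in plain Python. This is only an illustration: the function names `sample_mean` and `sample_variance` are hypothetical, and the ten yields are the granulator values quoted in Example 1.

```python
import math

def sample_mean(values):
    """Sum of the observations divided by the sample size n."""
    return sum(values) / len(values)

def sample_variance(values):
    """Squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (n - 1)

# Granulator yields (%) as listed in Example 1
yields = [79, 91, 83, 78, 90, 84, 93, 83, 83, 80]

x_bar = sample_mean(yields)
s2 = sample_variance(yields)
s = math.sqrt(s2)          # sample standard deviation
print(f"mean={x_bar:.2f}  variance={s2:.2f}  std dev={s:.2f}")
```

Dividing by n - 1 rather than n is what makes $s^2$ an unbiased estimate of the population variance.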
PROBABILITY DISTRIBUTION USING JMP (EXAMPLE 1)
• The yield measurements from a granulator are given below:
  79, 91, 83, 78, 90, 84, 93, 83, 83, 80 %
• Using the statistical software package JMP, calculate the mean, variance, and standard deviation of the data. Also, plot a distribution of the data.

RESULTS FOR EXAMPLE 1
[Figure: JMP output, not reproduced]

NORMAL DISTRIBUTION
• The outcomes of many physical phenomena frequently follow a single type of distribution, the Normal Distribution. (See Section I)
• If several samples are taken from a population, the distribution of the sample means begins to look like a normal distribution, regardless of the distribution of the event generating the samples

CENTRAL LIMIT THEOREM
If random samples of n observations are drawn from any population with finite mean $\mu$ and variance $\sigma^2$, then, when n is large, the sampling distribution of the sample mean $\bar{x}$ is approximately normal with mean and standard deviation:
$$E(\bar{X}) = \mu, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

EFFECTS OF SAMPLE SIZE
As the sample size, n, increases, the variance of the sample mean decreases.
[Figure: sampling distributions of $\bar{x}$ for n = 30 and n = 50]

SAMPLE SIZE EFFECTS (EXAMPLE 2)
• Take 5 measurements of the yield from a granulator and calculate the mean. Repeat this process 50 times and generate a distribution of mean values. The results are in the JMP data table S2E2.
• It can be shown that using 10 or 20 measurements in the first step gives greater accuracy and less variability.
• Note the change in the shape of the distributions as the individual sample size, n, increases.

RESULTS FOR EXAMPLE 2
[Figure: JMP output, not reproduced]
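The procedure of Example 2 can also be imitated by simulation. The sketch below is not the S2E2 data: it assumes a hypothetical, deliberately non-normal population (yields uniform between 70 and 100 %) and a hypothetical helper `sample_mean_spread`, and it compares the observed spread of the sample means with the central limit theorem value $\sigma/\sqrt{n}$.

```python
import random
import statistics

def sample_mean_spread(n, repeats, rng):
    """Standard deviation of `repeats` sample means, each from n uniform draws."""
    means = [
        statistics.fmean([rng.uniform(70.0, 100.0) for _ in range(n)])
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

# A uniform population on [70, 100] has sigma = (100 - 70) / sqrt(12)
SIGMA = 30.0 / 12 ** 0.5

rng = random.Random(0)   # fixed seed so the run is repeatable
for n in (5, 10, 20):
    observed = sample_mean_spread(n, repeats=50, rng=rng)
    predicted = SIGMA / n ** 0.5
    print(f"n={n:2d}  observed spread={observed:.2f}  sigma/sqrt(n)={predicted:.2f}")
```

With only 50 repeats the observed spread is itself noisy; raising `repeats` brings it closer to the predicted value.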
CONFIDENCE LIMITS
Confidence limits are used to express the validity of statements about the value of population parameters. For instance:
  – The yield C of the reactor in example 1 is 90% at a temperature of 250º F
  – The yield C of the reactor is not significantly changed when the temperature increases from 242º to 246º F
  – There is no significant difference between the variance of the output of C at 250º and 260ºC
The bounds on a population parameter θ take the form:
$$l \le \theta \le u$$
The bounds are based on
  – The size of the sample, n
  – The confidence level, $(1-\alpha)$; % confidence = $100(1-\alpha)$. For example, $\alpha = 0.1$ means that if we generated 100 such intervals, about 90 of them would be expected to contain the true (population) parameter
  – These are not Bayesian intervals (those will be discussed in the second module)

Z STATISTIC
• The Z statistic can be used to place confidence limits on the population mean when the population variance is known.
• Z is a normally distributed random variable with $\mu = 0$ and $\sigma^2 = 1$, i.e. $Z \sim N(0,1)$, with density
$$p(Z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{Z^2}{2}\right)$$
• From the central limit theorem, if n is large, then regardless of the population distribution
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \qquad \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$
so that, with probability $1-\alpha$,
$$\left|\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right| \le Z_{\alpha/2}$$
[Figure: Z distribution with the limits $-Z_{\alpha/2}$ and $Z_{\alpha/2}$ marked]

CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE KNOWN)
• Two sided confidence interval
$$\bar{x} - Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
• One sided confidence intervals
$$\mu \le \bar{x} + Z_{\alpha}\frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \mu \ge \bar{x} - Z_{\alpha}\frac{\sigma}{\sqrt{n}}$$

t STATISTIC
• The t statistic is used to determine confidence limits when the population variance $\sigma^2$ is unknown and must be estimated from the sample variance $s^2$, i.e.
$$T = \frac{\bar{X}_n - \mu}{S/\sqrt{n}} \sim t(n-1)$$
the t distribution with n-1 degrees of freedom (df).
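The two-sided interval for a known population variance described above can be sketched with the Python standard library, where `NormalDist().inv_cdf` supplies $Z_{\alpha/2}$. The function name `z_confidence_interval` and the numbers passed to it are hypothetical; in particular, treating σ = 5.0 as known is an assumption made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical inputs: sample mean of 84.4 from n = 10 batches, sigma assumed known
low, high = z_confidence_interval(84.4, 5.0, 10)
print(f"95% interval for the mean: {low:.2f} to {high:.2f}")
```

When σ has to be estimated from the same sample, the t quantile with n-1 degrees of freedom replaces the normal quantile, which widens the interval for small n.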
COMPARISON OF Z AND t
[Figure: Z distribution compared with t distributions for df = 1, df = 2 and df = 3]

CONFIDENCE LIMITS ON THE POPULATION MEAN (POPULATION VARIANCE UNKNOWN)
• Two sided confidence interval
$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$$
• One sided confidence intervals
$$\mu \le \bar{x} + t_{\alpha,\,n-1}\frac{s}{\sqrt{n}} \quad \text{or} \quad \mu \ge \bar{x} - t_{\alpha,\,n-1}\frac{s}{\sqrt{n}}$$

CONFIDENCE LIMITS ON THE DIFFERENCE OF TWO MEANS
To get confidence limits on the difference of the means of two different populations, $\mu_1 - \mu_2$, we sample from the two populations and calculate the sample means $\bar{X}_1$, $\bar{X}_2$ and sample variances $S_1^2$, $S_2^2$ respectively. If we assume the populations have the same variance ($\sigma^2 = \sigma_1^2 = \sigma_2^2$), the sample variances of the two samples can be pooled into a single estimate of variance $S_p^2$. The pooled variance $S_p^2$ is calculated by:
$$S_p^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n+m-2}$$
where n and m are the sample sizes of the two samples from the different populations (written $n_1$ and $n_2$ in the formulas below).

Known population variance (Z distribution):
(i) Equal variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$
$$\bar{x}_1 - \bar{x}_2 - Z_{\alpha/2}\,\sigma\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + Z_{\alpha/2}\,\sigma\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{1/2}$$
(ii) Unequal variances
$$\bar{x}_1 - \bar{x}_2 - Z_{\alpha/2}\left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + Z_{\alpha/2}\left(\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}\right)^{1/2}$$

Unknown population variance (t distribution):
(i) Equal variances, $\sigma_1^2 = \sigma_2^2 = \sigma^2$ but unknown
$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,n_1+n_2-2}\,s_p\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,n_1+n_2-2}\,s_p\left(\frac{1}{n_1}+\frac{1}{n_2}\right)^{1/2}$$
(ii) Unequal variances
$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,\nu}\left(\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}\right)^{1/2} \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,\nu}\left(\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}\right)^{1/2}$$
$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}$$

EXAMPLE 3
Two samples, each of size 10, are taken from a dissolution apparatus. The first one is taken at a temperature of 35ºC and the second at a temperature of 37ºC.
The results of these experiments are in the JMP data table S2E3&4. Using JMP, calculate the mean of each sample and use confidence limits to determine if there is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).

RESULTS FOR EXAMPLE 3
There is a significant difference between the means of the two samples at the 95% confidence level (α = 0.05).

MODEL BUILDING
• Building a multiple linear regression model
  – Stepwise: Add and remove variables over several steps
  – Forward: Add variables sequentially
  – Backward: Remove variables sequentially
• JMP provides criteria for model selection such as R², Cp and MSE.

HYPOTHESIS TESTING
• Although confidence limits can be used to infer the quality of the population parameters from samples drawn from the population, an alternative and more convenient approach for model building is to use hypothesis testing.
• Whenever a decision is to be made about a population characteristic, make a hypothesis about the population parameter and test it with data from samples.
• In general, a statistical test evaluates the null hypothesis H0 against the alternative hypothesis Ha.
• In Example 3, H0 is that there is no difference between the two experiments. Ha is that there is a significant difference between the two experiments.

GENERAL PROCEDURE FOR HYPOTHESIS TESTING
1. Specify H0 and Ha to test. This typically has to be a hypothesis that makes a specific prediction.
2. Declare an alpha level.
3. Specify the test statistic against which the observed statistic will be compared.
4. Collect the data and calculate the observed t statistic.
5. Draw a conclusion. Reject the null hypothesis if and only if the magnitude of the observed t statistic is larger than the critical value.
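The five steps above can be sketched for a comparison of two means with equal but unknown variances. The data below are hypothetical stand-ins, not the contents of the JMP table, and `pooled_t_statistic` is an illustrative name. The critical value 2.101 is the tabulated t value for α/2 = 0.025 and 18 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample1, sample2):
    """Observed t statistic for H0: mu1 == mu2, assuming equal variances."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Steps 1-2: H0 is mu1 == mu2, Ha is mu1 != mu2, alpha = 0.05
# Step 3: two-sided critical value from a t table (alpha/2 = 0.025, df = 18)
T_CRITICAL = 2.101

# Step 4: hypothetical dissolution results (%) at the two temperatures
at_35c = [71.2, 69.8, 70.5, 72.1, 70.9, 69.5, 71.7, 70.2, 71.0, 70.6]
at_37c = [73.4, 72.8, 74.1, 73.0, 72.5, 73.9, 74.3, 72.9, 73.6, 73.2]
t_obs = pooled_t_statistic(at_35c, at_37c)

# Step 5: reject H0 only if |t| exceeds the critical value
reject_h0 = abs(t_obs) > T_CRITICAL
print(f"t = {t_obs:.2f}, reject H0: {reject_h0}")
```

The critical value is passed in as a constant so that the sketch needs only the standard library; a statistics package would normally compute it from α and the degrees of freedom.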
TYPE I AND TYPE II ERROR
Comparing the state of nature and the decision, we have four situations (state of nature: decision).
• Null hypothesis true: Fail to reject Null
• Null hypothesis true: Reject Null
• Null hypothesis false: Fail to reject Null
• Null hypothesis false: Reject Null

• Type I (α) error
  – False positive
  – We are observing a difference that does not exist
• Type II (β) error
  – False negative
  – We fail to observe a difference that does exist

                  Null True       Null False
Reject            Type I Error    Correct
Fail to Reject    Correct         Type II Error

P-VALUE
• The specific value of α at which the hypothesized value of the population parameter coincides with one of the confidence limits
  – The observed level of significance
• A more technical definition:
  – The probability (under the null hypothesis) of observing a test statistic that is at least as extreme as the one that is actually observed

INFERENCE SUMMARY
• Population properties are inferred from sample properties via the central limit theorem
• Confidence intervals tell us something about how well we understand a parameter, but give no guarantees (Type II error)
• P values give us a quick number to check how significant a test is.

MODEL BUILDING
• “All models are wrong, but some are useful.” – George Box
• “A model should be as simple as possible, but no simpler.” – Albert Einstein

REGRESSION MODEL
Regression analysis creates empirical mathematical models which determine which factors are important and quantify their effects on the process, but do not explain the underlying phenomena.
[Figure: Inputs → Process → Outputs, with process conditions acting on the process]
Outputs = f (inputs, process conditions, coefficients) + error
The coefficients are often called model parameters.
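Returning to the p-value definition given earlier in this section, the "at least as extreme" probability can be sketched for a Z statistic with the standard library. The function names are hypothetical, and a two-sided alternative is assumed.

```python
from statistics import NormalDist

def two_sided_p_value(z_observed):
    """P(|Z| >= |z_observed|) for Z ~ N(0, 1) under the null hypothesis."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z_observed)))

def reject_null(z_observed, alpha=0.05):
    """Reject H0 when the p-value falls below the chosen alpha level."""
    return two_sided_p_value(z_observed) < alpha

for z in (1.0, 1.96, 2.5):
    print(f"z={z:.2f}  p={two_sided_p_value(z):.4f}  reject H0: {reject_null(z)}")
```

Comparing the p-value with α leads to the same decision as comparing the observed statistic with the critical value; the p-value additionally shows how close the call was.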
SIMPLE LINEAR REGRESSION
Simple linear regression model (one independent factor or variable):
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where
  – $\varepsilon$ is a measure of experimental and modeling error
  – $\beta_0$, $\beta_1$ are regression coefficients
  – Y is the response
  – X is the factor
These models assume that we can measure X perfectly and that all error or variability is in Y.
For a one factor model, we obtain Y as a function of X.
[Figure: scatter plot of Y against X]

CORRELATION COEFFICIENTS
The correlation between the factor and the response is indicated by the regression coefficient $\beta_1$, which may be:
  – Zero
  – Positive
  – Negative

LACK OF CORRELATION
If $\beta_1 = 0$ the response does not depend on the factor.
[Figure: two scatter plots of Y against X showing no trend]

POSITIVE CORRELATION COEFFICIENTS
If $\beta_1 > 0$ the response and factor are positively correlated.
[Figure: scatter plot of Y increasing with X]

NEGATIVE CORRELATION COEFFICIENTS
If $\beta_1 < 0$ the response and factor are negatively correlated.
[Figure: scatter plot of Y decreasing with X]

LEAST SQUARES
The coefficients are usually estimated using the method of least squares (or the method of maximum likelihood).
[Figure: observed values $Y_i$ at $X_i$ scattered about the estimated regression line]
This method minimizes the sum of the squares of the differences between the value predicted by the model at the ith data point, $\hat{y}_i$, and the observed value $Y_i$ at the same value of $X_i$.

EXAMPLE 4
Use the previous yield data (T2E3&4) from different dissolution temperatures. Make a model that describes the effect of temperature on the yield. Note that here, temperature is the factor and the yield is the response.

RESULTS FOR EXAMPLE 4
[Figure: JMP output, not reproduced]
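For a single factor, the least squares estimates have a closed form, which the following sketch evaluates in plain Python. The temperatures and yields are hypothetical placeholders rather than the values in the JMP table, and `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope: change in response per unit of factor
    b0 = y_bar - b1 * x_bar     # intercept: the line passes through (x_bar, y_bar)
    return b0, b1

# Hypothetical temperatures (deg C) and yields (%), for illustration only
temps = [35, 35, 35, 37, 37, 37]
yields = [70.1, 70.9, 70.4, 73.0, 73.6, 73.3]

b0, b1 = fit_line(temps, yields)
print(f"yield = {b0:.2f} + {b1:.3f} * temperature")
```

With only two temperature levels the fitted line simply joins the two group means, so the slope equals the difference of the means divided by the 2 degree spacing.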
MULTIPLE LINEAR REGRESSION – ONE FACTOR
• If a simple linear regression equation does not adequately describe a set of data, then multiple linear regression models may be used.
• The multiple linear regression equation for a response variable Y and a single factor X takes the form of a polynomial:
$$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \dots + \beta_m X^m$$

EXAMPLE 5
Three samples of size 10 are taken from an API (Active Pharmaceutical Ingredient) plant. The first one was taken at a batch reactor pressure of 3 bar, the second at 3.5 bar, and the final at 4 bar. The data table is T2E5. Use regression analysis to build a model describing the effect of pressure on the yield of the API, using a squared term if necessary.

RESULTS FOR EXAMPLE 5
[Figure: JMP output, not reproduced]

MULTIPLE LINEAR REGRESSION – MORE THAN ONE FACTOR
• If more than one regressor is needed in the model, multiple linear regression models may be used to find the relationship between Y and a combination of factors X1, X2, …, Xp.
• The multiple linear regression equation for one response variable Y and factors X1, X2, …, Xp takes the form of a polynomial.

EXAMPLE OF MULTILINEAR REGRESSION MODEL
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_m X_m + \varepsilon$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1 X_2 + \beta_3 X_1 X_3 + \varepsilon$$
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 X_1^2 X_2^4 + \varepsilon$$
Y = β0 + β1X1X235 + β2X33 + ε

NONLINEAR MODELS
• A model is said to be nonlinear if, for some parameter $\beta_i$, the derivative of the response with respect to that parameter depends on the parameters:
$$\frac{\partial y}{\partial \beta_i} = g(\text{any } \beta_j)$$
• Examples
$$Y = \beta_0 \exp(-\beta_1 X_1) + \varepsilon$$
$$Y = \beta_0 X_1^{\beta_1} + \varepsilon$$

EVALUATING REGRESSION MODELS
• To determine if a model is adequate to describe the observed data, an analysis of variance may be performed
• Calculate the deviation between the data points and the values predicted by the model, called the error sum of squares (SSE)
$$SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
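A polynomial model of the kind described above is still linear in its coefficients, so it can be fitted by least squares. The sketch below builds the normal equations by hand and then evaluates the error sum of squares. It assumes hypothetical pressure and yield values in the spirit of Example 5 (not the T2E5 data), and all function names are illustrative.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree=2):
    """Least-squares coefficients [b0, b1, ..., b_degree] via the normal equations."""
    rows = [[x ** k for k in range(degree + 1)] for x in xs]   # design matrix
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)

def error_sum_of_squares(xs, ys, coeffs):
    """SSE: squared differences between observed and predicted responses."""
    predicted = [sum(b * x ** k for k, b in enumerate(coeffs)) for x in xs]
    return sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted))

# Hypothetical pressures (bar) and yields (%), for illustration only
pressures = [3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
yields = [80.2, 80.8, 85.1, 84.7, 83.0, 83.4]

coeffs = fit_polynomial(pressures, yields)
sse = error_sum_of_squares(pressures, yields, coeffs)
print("coefficients:", [round(b, 3) for b in coeffs], " SSE:", round(sse, 3))
```

Solving the normal equations directly is adequate for a small, low-degree example; numerical libraries usually prefer a QR or SVD based solver because powers of X become poorly conditioned as the degree grows.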
SUM OF SQUARES
• Calculate the total variation in the data, called the total sum of squares (SST)
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
• The amount of the total variation explained by the model is called the regression sum of squares (SSR)
$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
• It may be shown that: SST = SSR + SSE

SOURCES OF VARIABILITY
[Figure: not reproduced]

MEAN SQUARE
The mean square is the sum of squares divided by the associated degrees of freedom (DOF):
MSR = SSR / p
MSE = SSE / (n-p-1)
Total DOF = DOF for Regression + DOF for Error
n-1 = p + (n-p-1)
where p is the number of regression terms in the model (the coefficients other than β0). For simple linear regression p = 1, so MSE = SSE / (n-2).

F TEST
In multiple linear regression, the F statistic can be used in hypothesis testing. H0 in this test is that all the β’s except β0 are 0.

F TEST AND R²
• The mean squares are used to perform an F test since they estimate specific population variances
F = MSR / MSE
• The sums of squares are used to calculate the R² criterion
R² = SSR/SST = 1 - SSE/SST

EXAMPLE 6
Examine the variance of the model created to describe the effect of pressure on the yield of API (in Example 5).

RESULTS FOR EXAMPLE 6
Since the p value for the F test is <.0001, which is significant at the .05 level, the overall model is significant.

EXAMPLE 7
Build a model using JMP data table T2E7 with potential factors temperature, A/B feed ratio, and termination time, and the response variable yield. Determine which terms are significant. Build the model using the forward and backward selection techniques.

RESULTS FOR EXAMPLE 7
Temperature and termination time are significant at the .05 level.
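The sum of squares decomposition, the mean squares, the F statistic and R² can be collected in one small function. This is a sketch under stated assumptions: the fitted values come from a least squares model that includes an intercept, p counts the regression terms other than β0, and the error degrees of freedom are taken as n - p - 1 (which reduces to n - 2 for a single regressor). The function name `anova_summary` and the data are hypothetical.

```python
def anova_summary(ys, y_hats, p):
    """Sums of squares, mean squares, F and R^2 for a fitted regression model.

    ys     -- observed responses
    y_hats -- responses predicted by the least-squares model
    p      -- number of regression terms excluding the intercept
    """
    n = len(ys)
    y_bar = sum(ys) / n
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))   # unexplained
    ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)           # explained
    sst = sum((y - y_bar) ** 2 for y in ys)                       # total
    msr = ssr / p
    mse = sse / (n - p - 1)
    return {"SST": sst, "SSR": ssr, "SSE": sse,
            "MSR": msr, "MSE": mse, "F": msr / mse, "R2": ssr / sst}

# Hypothetical data: the least-squares line through these points is y = 1.1 + 0.6 x
xs = [0, 1, 2, 3]
ys = [1, 2, 2, 3]
y_hats = [1.1 + 0.6 * x for x in xs]

summary = anova_summary(ys, y_hats, p=1)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The observed F is then compared with the critical F value for p and n - p - 1 degrees of freedom, or equivalently its p value is compared with α, as in Example 6.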