Statistical Design of Experiments
SECTION II: REVIEW OF STATISTICS
Dr. Gary Blau, Sean Han
Monday, Aug 13, 2007
INTRODUCTION
• Difference between statistics and probability
• Statistical Inference
– Samples and populations
– Intro to JMP software package
– Central limit theorem
– Confidence intervals
– Hypothesis testing
• Regression and modeling fundamentals
– Introduction to Model Building
– Simple linear regression
– Multiple linear regression
– Model Building
PROBABILITY VS STATISTICS
Problem: Dealing with sources of variability
Approach: Probability is the language used to characterize quantitative variability in random experiments

Problem: Understanding the behavior of a process from random experiments on the process
Approach: Statistics allows us to infer process behavior from a small number of experiments or trials
POPULATION VS SAMPLE
Samples drawn from the population are used to
infer things about the population
[Figure: Samples 1, 2, and 3 drawn from a single population]
BATCH REACTOR OPTIMIZATION
EXAMPLE
A new small molecule API, designated
simply C, is being produced in a batch
reactor in a pilot plant. Two liquid raw
materials A and B are added to the reactor
and the reaction A + B → C takes place
(K1 is the reaction rate constant).
BATCH REACTOR OPTIMIZATION
EXAMPLE
• There are various controllable factors
for the reactor, some of which are:
– Temperature
– Agitation rate
– A/B feed ratio
– …
• Adjusting the values, or levels, of these factors may change the yield of C
• We would like to find the combination of these levels that maximizes the yield of C
STATISTICAL INFERENCE
Suppose 10 different batches are run and the yield of C at the end of the reaction is measured.
The properties of the population (i.e. all future
batches) can be estimated from the properties of
this sample of 10 batch runs.
Specifically it is possible to estimate the
parameters:
– Central Tendency: Mean, Median, Mode
– Scatter or Variability: Variance, Standard Deviation (Skewness, Kurtosis)
RANDOM SAMPLE
Each member of the population has
an equal chance of being selected
for the sample. (In the example, it
means that each batch of material is made under the same processing conditions and differs only in the time at which it was run.)
MEAN OF A SAMPLE
The average value of the n batches in the sample is called the sample mean, X̄:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where X_i is the yield of the ith batch and n is the sample size.

The sample mean can be used to estimate the central tendency of the population, i.e. the population mean μ.
VARIANCE OF A SAMPLE
• The variance of a sample of size n is

$$s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$$

• The population variance, σ², can be inferred from s²
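As a quick illustration of the two formulas above, here is a minimal Python sketch (not JMP, and with made-up batch yields) that applies them directly; numpy's built-ins with ddof=1 give the same answers.

```python
# Sample mean and sample variance, implemented from the formulas above.
import numpy as np

def sample_mean(x):
    """X-bar = (1/n) * sum of X_i"""
    x = np.asarray(x, dtype=float)
    return x.sum() / x.size

def sample_variance(x):
    """s^2 = sum of (X_i - X-bar)^2 / (n - 1)"""
    x = np.asarray(x, dtype=float)
    return ((x - sample_mean(x)) ** 2).sum() / (x.size - 1)

yields = [78.0, 82.5, 80.1, 79.4, 81.2]      # hypothetical batch yields (%)
print(sample_mean(yields))                    # same as np.mean(yields)
print(sample_variance(yields))                # same as np.var(yields, ddof=1)
```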
INTRODUCTION TO JMP
• Background
– JMP is a statistical design and analysis
package. JMP helps you explore data, fit
models, and discover patterns
– JMP is developed by SAS Institute, a large privately held company specializing in data analysis software.
• Features
– The emphasis in JMP is to interactively work
with data.
– Simple and informative graphics and plots
are often automatically shown to facilitate
discovery of behavioral patterns.
INTRODUCTION TO JMP
• Limitations of JMP
– Large jobs
JMP is not suitable for problems with very large data sets: JMP data tables must fit in the main memory of your PC. JMP also graphs everything, and graphs become slow and cluttered when they contain many thousands of points.
– Specialized Statistics
JMP does only conventional data analysis. Consider another package (e.g. SAS, R, or S-Plus) for more specialized analyses.
PROBABILITY DISTRIBUTION
USING JMP (EXAMPLE 1)
• The yield measurements from a
granulator are given below:
79, 91, 83, 78, 90, 84, 93, 83, 83, 80 %
• Using the statistical software
package JMP, calculate the mean,
variance, and standard deviation of
the data. Also, plot a distribution of
the data.
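For readers without JMP, a short Python sketch that reproduces the summary statistics requested in Example 1 (the distribution plot itself is left to JMP); the sample mean works out to 84.4%.

```python
# Summary statistics for the Example 1 granulator yields (cross-check of the JMP output).
import numpy as np

yields = np.array([79, 91, 83, 78, 90, 84, 93, 83, 83, 80], dtype=float)

mean = yields.mean()
variance = yields.var(ddof=1)     # sample variance (n - 1 in the denominator)
std_dev = yields.std(ddof=1)      # sample standard deviation

print(f"mean = {mean:.2f}%, variance = {variance:.2f}, std dev = {std_dev:.2f}%")
```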
RESULTS FOR EXAMPLE 1
NORMAL DISTRIBUTION
• The outcomes of many physical phenomena frequently follow a single type of distribution, the Normal Distribution. (See Section I)
• If several samples are taken from a
population, the distribution of sample
means begins to look like a normal
distribution regardless of the
distribution of the event generating the
sample
CENTRAL LIMIT THEOREM
If random samples of n observations are drawn from any population with finite mean μ and variance σ², then, when n is large, the sampling distribution of the sample mean X̄ is approximately normally distributed with mean and standard deviation:

$$E(\bar{X}) = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
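A small simulation sketch of the theorem (Python, illustrative parameters): even for a strongly skewed exponential population, the means of repeated samples have mean close to μ and standard deviation close to σ/√n.

```python
# Central limit theorem demo: sample means from a skewed (exponential) population.
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0          # an exponential with scale 2 has mean 2 and std dev 2
n = 40               # observations per sample
n_samples = 5000     # number of repeated samples

sample_means = rng.exponential(scale=sigma, size=(n_samples, n)).mean(axis=1)

print("mean of the sample means:   ", sample_means.mean())          # ~ mu = 2.0
print("std dev of the sample means:", sample_means.std(ddof=1))     # ~ sigma / sqrt(n)
print("theoretical sigma / sqrt(n):", sigma / np.sqrt(n))
```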
EFFECTS OF SAMPLE SIZE
As the sample size, n, increases, the
variance of the sample mean
decreases.
[Figure: sampling distributions of the sample mean X̄ for n = 30 and n = 50]
SAMPLE SIZE EFFECTS
(EXAMPLE 2)
• Take 5 measurements for the yield from a
granulator and calculate the mean. Repeat this
process 50 times and generate a distribution of
mean values. The results are in the JMP data table S2E2.
• It can be shown that using 10 or 20
measurements in the first step will give greater
accuracy and less variability.
• Note the change in the shape of the
distributions with an increase in the individual
sample size, n.
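The data table S2E2 is not reproduced here, but the Example 2 procedure is easy to mimic in Python with a simulated granulator (assumed normal yields with mean 84% and standard deviation 5%, purely illustrative) to see the sample-size effect directly.

```python
# Mimic Example 2: average n measurements, repeat 50 times, and watch the
# spread of the means shrink as n grows. The granulator is simulated.
import numpy as np

rng = np.random.default_rng(1)

def distribution_of_means(n, repeats=50, mu=84.0, sd=5.0):
    means = rng.normal(mu, sd, size=(repeats, n)).mean(axis=1)
    return means.mean(), means.std(ddof=1)

for n in (5, 10, 20):
    center, spread = distribution_of_means(n)
    print(f"n = {n:2d}: mean of means = {center:5.2f}, std dev of means = {spread:4.2f}")
```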
RESULTS FOR EXAMPLE 2
CONFIDENCE LIMITS
Confidence limits are used to express
the validity of statements about the
value of population parameters. For
instance:
– The yield C of the reactor in example 1 is
90% at a temperature of 250º F
– The yield C of the reactor is not significantly
changed when the temperature increases
from 242º to 246º F
– There is no significant difference between
the variance of the output of C at 250º and
260ºC
CONFIDENCE LIMITS
The bounds on a population parameter θ take the form:

$$l \le \theta \le u$$

where l and u are the lower and upper confidence limits.
CONFIDENCE LIMITS
The bounds are based on
– The size of the sample, n
– The confidence level, (1 − α)
% confidence = 100(1 − α)
i.e., α = 0.1 means that if we generated 100 such intervals, on average 90 of them would contain the true (population) parameter
– These are not Bayesian intervals (those will be discussed in the second module)
Z STATISTIC
• The Z statistic can be used to place
confidence limits on the population mean
when the population variance is known.
• The Z distribution is a normally distributed random variable with μ = 0 and σ² = 1, i.e. Z ~ N(0, 1):

$$p(Z) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{Z^2}{2}\right)$$
Z STATISTIC
From the central limit theorem, if n is large,

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{regardless of the population distribution,}$$

so that

$$\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0,1), \quad \text{the } Z \text{ distribution,}$$

and with probability 1 − α,

$$-Z_{\alpha/2} \;\le\; \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \;\le\; Z_{\alpha/2}$$

[Figure: standard normal density with critical values −Z_{α/2} and Z_{α/2}]
CONFIDENCE LIMITS ON THE POPULATION
MEAN (POPULATION VARIANCE KNOWN)
• Two sided confidence interval
$$\bar{x} - \frac{\sigma}{\sqrt{n}}\, Z_{\alpha/2} \;\le\; \mu \;\le\; \bar{x} + \frac{\sigma}{\sqrt{n}}\, Z_{\alpha/2}$$

• One sided confidence intervals

$$\mu \;\le\; \bar{x} + \frac{\sigma}{\sqrt{n}}\, Z_{\alpha} \qquad \text{or} \qquad \mu \;\ge\; \bar{x} - \frac{\sigma}{\sqrt{n}}\, Z_{\alpha}$$

[Figure: standard normal densities showing the two-sided critical values ±Z_{α/2} and the one-sided critical value Z_α]
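A minimal Python sketch of the two-sided interval above, assuming the population standard deviation σ is known (the value used here is illustrative); the Example 1 yields serve as the sample.

```python
# Two-sided (1 - alpha) confidence interval for mu when sigma is known (Z interval).
import numpy as np
from scipy import stats

x = np.array([79, 91, 83, 78, 90, 84, 93, 83, 83, 80], dtype=float)  # Example 1 yields
sigma = 5.0                                    # assumed known population std dev
alpha = 0.05

x_bar = x.mean()
half_width = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(x.size)  # Z_{alpha/2} * sigma / sqrt(n)

print(f"{100 * (1 - alpha):.0f}% CI for mu: [{x_bar - half_width:.2f}, {x_bar + half_width:.2f}]")
```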
t STATISTIC
• The t statistic is used to determine confidence limits when the population variance σ² is unknown and must be estimated by the sample variance s², i.e.

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \;\sim\; t(n-1),$$

the t distribution with n − 1 degrees of freedom (df).
COMPARISON OF Z AND t
[Figure: t distributions with df = 1, 2, and 3 compared with the Z (standard normal) distribution]
CONFIDENCE LIMITS ON THE POPULATION
MEAN (POPULATION VARIANCE UNKNOWN)
• Two sided confidence interval

$$\bar{x} - \frac{s}{\sqrt{n}}\, t_{\alpha/2,\,n-1} \;\le\; \mu \;\le\; \bar{x} + \frac{s}{\sqrt{n}}\, t_{\alpha/2,\,n-1}$$

• One sided confidence intervals

$$\mu \;\le\; \bar{x} + \frac{s}{\sqrt{n}}\, t_{\alpha,\,n-1} \qquad \text{or} \qquad \mu \;\ge\; \bar{x} - \frac{s}{\sqrt{n}}\, t_{\alpha,\,n-1}$$
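The same calculation when σ must be estimated from the data: scipy's t distribution supplies the critical value t_{α/2, n−1} (Python sketch, again on the Example 1 yields).

```python
# Two-sided (1 - alpha) confidence interval for mu when sigma is unknown (t interval).
import numpy as np
from scipy import stats

x = np.array([79, 91, 83, 78, 90, 84, 93, 83, 83, 80], dtype=float)
alpha = 0.05

n = x.size
x_bar = x.mean()
s = x.std(ddof=1)
half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)   # t_{alpha/2, n-1} * s / sqrt(n)

print(f"{100 * (1 - alpha):.0f}% CI for mu: [{x_bar - half_width:.2f}, {x_bar + half_width:.2f}]")
```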
CONFIDENCE LIMITS ON THE
DIFFERENCE OF TWO MEANS
To get confidence limits on the difference of the means of two different populations, μ1 − μ2, we sample from the two populations and calculate the sample means X̄1, X̄2 and sample variances S1², S2² respectively.

If we assume the populations have the same variance (σ² = σ1² = σ2²), the sample variances of the two samples can be pooled to give a single estimate of variance, Sp². The pooled variance Sp² is calculated by:

$$S_p^2 = \frac{(n-1)S_1^2 + (m-1)S_2^2}{n + m - 2}$$

where n and m are the sizes of the two samples from the different populations.
CONFIDENCE LIMITS ON THE
DIFFERENCE OF TWO MEANS
Known population variance (Z distribution):

(i) Equal variances, σ1² = σ2² = σ²:

$$\bar{x}_1 - \bar{x}_2 - Z_{\alpha/2}\,\sigma\!\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)^{1/2} \;\le\; \mu_1 - \mu_2 \;\le\; \bar{x}_1 - \bar{x}_2 + Z_{\alpha/2}\,\sigma\!\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)^{1/2}$$

(ii) Unequal variances:

$$\bar{x}_1 - \bar{x}_2 - Z_{\alpha/2}\!\left(\tfrac{\sigma_1^2}{n_1}+\tfrac{\sigma_2^2}{n_2}\right)^{1/2} \;\le\; \mu_1 - \mu_2 \;\le\; \bar{x}_1 - \bar{x}_2 + Z_{\alpha/2}\!\left(\tfrac{\sigma_1^2}{n_1}+\tfrac{\sigma_2^2}{n_2}\right)^{1/2}$$
CONFIDENCE LIMITS ON THE
DIFFERENCE OF TWO MEANS
Unknown population variance (t distribution):

(i) σ1² = σ2² = σ², but unknown:

$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,n_1+n_2-2}\, S_p\!\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)^{1/2} \;\le\; \mu_1 - \mu_2 \;\le\; \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,n_1+n_2-2}\, S_p\!\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)^{1/2}$$

(ii) Unequal variances:

$$\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,\nu}\!\left(\tfrac{S_1^2}{n_1}+\tfrac{S_2^2}{n_2}\right)^{1/2} \;\le\; \mu_1 - \mu_2 \;\le\; \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,\nu}\!\left(\tfrac{S_1^2}{n_1}+\tfrac{S_2^2}{n_2}\right)^{1/2}$$

with degrees of freedom

$$\nu = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1-1) + \left(s_2^2/n_2\right)^2/(n_2-1)}$$
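A Python sketch of case (ii) above, the unequal-variance (Welch) interval, using two small hypothetical samples; scipy's ttest_ind with equal_var=False applies the same degrees-of-freedom formula for the corresponding hypothesis test.

```python
# (1 - alpha) confidence interval for mu1 - mu2 with unknown, unequal variances.
import numpy as np
from scipy import stats

x1 = np.array([78.1, 80.4, 79.6, 81.0, 77.9, 80.2])   # hypothetical sample 1
x2 = np.array([82.3, 83.1, 81.8, 84.0, 82.6, 83.4])   # hypothetical sample 2
alpha = 0.05

n1, n2 = x1.size, x2.size
v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2
nu = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))   # Satterthwaite df

diff = x1.mean() - x2.mean()
half_width = stats.t.ppf(1 - alpha / 2, df=nu) * np.sqrt(v1 + v2)
print(f"{100 * (1 - alpha):.0f}% CI for mu1 - mu2: [{diff - half_width:.2f}, {diff + half_width:.2f}]")

# Equivalent hypothesis test of H0: mu1 = mu2 (Welch's t test)
print(stats.ttest_ind(x1, x2, equal_var=False))
```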
EXAMPLE 3
Two samples, each of size 10, are taken from a
dissolution apparatus. The first one is taken at
a temperature of 35ºC and the second at a
temperature of 37ºC. The results of these
experiments are the JMP data table S2E3&4.
Using JMP, calculate the mean of each sample
and use confidence limits to determine if there
is a significant difference between the means of
the two samples at the 95% confidence level (α
= 0.05).
RESULTS FOR EXAMPLE 3
There is a significant difference between the means of
the two samples at the 95% confidence level (α = 0.05).
MODEL BUILDING
• Building multiple linear regression
model
– Stepwise: Add and remove variables over
several steps
– Forward: Add variables sequentially
– Backward: Remove variables sequentially
• JMP provides criteria for model selection such as R², Cp, and MSE.
HYPOTHESIS TESTING
• Although confidence limits can be used to draw inferences about population parameters from samples drawn from the population, an alternative and more convenient approach for model building is hypothesis testing.
• Whenever a decision is to be made about a population characteristic, make a hypothesis about the population parameter and test it with data from samples.
• Generally, a statistical test tests the null hypothesis H0 against the alternative hypothesis Ha.
• In Example 3, H0 is that there is no difference between the two experiments; Ha is that there is a significant difference between them.
GENERAL PROCEDURE FOR
HYPOTHESIS TESTING
1. Specify the H0 and Ha to test. The null hypothesis typically makes a specific prediction.
2. Declare an alpha level.
3. Specify the test statistic against which the observed statistic will be compared.
4. Collect the data and calculate the observed test statistic.
5. Draw a conclusion: reject the null hypothesis if and only if the observed test statistic is more extreme than the critical value.
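The five steps above, sketched in Python for an Example 3 style question; the course data table S2E3&4 is not reproduced here, so the two dissolution samples below are hypothetical.

```python
# Hypothesis-testing procedure for comparing two means (pooled two-sample t test).
# Step 1 -- H0: mu_35C = mu_37C   vs   Ha: mu_35C != mu_37C
import numpy as np
from scipy import stats

y_35 = np.array([71.2, 72.8, 70.5, 73.1, 71.9, 72.4, 70.8, 71.6, 72.0, 71.1])
y_37 = np.array([74.0, 75.2, 73.8, 74.9, 75.5, 74.3, 73.9, 74.6, 75.0, 74.2])

alpha = 0.05                                            # step 2
t_obs, p_value = stats.ttest_ind(y_35, y_37)            # steps 3-4 (pooled t statistic)
t_crit = stats.t.ppf(1 - alpha / 2, df=y_35.size + y_37.size - 2)

# Step 5: reject H0 if |t_obs| exceeds the critical value (equivalently, p < alpha).
print(f"t = {t_obs:.2f}, critical value = {t_crit:.2f}, p = {p_value:.4g}")
print("Reject H0" if abs(t_obs) > t_crit else "Fail to reject H0")
```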
TYPE I AND TYPE II ERROR
Comparing the state of nature and the decision, we have four situations:
• Null hypothesis true, fail to reject the null
• Null hypothesis true, reject the null
• Null hypothesis false, fail to reject the null
• Null hypothesis false, reject the null
TYPE I AND TYPE II ERROR
• Type I (α) error
– False positive
– We observe a difference that does not exist
• Type II (β) error
– False negative
– We fail to observe a difference that does exist
– Reject H0: Type I error if the null is true; correct decision if the null is false
– Fail to reject H0: correct decision if the null is true; Type II error if the null is false
P - VALUE
• The specific value of α at which the population parameter and one of the confidence limits coincide
– The observed level of significance
• A more technical definition:
– The probability (under the null
hypothesis) of observing a test
statistic that is at least as extreme as
the one that is actually observed
INFERENCE SUMMARY
• Population properties are inferred from
sample properties via the central limit
theorem
• Confidence intervals tell us something about how well we understand a parameter… but give no guarantees (Type II error)
• P values give us a quick number to
check to see how significant a test is.
MODEL BUILDING
• “All models are wrong, but some
are useful.”
– George Box
• “A model should be as simple as
possible, but no simpler.”
– Albert Einstein
REGRESSION MODEL
Regression analysis creates empirical mathematical models that determine which factors are important and quantify their effects on the process, but do not explain the underlying phenomena.
[Figure: block diagram of the process with Inputs and Process Conditions entering and Outputs leaving; the coefficients are often called model parameters]

Outputs = f(inputs, process conditions, coefficients) + error
SIMPLE LINEAR REGRESSION
Simple linear regression model (one independent
factor or variable):
Y = β0 + β1X + e
where e is a measure of experimental and modeling error
β0, β1 are regression coefficients
Y is the response
X is the factor
These models assume that we can measure X
perfectly and all error or variability is in Y.
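Since the Example 4 data table is not reproduced here, the sketch below fits Y = β0 + β1X + e to a small hypothetical temperature/yield data set; scipy.stats.linregress performs the least-squares fit described a few slides later.

```python
# Simple linear regression Y = b0 + b1*X + e fitted by least squares.
import numpy as np
from scipy import stats

temp = np.array([35, 35, 35, 35, 35, 37, 37, 37, 37, 37], dtype=float)   # factor X
yield_pct = np.array([71.2, 72.8, 70.5, 73.1, 71.9,
                      74.0, 75.2, 73.8, 74.9, 75.5])                      # response Y

fit = stats.linregress(temp, yield_pct)
print(f"b0 (intercept) = {fit.intercept:.2f}")
print(f"b1 (slope)     = {fit.slope:.2f}")
print(f"R^2 = {fit.rvalue ** 2:.3f}, p value for b1 = {fit.pvalue:.4g}")
```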
SIMPLE LINEAR REGRESSION
For a one factor model, we obtain Y as
a function of X
[Figure: scatter plot of Y versus X]
CORRELATION COEFFICIENTS
The correlation between the factor
and the response is indicated by the
regression coefficient b1 which may
be:
– Zero
– Positive
– Negative
LACK OF CORRELATION
If b1 = 0 the response does not
depend on the factor
[Figure: two scatter plots of Y versus X showing no trend (b1 = 0)]
POSITIVE CORRELATION
COEFFICIENTS
If b1 > 0 the response and factor are
positively correlated
[Figure: scatter plot showing Y increasing with X]
NEGATIVE CORRELATION
COEFFICIENTS
If b1 < 0 the response and factor are
negatively correlated
[Figure: scatter plot showing Y decreasing as X increases]
LEAST SQUARES
The coefficients are usually estimated using the method of
least squares (or Method of Maximum Likelihood)
[Figure: scatter plot with the estimated regression line; the observed value Yi at Xi deviates from the line]
This method minimizes the sum of the squares of the differences between the value predicted by the model at the ith data point, ŷi, and the observed value Yi at the same value of Xi.
EXAMPLE 4
Use the previous yield data
(T2E3&4) from different
dissolution temperatures. Make a
model that describes the effect of
temperature on the yield. Note that
here, temperature is the factor and
the yield is the response.
RESULTS FOR EXAMPLE 4
MULTIPLE LINEAR REGRESSION –
ONE FACTOR
• If a simple linear regression equation
does not adequately describe a set of
data then multiple linear regression
models may be used.
• Multiple linear regression equation for
response variable Y and a single factor X
takes the form of a polynomial:
Y = β0 + β1X + β2X² + β3X³ + … + βmXᵐ
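A Python sketch of the single-factor polynomial above with m = 2 (a quadratic); numpy.polyfit returns the least-squares coefficients. The pressure/yield values are hypothetical, in the spirit of the next example.

```python
# Quadratic model in one factor: Y = b0 + b1*X + b2*X^2
import numpy as np

pressure = np.array([3.0] * 5 + [3.5] * 5 + [4.0] * 5)              # factor X (bar)
yield_pct = np.array([70.1, 71.0, 69.8, 70.5, 70.7,
                      74.2, 73.8, 74.6, 73.9, 74.4,
                      72.0, 72.8, 71.5, 72.3, 72.6])                # response Y (%)

b2, b1, b0 = np.polyfit(pressure, yield_pct, deg=2)                 # highest power first
print(f"Y = {b0:.1f} + {b1:.1f}*X + {b2:.1f}*X^2 (approximately)")

y_hat = np.polyval([b2, b1, b0], pressure)                          # fitted values
r2 = 1 - np.sum((yield_pct - y_hat) ** 2) / np.sum((yield_pct - yield_pct.mean()) ** 2)
print(f"R^2 = {r2:.3f}")
```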
EXAMPLE 5
Three samples of size 10 are taken
from an API (Active Pharmaceutical
Ingredient) plant. The first one was
taken at a batch reactor pressure of
3 bar, the second at 3.5 bar, and the
final at 4 bar. The data table is
T2E5. Use regression analysis to
build a model describing the effect
of pressure on the yield of the API,
using a squared term if necessary.
RESULTS FOR EXAMPLE 5
MULTIPLE LINEAR REGRESSION –
MORE THAN ONE FACTOR
• If more than one regressor is needed in the model, multiple linear regression models may be used to find a relationship between Y and a combination of factors X1, X2, …, Xp.
• Multiple linear regression equation for
one response variable Y and factors X1,
X2, …, Xp takes the form of a polynomial.
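A minimal multi-factor fit using numpy.linalg.lstsq on a design matrix with an intercept column; the factor names echo the batch reactor example, and the data are hypothetical.

```python
# Multiple linear regression with two factors: Y = b0 + b1*Temperature + b2*FeedRatio + error
import numpy as np

temperature = np.array([240., 245., 250., 255., 260., 240., 250., 260.])   # factor X1
feed_ratio  = np.array([1.0, 1.0, 1.2, 1.2, 1.1, 1.3, 1.3, 1.0])           # factor X2
yield_pct   = np.array([70.2, 72.5, 75.1, 77.8, 79.0, 71.0, 76.2, 78.5])   # response Y

X = np.column_stack([np.ones_like(temperature), temperature, feed_ratio])   # design matrix
coeffs, *_ = np.linalg.lstsq(X, yield_pct, rcond=None)
b0, b1, b2 = coeffs
print(f"Y = {b0:.2f} + {b1:.3f}*Temperature + {b2:.2f}*FeedRatio (approximately)")
```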
EXAMPLE OF MULTILINEAR
REGRESSION MODEL
Y = β0 + β1X1 + β2X2 + β3X3 + … + βmXm + ε
Y = β0 + β1X1 + β2X1X2 + β3X1X3 + ε
Y = β0 + β1X1 + β2X1² + β3X2 + β4X1²X2⁴ + ε
Y = β0 + β1X1X2^3.5 + β2X3³ + ε
NONLINEAR MODELS
• A model is said to be nonlinear if

$$\frac{\partial y}{\partial \beta_i} = g(\text{any } \beta_j) \quad \text{for some } i,$$

i.e., if the derivative of the response with respect to at least one parameter still depends on the parameters.
• Examples:
Y = β0 exp(−β1X1) + ε
Y = β0 X1^β1 + ε
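Nonlinear models need an iterative fitting routine rather than the closed-form least-squares solution; a sketch with scipy.optimize.curve_fit for the exponential example above, on synthetic data.

```python
# Nonlinear regression Y = b0 * exp(-b1 * X) + error, fitted iteratively.
import numpy as np
from scipy.optimize import curve_fit

def model(x, b0, b1):
    return b0 * np.exp(-b1 * x)

rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 30)
y = model(x, 10.0, 0.7) + rng.normal(0.0, 0.2, size=x.size)       # synthetic data

(b0_hat, b1_hat), cov = curve_fit(model, x, y, p0=[5.0, 1.0])     # p0 = starting guess
print(f"b0 = {b0_hat:.2f}, b1 = {b1_hat:.2f} (estimates)")
```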
EVALUATING REGRESSION
MODELS
• To determine if a model is adequate to
describe the observed data, the
analysis of variance may be performed
• Calculate the sum of squared deviations between the data points and the values predicted by the model, called the error sum of squares (SSE):
$$SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
SUM OF SQUARES
• Calculate the total variance in the data, called the
total sum of squares (SST)
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

• The amount of the total variance explained by the model is called the regression sum of squares (SSR):

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
• It may be shown that:
SST = SSR + SSE
SOURCES OF VARIABILITY
MEAN SQUARE
The mean square is the sum of squares divided by the associated degrees of freedom (DOF):
MSR = SSR / p
MSE = SSE / (n − p − 1)   (which reduces to SSE / (n − 2) for simple linear regression, p = 1)
Total DOF = DOF for regression + DOF for error:
n − 1 = p + (n − p − 1)
where p is the number of regressor terms in the model (excluding the intercept).
F TEST
In multiple linear regression, the F statistic can be used for hypothesis testing.
H0 in this test is that all the β's except β0 are 0.
F TEST AND R2
• The mean squares are used to perform
an F test since they estimate specific
population variances
F = MSR / MSE
• The sums of squares are used to calculate the R² criterion:
R² = SSR / SST = 1 − SSE / SST
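A sketch tying the last few slides together: fit a model, compute SSE, SSR, and SST, form the mean squares, F statistic and R², and get the p value for the F test from scipy. The data are the hypothetical quadratic example used earlier.

```python
# ANOVA-style evaluation of a fitted regression model.
import numpy as np
from scipy import stats

x = np.array([3.0] * 5 + [3.5] * 5 + [4.0] * 5)                   # factor (hypothetical)
y = np.array([70.1, 71.0, 69.8, 70.5, 70.7,
              74.2, 73.8, 74.6, 73.9, 74.4,
              72.0, 72.8, 71.5, 72.3, 72.6])                      # response (hypothetical)

p = 2                                                             # regressors: X and X^2
y_hat = np.polyval(np.polyfit(x, y, deg=p), x)

n = y.size
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sst = np.sum((y - y.mean()) ** 2)                                 # equals ssr + sse

msr, mse = ssr / p, sse / (n - p - 1)
F = msr / mse
p_value = stats.f.sf(F, p, n - p - 1)                             # upper-tail F probability
r_squared = ssr / sst

print(f"F = {F:.2f}, p = {p_value:.4g}, R^2 = {r_squared:.3f}")
```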
EXAMPLE 6
Examine the variance of the model
created to describe the effect of
pressure on the yield of API (in
Example 5).
RESULTS FOR EXAMPLE 6
Since the p value for the F test is < .0001, which is significant at the .05 level, the overall model is significant.
EXAMPLE 7
Build a model using the JMP data table T2E7 with potential factors temperature, A/B feed ratio, and termination time, and the response variable yield. Determine which terms are significant. Build the model using the forward and backward selection techniques.
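Forward selection can also be sketched outside JMP. The snippet below (Python with statsmodels, simulated data standing in for T2E7) adds at each step the candidate factor whose coefficient has the smallest p value, stopping when nothing remains below α = 0.05.

```python
# Forward selection by p value, a rough stand-in for JMP's stepwise platform.
# The data are simulated so that yield depends on temperature and termination
# time but not on feed ratio (so feed ratio should not be selected).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 30
factors = {
    "temperature":      rng.uniform(240, 260, n),
    "feed_ratio":       rng.uniform(0.8, 1.2, n),
    "termination_time": rng.uniform(2.0, 6.0, n),
}
y = (0.4 * factors["temperature"] + 3.0 * factors["termination_time"]
     + rng.normal(0.0, 1.0, n))

selected, remaining, alpha = [], list(factors), 0.05
while remaining:
    # p value of each candidate's coefficient when added to the current model
    pvals = {}
    for name in remaining:
        X = sm.add_constant(np.column_stack([factors[f] for f in selected + [name]]))
        pvals[name] = sm.OLS(y, X).fit().pvalues[-1]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha:
        break
    selected.append(best)
    remaining.remove(best)

print("Selected factors:", selected)
```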
RESULTS FOR EXAMPLE 7
Temperature and termination time are significant at the .05 level.