Samples & Summary Measures
• Sample
A set of observations from a population: x1, x2, ..., xn
Example: Measure the diameters of 20 pistons produced on a production line, xi = diameter of piston #i.
• Summary Measures
- Sample Mean
- Sample Variance
Sample Mean
Just the average of the sample:
x̄ = (Σ_{i=1}^{n} xi) / n

Example: x1 = 10, x2 = 12, x3 = 16, x4 = 18

x̄ = (Σ_{i=1}^{4} xi) / 4 = (x1 + x2 + x3 + x4) / 4 = (10 + 12 + 16 + 18) / 4 = 56/4 = 14
NOTE: The sample mean is an unbiased estimator of the population mean, that is:

E(x̄) = E( (Σ_{i=1}^{4} xi) / 4 ) = (Σ_{i=1}^{4} E(xi)) / 4 = 4μ / 4 = μ

where μ is the population mean.
Sample Variance
The sample variance, S², is a measure of how widely dispersed the sample is. The sample variance is an estimator of the population variance, σ².

S² = Σ_{i=1}^{n} (xi − x̄)² / (n − 1)
Example: x1 = 10, x2 = 12, x3 = 16, x4 = 18, x̄ = 14

S² = [(10 − 14)² + (12 − 14)² + (16 − 14)² + (18 − 14)²] / (4 − 1)
   = [(−4)² + (−2)² + 2² + 4²] / 3 = (16 + 4 + 4 + 16) / 3 = 40/3
Question: Why n−1 instead of n?
• it provides an unbiased estimate: E(S²) = σ²
• there are only n−1 degrees of freedom
n−1 degrees of freedom
For a set of n points, if we know the sample mean and also know the values of n−1 of the points, we also know the value of the remaining point.
Example: x̄ = 14, x1 = 10, x2 = 12, x3 = 16, x4 = ?
Then, (10 + 12 + 16 + x4) / 4 = 14  ⟹  x4 = 18
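The definitions above can be checked with a short Python sketch; the standard-library `statistics` module uses the same n−1 denominator for the sample variance:

```python
from statistics import mean, variance

# The four observations from the example above
sample = [10, 12, 16, 18]

x_bar = mean(sample)        # (10 + 12 + 16 + 18) / 4 = 14
s2 = variance(sample)       # divides by n - 1, giving 40/3

# The same quantities computed directly from the definitions:
n = len(sample)
assert abs(x_bar - sum(sample) / n) < 1e-12
assert abs(s2 - sum((x - x_bar) ** 2 for x in sample) / (n - 1)) < 1e-9

print(x_bar, s2)
```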
Central Limit Theorem Revisited
When sampling from a population with mean μ and standard deviation σ, the sampling distribution of the sample mean, x̄, will tend to a normal distribution with mean μ and standard deviation σ/√n as the sample size n becomes large.
For "large enough" n,
x̄ ~ N(μ, σ/√n)
Note: As n gets larger, the variance (and standard deviation) of the sample mean gets smaller.
Confidence Intervals
Heart Valve Manufacturer (unsorted)

Dimension        Mean    Std. Deviation
Piston Diameter  0.060   0.0002
Sleeve Diameter  0.065   0.0002
Clearance        0.005   0.000283
Decision: Implement sorting with batches of 5
A random sample (after sorting has been
implemented) of 100 piston/valve assemblies yields
79 valid (meet tolerances) assemblies out of the 100
trials.
How do we know whether or not the process change
has really improved the resulting yield?
The yield (# good assemblies out of 100) is a
binomial random variable.
Our estimate of the mean (based on this sample) is
79% (or 79 out of 100).
One way of determining whether the process has
been improved is to construct a confidence interval
about our estimate.
To see how to do this, let X denote the number of
good assemblies in 100 trials. Note that we can use
the BINOMDIST function in excel to compute, for
example, the probability that X is within + or – 10 of
our estimate, 79:
P{69 ≤ X ≤ 89} = P{X ≤ 89} − P{X ≤ 69}
= BINOMDIST(89,100,0.79,TRUE) − BINOMDIST(69,100,0.79,TRUE)
≈ 0.9971 − 0.0123 ≈ 0.985
Another way of stating this in words is that 79 ± 10 is a 98.5% confidence interval for the number of valid assemblies out of 100.
The following table was constructed using the BINOMDIST function as described above. It gives confidence intervals for various confidence levels.

Confidence Interval   % Confidence
79 ± 2                37.6%
79 ± 4                67.4%
79 ± 8                94.9%
79 ± 10               98.5%
79 ± 14               99.9%

Note: the larger the interval, the more certain we become that it covers the true mean.
Note that the yield of the original process was 52%.
Since the lower limit of a 99.9% confidence interval
about our sample mean is 65% (substantially larger
than 52%) we can be pretty certain the process has
improved.
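The BINOMDIST calculation above can be reproduced without Excel. A minimal sketch using only Python's standard library (the `binom_cdf` helper is our own, not a library function):

```python
from math import comb

def binom_cdf(k, n, p):
    """P{X <= k} for X ~ Binomial(n, p): the analogue of BINOMDIST(k, n, p, TRUE)."""
    return sum(comb(n, i) * (p ** i) * ((1 - p) ** (n - i)) for i in range(k + 1))

n, p = 100, 0.79  # 100 assemblies, 79% estimated yield

# Probability that the count falls within 79 +/- 10, as computed in the text:
coverage = binom_cdf(89, n, p) - binom_cdf(69, n, p)
print(round(coverage, 3))  # ≈ 0.985
```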
Confidence Intervals - Using Central Limit Theorem
Recall that: X = # successful assemblies in 100 trials
An estimate, p̂, of the probability of obtaining a successful assembly, p, is given by:

p̂ = X / 100

If we define:
Xi = 1 if trial i is successful, 0 if trial i is a failure
then
X = X1 + X2 + X3 + … + X100
Furthermore,

p̂ = (X1 + X2 + … + X100) / 100 = (1/100)X1 + (1/100)X2 + … + (1/100)X100
Note that Xi is the number of successes in 1 trial - a binomial random variable where p is the probability of success.
It follows that the mean and standard deviation of each Xi are given by:

μi = p
σi = √( p(1 − p) )
Applying the Central Limit Theorem tells us that p̂ (since it is the weighted sum of a large number of independent random variables) must be approximately normally distributed with mean and standard deviation given by:

μ = (1/100)μ1 + (1/100)μ2 + … + (1/100)μ100
  = (1/100)p + (1/100)p + … + (1/100)p = p

σ² = (1/100)²σ1² + (1/100)²σ2² + … + (1/100)²σ100²
   = (1/100)²p(1 − p) + (1/100)²p(1 − p) + … + (1/100)²p(1 − p)
   = 100 (1/100)² p(1 − p) = p(1 − p) / 100, so

σ = √( p(1 − p) / 100 )
Central Limit Theorem
For
Population Proportions
As the sample size, n, increases, the sampling distribution of p̂ approaches a normal distribution with mean p and standard deviation √( p(1 − p)/n ).
When the parameter p approaches 1/2, the binomial
distribution is symmetric and shaped much like the
normal distribution. When p moves either above or
below 1/2, the binomial distribution becomes more
heavily skewed away from the normal & hence the
sample size, n, necessary for the CLT to apply
becomes larger.
A commonly used rule of thumb is that np and n(1-p)
must both be larger than 5.
From our sample of 100 assemblies, 79 were good. This implies that our estimates for the mean and standard deviation of p̂ are given by:

μ̂ = 0.79
σ̂ = √( 0.79(1 − 0.79) / 100 ) = √(0.79 × 0.21) / 10 ≈ 0.040731

To reiterate, we are using the Central Limit Theorem to approximate the distribution of p̂ as a Normal distribution with mean 0.79 and standard deviation 0.040731.
What we mean by, for example, a 95% confidence interval is to find a number, r, satisfying:

P{0.79 − r ≤ p̂ ≤ 0.79 + r} = 0.95

Since the Normal distribution is symmetric about its mean (in this case 0.79), exactly half of the "leftover" probability (5% for a 95% confidence interval) must lie in each tail, that is, 2.5% in each tail. In other words,

P{p̂ ≤ 0.79 − r} = 0.025, and
P{p̂ ≤ 0.79 + r} = 0.975
To perform this calculation, use the NORMINV function from Excel:

0.79 + r = NORMINV(0.975, 0.79, 0.040731) ≈ 0.8698.

Solving for r, we get r ≈ 0.0798, so a 95% confidence interval for p is given by

p = 0.79 ± 0.0798
Note that the resulting confidence interval is almost identical to that obtained by using the BINOMDIST function directly - clearly an indication that our use of the Central Limit Theorem to approximate the distribution of p̂ with a Normal distribution is valid.
The following table summarizes calculations for various confidence levels using the Normal approximation.

Confidence Level   Confidence Interval
99.9%              0.79 ± 0.135
98.5%              0.79 ± 0.099
95%                0.79 ± 0.0798
67.4%              0.79 ± 0.04
In general, when sampling for the population proportion in this manner, our estimates for the mean and standard deviation will be given by:

μ̂ = p̂
σ̂ = √( p̂(1 − p̂) / n )
Election Polling Example:
1500 prospective voters are surveyed. 825 say they
will vote for candidate A and 675 say they will vote
for B. What is your estimate of the percentage of
voters who will vote for A? Construct a 95%
confidence interval.

p̂ = 825/1500 = 0.55

To construct the confidence interval, use the normal approximation.

σ̂ = √( 0.55(1 − 0.55) / 1500 ) ≈ 0.0128

NORMINV(0.975, 0.55, 0.0128) ≈ 0.575

The associated confidence interval is, therefore, 55% ± 2.5%. The news media would report this result by stating: "The poll has a margin of error of plus or minus 2.5 percentage points".
What would happen if the 55% estimate were based on a sample of size 750?

σ̂ = √( 0.55(1 − 0.55) / 750 ) ≈ 0.0182

NORMINV(0.975, 0.55, 0.0182) ≈ 0.586

The associated confidence interval is, therefore, 55% ± 3.6%.
Notation
The Standard Normal distribution has:
μ = 0, σ = 1.
Let z be a random variable with a standard normal distribution. We define Z_{1−α/2} to be the number satisfying:

P{ z ≤ Z_{1−α/2} } = 1 − α/2.

Example: If α = 0.05, then 1 − α/2 = 0.975, and the value of Z_{1−α/2} can be found using the Excel function NORMSINV:
NORMSINV(1 − α/2) = Z_{1−α/2}, or
NORMSINV(0.975) = 1.9599.
Confidence Intervals for a Population Proportion
Using the Standard Normal
From a sample of 500, the number who say they
prefer Coke to Pepsi is 275. Your estimate of the
population proportion who prefer Coke is 275/500 =
55%.
Since n is large, we can apply the CLT and construct a confidence interval for the population proportion, p:

p̂ ± Z_{1−α/2} √( p̂(1 − p̂) / n )

provides a (1−α)·100% confidence interval for the population proportion, p. For a 95% interval in this case, we would first determine that NORMSINV(0.975) = 1.96. We would then compute:

0.55 ± 1.96 √( 0.55(1 − 0.55) / 500 ) = 0.55 ± 0.0436.
Confidence Intervals on the Sample Mean
• 95% of the observations from a normal distribution fall within ±2 standard deviations from the mean.
• The average of a sample, x̄, is (from CLT) normally distributed with mean μ and standard deviation σ/√n.
It follows that the true mean will lie within ±2 standard deviations of the sample average 95% of the time. The associated confidence interval is:

( x̄ − 2σ/√n , x̄ + 2σ/√n )

Note that the standard deviation used here is σ/√n rather than σ.
Confidence Intervals (Again)
In general, if you have estimated the true mean by using the sample mean, x̄, as an estimate, the range

( x̄ − Z_{1−α/2} σ/√n , x̄ + Z_{1−α/2} σ/√n )

is expected to contain, or cover, the true mean 100(1−α)% of the time.
Example: The diameter of pistons is normally distributed with unknown mean and standard deviation of 0.01. You take a small sample of 5, measure their diameters, and compute a sample mean of 1.55. A 90% confidence interval would be given by:

( 1.55 − Z_{0.95} (0.01/√5) , 1.55 + Z_{0.95} (0.01/√5) ).

From Excel, we can compute
Z_{0.95} = NORMSINV(0.95) = 1.645,
and hence the interval is:
(1.543, 1.557).
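The same interval can be computed in Python; `NormalDist().inv_cdf(0.95)` is the analogue of NORMSINV(0.95):

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n = 1.55, 0.01, 5      # sample mean, known sigma, sample size
confidence = 0.90

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z_0.95 ≈ 1.645
half_width = z * sigma / sqrt(n)
lo, hi = x_bar - half_width, x_bar + half_width
print(round(lo, 3), round(hi, 3))  # ≈ 1.543 1.557
```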
Example : 1-tailed Test on a Single Mean
The yield of a new process is known to be normally distributed.
The current process has an average yield of 0.85 (85%) with a
standard deviation of 0.05. The new process is believed to have
the same deviation as the old one. To determine whether the yield
of the new process is higher than that of the old, you collect a
random sample of size 10 & compute a sample mean of 90%.
Let α = 0.01.
Solution:
1. Formulate Hypothesis:
H0: μ ≤ 0.85, H1: μ > 0.85
(Important Note: alternative hypothesis is associated with action.)
2. Compute Test Statistic:

z = (x̄ − μ0) / (σ/√n) = (0.9 − 0.85) / (0.05/√10) = 3.162
3. Determine Acceptance Region: Reject if
NORMSINV(1 − α) < z
⟹ NORMSINV(0.99) < 3.162
⟹ 2.326 < 3.162
Reject Null Hypothesis!
Our test statistic lies 3.162 standard deviations to the right of the mean of the sampling
distribution----clearly an indicator that the observed outcome is very unlikely if the null
hypothesis held.
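The three steps of this z-test can be sketched in Python:

```python
from math import sqrt
from statistics import NormalDist

# H0: mu <= 0.85 vs H1: mu > 0.85, with sigma known
x_bar, mu0, sigma, n, alpha = 0.90, 0.85, 0.05, 10, 0.01

z = (x_bar - mu0) / (sigma / sqrt(n))        # test statistic, ≈ 3.162
critical = NormalDist().inv_cdf(1 - alpha)   # NORMSINV(0.99) ≈ 2.326

reject = z > critical
print(round(z, 3), round(critical, 3), reject)  # 3.162 2.326 True
```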
Example: 2-Tailed Test on a Single Mean
An auto manufacturer has an old engine that produces an average
of 31.5 mpg. A new engine is believed to have the same standard
deviation in mpg, 6.6, as the old engine, but it is unknown whether
or not the new engine has the same average mpg. The sample
mean of a random sample of 100 turns out to be 29.8. Let α = 0.05.
Solution:
1. Formulate Hypothesis:
H0: μ = 31.5, H1: μ ≠ 31.5
2. Compute Test Statistic:

z = (x̄ − μ0) / (σ/√n) = (29.8 − 31.5) / (6.6/√100) = −2.576
3. Determine Acceptance Region: Fail to reject if
NORMSINV(α/2) ≤ z ≤ NORMSINV(1 − α/2)
⟹ NORMSINV(0.025) ≤ −2.576 ≤ NORMSINV(0.975)
⟹ −1.959 ≤ −2.576 ≤ 1.959
Since the inequality fails to hold, we
Reject Null Hypothesis!
Our test statistic lies 2.576 standard deviations to the left of the mean of the sampling
distribution. If the null hypothesis were true, we would expect to observe a test statistic
this low (or lower) only about 0.4998% of the time---less than 1/2 of 1 percent.
T-test
• Used when σ, the standard deviation of the underlying population, is unknown.
• Instead of forming the test statistic with σ, we substitute an estimate for σ, namely the sample deviation:

S = √( Σ_{i=1}^{n} (xi − x̄)² / (n − 1) ).

• The resulting test statistic is:

t = (x̄ − μ0) / (S/√n).
• The test statistic, t, has a t distribution with n−1 degrees of freedom. In comparison, the test statistic, z, formed using the population standard deviation, σ, has a normal distribution.
• The t distribution is shaped like the standard normal distribution (e.g., bell-shaped, but more spread out). Its mean is 0, and its variance (whenever degrees of freedom > 2) is df/(df−2). As n (and hence df) increases, the variance approaches 1 and the t distribution approaches the standard normal distribution.
• When n is large, the z test is sometimes substituted (as a close approximation) for the t test.
Assumptions: Either n is large enough for the central limit
theorem to hold or the underlying distribution is normal.
T-test & EXCEL
Suppose t has a t distribution with n−1 degrees of freedom. We define t_{1−α/2, n−1} to be the number satisfying:

P{ t ≤ t_{1−α/2, n−1} } = 1 − α/2.

That is, the area under the t distribution (probability) to the left of t_{1−α/2, n−1} is 1 − α/2.
Note: This is the exact analogue of Z_{1−α/2} for the standard normal distribution.
To calculate the value of t_{1−α/2, n−1} in EXCEL, you use the TINV function:

t_{1−α/2, n−1} = TINV(α, n − 1).
EXAMPLE: 1-Tailed t-Test
Average weekly earnings of full time employees is reported to be
$344. You believe this value is too low. A random sample of
1200 employees yields a sample mean of $361 and a sample
deviation of $110. Formulate the appropriate null hypothesis and
analyze the data:
1. Formulate Null Hypothesis:
H0: μ ≤ 344, H1: μ > 344
2. Compute Test Statistic:

t = (x̄ − μ0) / (S/√n) = (361 − 344) / (110/√1200) = 3.353612
3. Analysis: This is an extreme value for the test statistic, more than 3 standard deviations away from the mean (if the null hypothesis were true). For α = 0.001, we can calculate:

t_{1−α} = TINV(2α, 1199) = TINV(0.002, 1199) = 3.097084.

Since our test statistic is even larger, we would reject the null hypothesis at the α = 0.001 level (i.e., with 99.9% confidence).
The p-value associated with our test statistic is given by:
p-value = TDIST(3.353612, 1199, 1) = 0.000411.
Thus, if the null hypothesis were true, the probability of obtaining
a test statistic as large (or larger) than 3.353612 is only 0.0411%.
Average weekly earnings are almost certainly larger than $344.
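The t-statistic itself needs nothing beyond the standard library. In this sketch, the rejection threshold 3.097084 is the TINV(0.002, 1199) value quoted above (the t distribution's CDF is not in Python's standard library):

```python
from math import sqrt

def t_statistic(x_bar, mu0, s, n):
    """t = (x_bar - mu0) / (s / sqrt(n))"""
    return (x_bar - mu0) / (s / sqrt(n))

# Earnings example: x_bar = 361, mu0 = 344, s = 110, n = 1200
t = t_statistic(361, 344, 110, 1200)

# Critical value t_{1-alpha} = TINV(0.002, 1199) = 3.097084 (from the text)
reject = t > 3.097084
print(reject)  # True
```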
p-Values
Definition: The p-value is the smallest level of significance, α, for which a null hypothesis may be rejected using the obtained value of the test statistic.
• The p-value is the probability of obtaining a value of the test statistic as extreme, or more extreme than, the actual test statistic, when the null hypothesis is true.
Example: Your z-statistic in a z-test is 3.162. To calculate the p-value, use the EXCEL function NORMSDIST. NORMSDIST(z) is the probability of obtaining a test statistic value less than or equal to z. To be more extreme, the test statistic would have to be larger than 3.162. Thus,
p-value = 1 − NORMSDIST(3.162) = 1 − 0.999216 = 0.000784
Example: Your z-statistic in a z-test is -0.4. To be more extreme,
the test statistic would have to be less than -0.4. Thus,
p-value = NORMSDIST(-0.4) = 0.3446
Example: Your t-statistic in a t test with 15 degrees of freedom is
4.56. To calculate the p-value use the EXCEL function TDIST.
TDIST(t,n-1,1) = P{test statistic value > t | null hypothesis true}.
(Note the different direction of the inequality!) Hence
p-value = TDIST(4.56,15,1) = 0.000188
Rules of Thumb for p-values

p-value                interpretation
< 0.01                 very significant
between 0.01 & 0.05    significant
between 0.05 & 0.1     marginally significant
> 0.1                  not significant
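The NORMSDIST-based p-value examples translate directly to Python (`NormalDist().cdf` is the NORMSDIST analogue; the t distribution has no standard-library equivalent, so only the z examples are shown):

```python
from statistics import NormalDist

phi = NormalDist().cdf   # analogue of NORMSDIST

# Upper-tailed z-test with z = 3.162: more extreme means larger
p_upper = 1 - phi(3.162)
print(round(p_upper, 6))  # ≈ 0.000784

# Lower-tailed z-test with z = -0.4: more extreme means smaller
p_lower = phi(-0.4)
print(round(p_lower, 4))  # ≈ 0.3446
```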
Chi-Square Tests
Like the t-distribution, the chi-square distribution is defined by its number of degrees of freedom. A chi-square random variable with k degrees of freedom is normally denoted by the symbol χ²_k, and is defined by the equation:

χ²_k = Σ_{i=1}^{k} zi²,

that is, the sum of the squares of k standard normal random variables. Since squares are always non-negative, so is their sum, and hence a chi-square random variable can only take on non-negative values. Illustrations of the PDF for 1 and 5 degrees of freedom are shown below.
[Figures: Probability density function for the chi-square distribution with 1 and 5 degrees of freedom. For 1 degree of freedom, the critical value at p = 0.050 is 3.841; for 5 degrees of freedom, it is 11.070.]
Values for the chi-square distribution can be referenced using the EXCEL functions CHIDIST and CHIINV.
• CHIDIST(x, k) gives the probability that a chi-square random variable with k degrees of freedom attains a value greater than or equal to x. In other words, the area under the PDF to the right of x. In the pictures above, this is reported as the p-value.
• CHIINV(p, k) gives the inverse, or critical value. That is, if p = CHIDIST(x, k), then CHIINV(p, k) = x. In the pictures above, this is reported as the chi-sq critical value.
Examples:
CHIDIST(5,5) = 0.41588
CHIDIST(25,5) = 0.00139
CHIINV(0.41588, 5) = 5
CHIINV(0.00139, 5) = 25
Values can also be referenced using a table of chi-square values.
• For example, to find the critical value for a chi-square with 10 degrees of freedom at the α = 0.05 significance level, use row 10 and the α = 0.05 column of the attached table (giving a value of 18.31).
• Alternatively, using EXCEL, one could compute CHIINV(0.05, 10) = 18.307.
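For 1 degree of freedom, CHIDIST can be reproduced from the definition above: χ²_1 is the square of a standard normal, so P{χ²_1 ≥ x} = P{|z| ≥ √x}. A sketch (the helper name `chidist_1df` is ours):

```python
from math import sqrt
from statistics import NormalDist

def chidist_1df(x):
    """CHIDIST(x, 1): chi-square with 1 df is z^2, so P{chi2 >= x} = P{|z| >= sqrt(x)}."""
    return 2 * (1 - NormalDist().cdf(sqrt(x)))

# Reproduces the critical value shown in the df = 1 figure:
print(round(chidist_1df(3.841), 3))  # ≈ 0.05
```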
Test for Population Variance
Sometimes it is of interest to draw inferences about the population variance. The distribution used is the chi-square distribution with n−1 degrees of freedom (where n = sample size), and the test statistic is given by:

χ² = (n − 1)s² / σ0²,

where s² is the sample variance and the denominator, σ0², is the value of the variance stated in the null hypothesis.
Example: Heart Valves.
Without any sorting, the clearance was normally distributed with mean of 0.005 and standard deviation of 0.000283 (which implies a variance of 8 × 10⁻⁸).
One key indicator of process improvement is whether or not process variability has been reduced. In this case, we would look to see if the variance of the clearance dimension has been reduced by sorting.
The null hypothesis in this case is that the variance has not been reduced:
H0: σ² ≥ 8 × 10⁻⁸
A random sample of size 50 (after sorting by batches of 50) yields a sample variance of:
2.308 × 10⁻⁹
Computing the test statistic yields:

χ² = (n − 1)s² / σ0² = (49 × 2.308 × 10⁻⁹) / (8 × 10⁻⁸) = 1.414
The critical value is found (for an alpha of 0.001) by:
CHIINV(0.999, 49) = 23.98.
We would reject the null hypothesis for any value of
the test statistic less than the critical value, 23.98.
Example: A machine makes small metal plates used in batteries.
The plate diameter is a random variable with a mean of 5 mm. As
long as the variance is at most 1.0, the production process is under
control & the plates are acceptable. Otherwise, the machine must
be repaired. The QC engineer wants, therefore, to test the
following hypothesis:
H0: σ² ≤ 1.0
With a random sample of 31 plates, the sample variance is 1.62.
Solution: Computing the test statistic, we see that:

χ² = (n − 1)s² / σ0² = (30 × 1.62) / 1.00 = 48.6
• For α = 0.05, the critical value is found by:
CHIINV(0.05, 30) = 43.77.
Since our test statistic lies to the right of the critical value, we would reject the null hypothesis.
• The p-value is given by:
CHIDIST(48.6, 30) = 0.017257.
Thus, we would reject the null hypothesis for any value of α > 0.017257, and fail to reject the null hypothesis for smaller values.
Important Note: The use of the chi-square test on
variance requires that the underlying population be
normally distributed.
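The variance test is easy to script; in this sketch, the critical value 43.77 is the CHIINV(0.05, 30) figure from the example above (Python's standard library has no chi-square inverse):

```python
def variance_test_statistic(n, s2, sigma0_sq):
    """Chi-square statistic for a test on the population variance."""
    return (n - 1) * s2 / sigma0_sq

# Battery-plate example: n = 31, sample variance 1.62, H0: variance <= 1.0
stat = variance_test_statistic(31, 1.62, 1.0)
critical = 43.77            # CHIINV(0.05, 30), taken from the text

print(round(stat, 1), stat > critical)  # 48.6 True
```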
Chi-Square Test for Independence
It is often useful to have a statistical test that helps us to determine
whether or not two classification criteria, such as age and job
performance are independent of each other. The technique uses
contingency tables, which are tables with cells corresponding to
cross-classifications of attributes.
In marketing research, one place where the chi-square test for
independence is frequently used, such tables are called cross-tabs.
You will recall that we have previously used the pivot-table facility
within EXCEL to produce contingency or cross-tabs tables from
more unwieldy tabulations of raw data.
Example: A random sample of 100 firms is taken. For each firm,
we record whether the company made or lost money in its most
recent fiscal year, and whether the firm is a service or non-service
company. A 2 X 2 contingency table summarizes the data.
              Industry Type
              Service   Non-service   Total
Profit        42        18            60
Loss          6         34            40
Total         48        52            100
Using the information in the table, we want to investigate whether
the two events:
 the company made a profit in its most recent fiscal year,
and
 the company is in the service sector
are independent of each other.
Before stating the test, we need to develop a little bit of
notation:
r = number of rows in the table
c = number of columns in the table
Oij = observed count of elements in cell (i, j)
Eij = expected count of elements in cell (i, j) assuming that
the two variables are independent
Ri = total count for row i
Cj = total count for column j
The expected number of items in a cell is equal to the sample size, n, times the probability of the event signified by the particular cell. In the context of a contingency table, the probability associated with cell (i, j) is the joint probability of occurrence of both events. That is,

E_ij = n × P(i ∩ j).

From the definition of independence, it follows that

E_ij = n × P(i) × P(j).

From the row and column totals, we can estimate:

P(i) = R_i / n,  P(j) = C_j / n.

Substituting in these estimates, we get: the expected count in cell (i, j) is

E_ij = R_i C_j / n
Example: Using the data from the contingency table of the previous example, we can calculate:

E11 = R1C1/n = (60 × 48)/100 = 28.8,
E12 = R1C2/n = (60 × 52)/100 = 31.2,
E21 = R2C1/n = (40 × 48)/100 = 19.2,
E22 = R2C2/n = (40 × 52)/100 = 20.8.
The resulting table of expected counts is shown below:

         Industry Type
         Service   Non-service
Profit   28.8      31.2
Loss     19.2      20.8
The chi-square test statistic for independence is given by:

χ² = Σ_{i=1}^{r} Σ_{j=1}^{c} (O_ij − E_ij)² / E_ij

with degrees of freedom: (r − 1)(c − 1).
Note that the chi-square test statistic is always non-negative. If the
observed counts exactly equal the expected (under the hypothesis
of independence) counts, then the value of the test statistic would
be zero. The greater the difference between observed and expected
counts, the larger the test statistic becomes.
We would reject the null hypothesis (of independence) only when
we obtain a sufficiently large value of the test statistic.
We can compute the chi-square test statistic for our example as follows:

χ² = (42 − 28.8)²/28.8 + (18 − 31.2)²/31.2 + (6 − 19.2)²/19.2 + (34 − 20.8)²/20.8 ≈ 29.09
The number of degrees of freedom is (2 − 1)(2 − 1) = 1. Using the CHIINV function to compute a critical value for α = 0.01, we see that CHIINV(0.01, 1) = 6.63. The rejection region (for a confidence level of 99%) is any test statistic larger than 6.63. Since our computed value of the statistic is much larger, we would reject the null hypothesis. Alternatively, we could compute the p-value by using the CHIDIST function:

CHIDIST(29.09, 1) = 6.91 × 10⁻⁸

from which we see that we would reject the null hypothesis for any reasonable value of α.
Using the CHITEST function:
An easier way to do this is to use the EXCEL CHITEST function. To do this, you need to have two tables in your spreadsheet:
• one containing the original contingency table data of actual observed counts, and
• one containing the expected counts.
The following spreadsheet information contains:
• the original counts in range A1:B2,
• the expected counts in range D1:E2,
• the CHITEST function formula in cell F1, and
• the value returned by the CHITEST function formula in cell F2.
     A     B     D      E      F
1    42    18    28.8   31.2   =CHITEST(A1:B2,D1:E2)
2    6     34    19.2   20.8   6.92162612220623E-08

Note that the value returned by the CHITEST function formula is the p-value.
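The whole independence test, expected counts included, fits in a few lines of Python (the function name `chi_square_independence` is ours, for illustration):

```python
def chi_square_independence(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # E_ij = R_i * C_j / n
            stat += (o - e) ** 2 / e
    return stat

table = [[42, 18],   # Profit:  Service, Non-service
         [6, 34]]    # Loss:    Service, Non-service

print(round(chi_square_independence(table), 2))  # ≈ 29.09
```

Comparing the result against the CHIINV(0.01, 1) = 6.63 critical value from the text reproduces the rejection decision above.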