Download sampling - Lyle School of Engineering

Systems Engineering Program
Department of Engineering Management, Information and Systems
EMIS 7370/5370 STAT 5340 :
PROBABILITY AND STATISTICS FOR SCIENTISTS AND ENGINEERS
Sampling and Sampling Distributions
Dr. Jerrell T. Stracener, SAE Fellow
Leadership in Engineering
Population vs. Sample
Population
the total of all possible values (measurements,
counts, etc.) of a particular characteristic for a
specific group of objects.
Sample
a part of a population selected according to some
rule or plan.
Why sample?
- Population does not exist
- Sampling and testing is destructive
Sampling
Characteristics that distinguish one type of sample
from another:
• the manner in which the sample was obtained
• the purpose for which the sample was obtained
Types of Samples
• Simple Random Sample
The sample X1, X2, ... ,Xn is a random sample if
X1, X2, ... , Xn are independent and identically
distributed random variables.
Remark: Each value in the population has an
equal and independent chance of being included
in the sample.
• Stratified Random Sample
The population is first subdivided into
sub-populations, or strata, and a simple random
sample is drawn from each stratum.
Types of Samples (continued)
Censored Samples
• Type I Censoring - Sampling is terminated at a
fixed time, t0. The sample consists of k times to
failure plus the information that n-k items
survived the fixed time of truncation.
• Type II Censoring - Sampling is terminated
upon the kth failure. The sample consists of k
times to failure, plus the information that n-k items
survived the random time of truncation, tk.
• Progressive Censoring - Sampling is reduced in
stages.
Types of Samples (continued)
• Systematic Random Sample
The N items in the population are arranged in
some order.
Select an item at random from the first K = N/n
items, where n is the sample size.
Select every Kth item thereafter.
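The systematic sampling rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `systematic_sample` and the 100-item population are assumptions, and K is taken as the integer part of N/n.

```python
import random

def systematic_sample(items, n, rng=random):
    """Draw a systematic sample of n items from an ordered population."""
    k = len(items) // n           # sampling interval K = N/n (integer part)
    start = rng.randrange(k)      # random start among the first K items
    return items[start::k][:n]    # every Kth item thereafter

population = list(range(1, 101))  # hypothetical ordered population, N = 100
sample = systematic_sample(population, n=10, rng=random.Random(1))
print(sample)
```

With N = 100 and n = 10 the interval is K = 10, so the sample is one randomly chosen item from the first ten followed by every tenth item after it.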
Sampling Monte Carlo Simulation
Uniform Probability Integral Transformation
For any random variable Y with probability density
function f(y), the variable
$$F(y) = \int_{-\infty}^{y} f(x)\,dx$$
is uniformly distributed over (0, 1), or F(y) has the
probability density function
$$g_F(y) = 1 \quad \text{for } 0 \le y \le 1$$
Remark: for any continuous random variable Y, the
cumulative probability distribution function evaluated
at Y, F(Y), is uniformly distributed over the interval (0, 1).
Generating Random Numbers
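[Figure: the density f(y) plotted against y and, below it, the cumulative distribution F(y) (vertical axis, 0 to 1.0) plotted against y, marking a value ri on the F(y) axis and the corresponding value yi on the y axis.]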
Generating values of a random variable using the
probability integral transformation to generate a
random value y from a given probability density
function f(y):
1. Generate a random value rU from a uniform
distribution over (0, 1).
2. Set rU = F(y)
3. Solve the resulting expression for y.
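The three steps above can be sketched in Python for cases where F(y) = r has no convenient closed-form solution. This is a minimal sketch under stated assumptions: the helper name `inverse_transform`, the use of bisection for step 3, and the example density f(y) = 2y on (0, 1) are illustrative choices, not taken from the slides.

```python
import random

def inverse_transform(cdf, lo, hi, rng=random, tol=1e-10):
    """Return one value y with F(y) = r, where r is drawn from U(0, 1)."""
    r = rng.random()              # step 1: uniform random value
    while hi - lo > tol:          # steps 2-3: solve F(y) = r by bisection
        mid = (lo + hi) / 2
        if cdf(mid) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical example: f(y) = 2y on (0, 1), so F(y) = y**2.
y = inverse_transform(lambda t: t * t, 0.0, 1.0, rng=random.Random(0))
print(y)
```

Bisection works here because a cumulative distribution function is non-decreasing; when F can be inverted algebraically, solving for y directly is simpler and faster.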
Generating Random Numbers with Excel
From the Tools menu, look for Data Analysis.
If it is not there, you must install it (it is provided by the
Analysis ToolPak add-in).
Once you select Data Analysis, a dialog window will
appear. Scroll down to “Random Number Generation”
and select it, then press “OK”.
Choose which distribution you would like. Use uniform for
an exponential or Weibull distribution, or normal for a
normal or lognormal distribution.
Uniform Distribution, U(0, 1).
Select “Uniform” under the “Distribution” menu.
Type in “1” for number of variables and 10 for number of
random numbers. Then press OK. 10 random numbers
from the uniform distribution will now appear on a new worksheet.
Normal Distribution, N(μ, σ).
Select “Normal” under the “Distribution” menu.
Type in “1” for number of variables and 10 for number of
random numbers. Enter the values for the mean (μ) and
standard deviation (σ), then press OK. 10 random numbers
from the normal distribution will now appear on a new worksheet.
Generating Random Values from an Exponential
Distribution E(θ) with Excel
First generate n random values, r1, r2, …, rn, from
U(0, 1).
Select “Uniform” under the “Distribution” menu.
Type in “1” for number of variables and 10 for number of
random numbers. Then press OK. 10 random numbers
from the uniform distribution will now appear on a new worksheet.
Select a θ that you would like to use; we will use θ = 5.
In cell B1, type in the equation $x_i = -\theta \ln(1 - r_i)$, filling in θ as 5 and ri as cell
A1 (=-5*LN(1-A1)). Now, with that cell selected, place the cursor over the
bottom right-hand corner of the cell. A cross will appear; drag this
cross down to B10. This will copy the equation to the cells below.
Now we have n random values from the exponential distribution with
parameter θ = 5 in cells B1 - B10.
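The same exponential transformation can be sketched outside Excel. The Python below is an illustrative equivalent of the worksheet formula =-5*LN(1-A1); the function name `exponential_values` and the fixed seed are assumptions made for the example.

```python
import math
import random

def exponential_values(theta, n, rng=random):
    """Return n values x_i = -theta * ln(1 - r_i), with r_i from U(0, 1)."""
    return [-theta * math.log(1.0 - rng.random()) for _ in range(n)]

values = exponential_values(theta=5.0, n=10, rng=random.Random(0))
print(values)
```

Because `random()` returns values in [0, 1), the argument of the logarithm is never zero.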
Generating Random Values from a Weibull
Distribution W(β, θ) with Excel
First generate n random values, r1, r2, …, rn, from U(0, 1).
Select “Uniform” under the “Distribution” menu.
Type in “1” for number of variables and 10 for number of
random numbers. Then press OK. 10 random numbers from the
uniform distribution will now appear on a new worksheet.
Select a β and θ that you would like to use; we will use β = 20, θ =
100.
Type in the equation $x_i = \theta[-\ln(1 - r_i)]^{1/\beta}$, filling in β as 20, θ as 100,
and ri as cell A1 (=100*(-LN(1-A1))^(1/20)). Now copy that equation to
the cells below. Now we have n random values from the Weibull
distribution with parameters β = 20 and θ = 100 in cells B1 - B10.
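A Python counterpart of the worksheet formula =100*(-LN(1-A1))^(1/20) might look like the following. It is a sketch only; `weibull_values` is a hypothetical helper name, and β is treated as the shape parameter and θ as the scale parameter, matching the formula above.

```python
import math
import random

def weibull_values(beta, theta, n, rng=random):
    """Return n values x_i = theta * (-ln(1 - r_i)) ** (1 / beta)."""
    return [
        theta * (-math.log(1.0 - rng.random())) ** (1.0 / beta)
        for _ in range(n)
    ]

values = weibull_values(beta=20.0, theta=100.0, n=10, rng=random.Random(0))
print(values)
```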
Generating Random Values from a Lognormal
Distribution LN(μ, σ) with Excel
First generate n random values, r1, r2, …, rn, from N(0, 1).
Select “Normal” under the “Distribution” menu.
Type in “1” for number of variables and 10 for number of
random numbers. Enter 0 for the mean and 1 for the standard
deviation, then press OK. 10 random numbers from the standard
normal distribution will now appear on a new worksheet.
Select a μ and σ that you would like to use; we will use μ = 2, σ = 1.
Type in the equation $x_i = e^{\mu + \sigma r_i}$, filling in μ as 2, σ as 1, and ri
as cell A1 (=EXP(2+A1*1)). Now copy that equation to the cells
below. Now we have n random values from the lognormal distribution in cells B1 - B10.
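The lognormal step can be sketched the same way in Python. The helper name `lognormal_values` is an assumption, and the standard normal values ri are drawn with `gauss(0, 1)` rather than with the Excel tool.

```python
import math
import random

def lognormal_values(mu, sigma, n, rng=random):
    """Return n values x_i = exp(mu + sigma * r_i), with r_i from N(0, 1)."""
    return [math.exp(mu + sigma * rng.gauss(0.0, 1.0)) for _ in range(n)]

values = lognormal_values(mu=2.0, sigma=1.0, n=10, rng=random.Random(0))
print(values)
```

Here μ and σ are the mean and standard deviation of ln(x), not of x itself.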
Flow Chart of Monte Carlo Simulation Method
Input 1: Statistical distribution for each component variable.
Input 2: Relationship between component variables and system
performance.
1. Select a random value from each of these distributions.
2. Calculate the value of system performance for a system
composed of components with the values obtained in the previous step.
3. Repeat steps 1 and 2 n times.
Output: Summarize and plot the resulting values of system performance.
This provides an approximation of the
distribution of system performance.
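The flow chart can be sketched as a short Python loop. Everything specific in this example is an assumption made for illustration: the function name `monte_carlo`, the two exponential component lives (θ = 5 and θ = 10), and the series-system rule that system life is the shorter of the two.

```python
import math
import random
import statistics

def monte_carlo(samplers, performance, n, rng):
    """Repeat n times: draw one value per component, then evaluate the system."""
    results = []
    for _ in range(n):
        values = [draw(rng) for draw in samplers]  # one value per input distribution
        results.append(performance(values))        # system performance for this trial
    return results

# Hypothetical inputs: two components with exponential lives (theta = 5 and 10)
# in a series system, whose life is the shorter of the two component lives.
samplers = [
    lambda rng: -5.0 * math.log(1.0 - rng.random()),
    lambda rng: -10.0 * math.log(1.0 - rng.random()),
]
lives = monte_carlo(samplers, min, n=10000, rng=random.Random(0))
print(statistics.mean(lives), statistics.stdev(lives))
```

For this particular choice the exact mean system life is 1/(1/5 + 1/10) = 10/3, which gives a check on the simulated average.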
Sample Size and Error Bands
Because Monte Carlo simulation involves randomly
selected values, the results are subject to statistical
fluctuations.
• Any estimate will not be exact but will have an
associated error band.
• The larger the number of trials in the simulation,
the more precise the final results.
• We can obtain as small an error as is desired by
conducting sufficient trials.
• In practice, the allowable error is generally specified,
and this information is used to determine the required number of trials.
Example
If X ~ B(n, p) and the desired confidence level is 95%,
then 1 - α = 0.95, α = 0.05, and $Z_{1-\alpha/2} = 1.96$.
If p′ = 0.2 and the allowable error is 0.05, then an estimate of the
required sample size is
$$n = \frac{(0.2)(0.8)}{(0.05)^2}\,(1.96)^2 \approx 246$$
Drawbacks of the Monte Carlo Simulation
• there is frequently no way of determining
whether any of the variables are dominant or
more important than others without making repeated
simulations
• if a change is made in one variable, the entire
simulation must be redone
• the method may require developing a
complex computer program
• if a large number of trials are required, a great
deal of computer time may be needed to obtain
the necessary results
Example
If the probability density function of X is
$$f(x) = \begin{cases} 2(1 - x) & \text{for } 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}$$
Find
(a) F(x)
(b) Mean
(c) Standard deviation
(d) The value of x for which P(X > x) = 0.05
(e) If 5 values of x are randomly selected, find the
probability that at least 2 of them will exceed 0.6
(f) Redo parts (a) through (e) using Monte Carlo simulation
Example - Solution
First, plot f(x):
[Figure: plot of f(x) (vertical axis, 0 to 2) against x (horizontal axis, 0 to 1).]
Example - Solution
(a) The (cumulative) probability distribution function of
X for 0 ≤ x ≤ 1 is
$$
\begin{aligned}
F(x) &= P(X \le x) = \int_{-\infty}^{x} f(y)\,dy \\
&= \int_0^x 2(1 - y)\,dy \\
&= 2\int_0^x dy - 2\int_0^x y\,dy \\
&= 2y\,\Big|_0^x - 2\,\frac{y^2}{2}\,\Big|_0^x \\
&= 2x - x^2
\end{aligned}
$$
Example - Solution
so that
$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ 2x - x^2 & \text{for } 0 \le x \le 1 \\ 1 & \text{for } x > 1 \end{cases}$$
Example - Solution
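(b) The mean of X is
$$
\begin{aligned}
\mu = E(X) &= \int_0^1 x \cdot 2(1 - x)\,dx \\
&= 2\int_0^1 [x - x^2]\,dx \\
&= 2\left[\frac{x^2}{2} - \frac{x^3}{3}\right]_0^1 \\
&= 2\left[\frac{1}{2} - \frac{1}{3}\right] \\
&= \frac{1}{3}
\end{aligned}
$$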
Example - Solution
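(c) The variance of X is
$$
\begin{aligned}
\sigma^2 = \mathrm{Var}(X) &= E(X^2) - \mu^2 = \int_0^1 x^2 f(x)\,dx - \left(\frac{1}{3}\right)^2 \\
&= \int_0^1 x^2 \cdot 2(1 - x)\,dx - \frac{1}{9} \\
&= 2\left[\frac{x^3}{3} - \frac{x^4}{4}\right]_0^1 - \frac{1}{9} \\
&= 2\left[\frac{1}{3} - \frac{1}{4}\right] - \frac{1}{9} = 2 \cdot \frac{1}{12} - \frac{1}{9} \\
&= \frac{1}{18}
\end{aligned}
$$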
Example - Solution
The standard deviation is
$$\sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{\frac{1}{18}} = \frac{1}{\sqrt{18}} \approx 0.236$$
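The mean, variance, and standard deviation can be double-checked with exact fractions. This is a small verification sketch, not part of the original solution; it relies on the moment formula E(X^k) = 2[1/(k+1) - 1/(k+2)], which follows from integrating x^k · 2(1 - x) over [0, 1].

```python
from fractions import Fraction
import math

def moment(k):
    """E(X^k) for f(x) = 2(1 - x) on [0, 1]: 2 * (1/(k+1) - 1/(k+2))."""
    return 2 * (Fraction(1, k + 1) - Fraction(1, k + 2))

mean = moment(1)
variance = moment(2) - mean ** 2
std_dev = math.sqrt(variance)
print(mean, variance, round(std_dev, 3))  # 1/3 1/18 0.236
```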
Example - Solution
(d) The value of x such that P(X > x) = 0.05 can be
determined by a couple of different approaches.
x can be obtained by solving the following equation for x,
$$P(X > x) = \int_x^1 f(y)\,dy = 0.05$$
or by solving F(x) = 0.95 for x,
$$F(x) = 2x - x^2 = 0.95$$
Example - Solution
Here
$$x^2 - 2x + 0.95 = 0$$
and its roots are
$$x \approx 1.2236 \quad \text{or} \quad x \approx 0.7764$$
1.2236 is outside of our range, so x ≈ 0.7764 is our
answer. If we check with our plot of f(x), this seems
reasonable.
[Figure: plot of f(x) against x, with the upper-tail area of 0.05 marked to the right of x = 0.7764.]
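The root selection can be reproduced with the quadratic formula. The sketch below is illustrative; `upper_tail_point` is a hypothetical helper that solves F(x) = 1 - tail for this particular F(x) = 2x - x² and keeps the root that lies in [0, 1].

```python
import math

def upper_tail_point(tail):
    """Solve F(x) = 2x - x^2 = 1 - tail and keep the root inside [0, 1]."""
    c = 1.0 - tail                       # the equation is x^2 - 2x + c = 0
    disc = math.sqrt(4.0 - 4.0 * c)      # square root of the discriminant
    roots = ((2.0 - disc) / 2.0, (2.0 + disc) / 2.0)
    return [x for x in roots if 0.0 <= x <= 1.0][0]

x = upper_tail_point(0.05)
print(round(x, 4))  # 0.7764
```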
Example - Solution
(e) Let Y = number of values that exceed 0.6, for y =
0, 1, 2, 3, 4, 5.
$$
\begin{aligned}
P(X > 0.6) &= 1 - P(X \le 0.6) \\
&= 1 - F(0.6) \\
&= 1 - [2(0.6) - 0.6^2] \\
&= 1 - 0.84 \\
&= 0.16
\end{aligned}
$$
Now Y ~ B(5, 0.16)
Example - Solution
so that
$$
\begin{aligned}
P(Y \ge 2) &= 1 - \sum_{y=0}^{1} b(y; 5, 0.16) \\
&= 1 - \sum_{y=0}^{1} \binom{5}{y} (0.16)^y (1 - 0.16)^{5-y} \\
&= 1 - 0.4182 - 0.3983 \\
&= 0.1835
\end{aligned}
$$
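The binomial tail probability can be verified directly. This is a short check using the standard library's `math.comb`; the function name `prob_at_least` is an assumption for the example.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(Y >= k) for Y ~ B(n, p), as one minus the sum of the terms below k."""
    below = sum(comb(n, y) * p ** y * (1 - p) ** (n - y) for y in range(k))
    return 1 - below

result = prob_at_least(2, 5, 0.16)
print(round(result, 4))  # 0.1835
```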
Example - Solution
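(f) Generate a random sample of size n, say 1,000,
from f(x) using Monte Carlo simulation as
follows:
Since
$$F(x) = 2x - x^2 \quad \text{for } 0 \le x \le 1,$$
generate $r_i$ from U(0, 1) and solve
$$2x_i - x_i^2 = r_i \quad \text{for } i = 1, \dots, 1000$$
for $x_i$. The root that lies in [0, 1] is $x_i = 1 - \sqrt{1 - r_i}$.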
Example - Solution
Then estimate F(x), μ, σ and P(Y ≥ 2) as follows:
$$\hat{F}(x) = \sum_{k} \frac{f_k}{1000} \quad \text{for } 0 \le x \le 1$$
where the sum runs over those of the 10 intervals that lie at or below x.

Interval   Frequency, fi   fi/1000   Cumulative fi/1000
0-0.1      196             0.196     0.196
0.1-0.2    170             0.170     0.366
0.2-0.3    136             0.136     0.502
0.3-0.4    119             0.119     0.621
0.4-0.5    103             0.103     0.724
0.5-0.6    96              0.096     0.820
0.6-0.7    78              0.078     0.898
0.7-0.8    56              0.056     0.954
0.8-0.9    35              0.035     0.989
0.9-1.0    11              0.011     1.000
0.0-1.0    1000            1

[Figure: histogram of relative frequency (vertical axis, 0 to 0.2) by interval of x.]
Example - Solution
[Figure: the estimated cumulative distribution F̂(x) (vertical axis, 0 to 1) by interval of x, plotted together with F(x).]
Example - Solution
$$\hat{\mu} = \frac{1}{1000}\sum_{i=1}^{1000} x_i = \frac{1}{1000}(340.79) = 0.34079$$
Compare this to μ = 1/3.
Example - Solution
$$\hat{\sigma} = S\sqrt{\frac{n-1}{n}}$$
where
$$S^2 = \frac{n\sum x_i^2 - \left(\sum x_i\right)^2}{n(n-1)} = \frac{1000(175.93) - 116139.34}{1000(999)} \approx 0.0599$$
and
$$S = \sqrt{S^2} = \sqrt{0.0599} \approx 0.2446$$
Example - Solution
$$\hat{\sigma} = S\sqrt{\frac{n-1}{n}} = 0.2446\sqrt{\frac{999}{1000}} \approx 0.2445$$
Compare this to σ = 0.236.
Example - Solution
$$\hat{p} = \hat{P}(X > 0.6) = \frac{\text{no. of values of } x > 0.6}{\text{total number of values}} = \frac{180}{1000} = 0.18$$
Compare this to p = 0.16.
Example - Solution
$$\hat{P}(Y \ge 2) = \frac{\text{no. of groups of 5 that have} \ge 2 \text{ values of } x > 0.6}{\text{total number of groups}} = \frac{29 + 11 + 0 + 0}{200} = 0.20$$
Compare this to P(Y ≥ 2) = 0.1835.
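The Monte Carlo estimates in part (f) can be reproduced with a short Python script. This is a sketch under stated assumptions: it uses Python's own random number generator with an arbitrary seed rather than the Excel data from the slides, so the estimates will differ slightly from 0.34079, 0.2445, 0.18, and 0.20.

```python
import math
import random
import statistics

rng = random.Random(0)                       # arbitrary seed so the run is repeatable
xs = [1.0 - math.sqrt(1.0 - rng.random()) for _ in range(1000)]  # solves 2x - x^2 = r

mu_hat = statistics.mean(xs)
sigma_hat = statistics.pstdev(xs)            # equals S * sqrt((n - 1) / n)
p_hat = sum(x > 0.6 for x in xs) / len(xs)   # estimate of P(X > 0.6)

groups = [xs[i:i + 5] for i in range(0, len(xs), 5)]      # 200 groups of 5
hits = sum(sum(x > 0.6 for x in g) >= 2 for g in groups)  # groups with >= 2 above 0.6
p_y_hat = hits / len(groups)                 # estimate of P(Y >= 2)

print(mu_hat, sigma_hat, p_hat, p_y_hat)
```

The estimate of P(Y ≥ 2) rests on only 200 groups, so it is noticeably noisier than the other three.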
Example - Solution - Our Data
i    ri        xi        xi > 0.6
1    0.38200   0.21387   0
2    0.10068   0.05168   0
3    0.59648   0.36477   0
4    0.89911   0.68236   1
5    0.88461   0.66031   1
6    0.95846   0.79620   1
7    0.01450   0.00727   0
8    0.40742   0.23021   0
9    0.86325   0.63020   1
10   0.13858   0.07188   0
11   0.24503   0.13111   0
12   0.04547   0.02300   0
13   0.03238   0.01632   0
14   0.16413   0.08574   0
15   0.21961   0.11660   0
16   0.01709   0.00858   0
17   0.28504   0.15445   0
18   0.34309   0.18950   0
19   0.55364   0.33190   0
20   0.35737   0.19836   0
21   0.37184   0.20743   0
22   0.35560   0.19726   0
23   0.91031   0.70051   1
24   0.46602   0.26926   0
25   0.42616   0.24248   0
26   0.30390   0.16568   0
27   0.97571   0.84414   1
28   0.80667   0.56030   0

Number in each group of 5 with xi > 0.6 (groups in order): 2, 2, 0, 0, 1

Remember that we used 1,000 data points; only the first
rows are shown here. In the original slides, the full data set can be
opened by double-clicking the embedded Excel sheet.
Sampling Distributions
Sampling Distribution of X̄ with known σ
If X1, X2, ..., Xn is a random sample of size n
from a normal distribution with mean μ and
known standard deviation σ,
and if
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
then
$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
and
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
Sampling Distributions: Example
The dollar amount per transaction, X, in the
Sporting Goods Department of a store has a normal
distribution with mean $75 and standard deviation
of $20. What is the probability that a random sample of 9
sales transactions will have an average over $85?
Sampling Distributions: Example - Solution
If X ~ N(75, 20), then
$$\bar{X} \sim N\left(75, \frac{20}{\sqrt{9}}\right)$$
and
$$P(\bar{X} > 85) = P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > \frac{85 - 75}{20/\sqrt{9}}\right) = P(Z > 1.5) = 0.0668$$
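The standardization and the tail probability can be checked numerically. The sketch below computes the standard normal upper tail from the complementary error function in Python's standard library; the helper name `upper_tail` is an assumption.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

mu, sigma, n = 75.0, 20.0, 9
z = (85.0 - mu) / (sigma / math.sqrt(n))   # standardize the sample mean
p = upper_tail(z)
print(round(z, 2), round(p, 4))  # 1.5 0.0668
```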
Central Limit Theorem
If X̄ is the mean of a random sample of size n,
X1, X2, …, Xn, from a population with mean μ and
finite standard deviation σ, then as n → ∞ the
limiting distribution of
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
is the standard normal distribution.
Central Limit Theorem
Remark: The Central Limit Theorem provides the
basis for approximating the distribution of X̄ with
a normal distribution with mean μ and standard
deviation $\sigma/\sqrt{n}$.
The approximation gets better as n gets larger.
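A small simulation can illustrate the remark. The sketch below draws repeated samples from a clearly non-normal population and compares the spread of the sample means with σ/√n. The exponential population with θ = 5 (so μ = σ = 5), the sample size of 30, and the 5,000 replications are all assumptions chosen for the illustration.

```python
import math
import random
import statistics

def sample_means(theta, n, reps, rng):
    """Means of `reps` samples of size n from an exponential population."""
    means = []
    for _ in range(reps):
        total = sum(-theta * math.log(1.0 - rng.random()) for _ in range(n))
        means.append(total / n)
    return means

means = sample_means(theta=5.0, n=30, reps=5000, rng=random.Random(0))
print(statistics.mean(means))     # close to mu = 5
print(statistics.stdev(means))    # close to sigma / sqrt(n)
print(5.0 / math.sqrt(30))        # sigma / sqrt(n) for comparison
```

A histogram of `means` would look roughly bell-shaped even though the underlying exponential population is strongly skewed.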
Central Limit Theorem - Example
A manufacturing process is intended to produce parts with a mean
diameter of 5 mm. An engineer conjectures that the
population mean is 5.0 mm, and an experiment is
conducted in which 100 parts are selected randomly and
measured. It is known that the population σ = 0.1. The
experiment indicates a sample average diameter
X̄ = 5.027 mm. Does this refute the engineer’s
conjecture?
Solution: Whether the data support or refute the
conjecture depends on the probability that data similar
to those obtained in this experiment can readily occur
when μ = 5.0. In other words, how likely is it that one
can obtain X̄ ≥ 5.027 with n = 100 if the mean is equal
to μ = 5.0?
Solution
The probability that we choose to compute is given
by P[(X̄ - 5) ≥ 0.027]. This is the same as asking,
if the mean is 5, what is the chance that X̄ will
exceed it by as much as 0.027?
$$P[(\bar{X} - 5) \ge 0.027] = P\left[\frac{\bar{X} - 5}{0.1/\sqrt{100}} \ge 2.7\right]$$
Solution (Continued)
Here we are simply standardizing the sample mean
according to the Central Limit Theorem.
$$P\left[\frac{\bar{X} - 5}{0.1/\sqrt{100}} \ge 2.7\right] = P[Z \ge 2.7] = 0.0035$$
Thus one would experience by chance a sample
mean that is at least 0.027 mm above the population
mean in only about 3.5 of 1000 experiments.
Therefore the sample data do not support the
engineer’s conjecture.
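The tail probability can be confirmed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This sketch works directly with the approximate sampling distribution of the mean under the conjecture, N(5.0, 0.1/√100), instead of standardizing first.

```python
import math
from statistics import NormalDist

# Sampling distribution of the mean under the conjecture: N(5.0, 0.1 / sqrt(100)).
xbar_dist = NormalDist(mu=5.0, sigma=0.1 / math.sqrt(100))
p = 1.0 - xbar_dist.cdf(5.027)   # P(sample mean >= 5.027)
print(round(p, 4))  # 0.0035
```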
Sampling Distribution of X̄ with Unknown σ
Let X1, X2, ..., Xn be independent random variables
that have a normal distribution with mean μ and
unknown standard deviation σ. Let
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
and
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
Then the random variable
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
has a t-distribution with ν = n - 1 degrees of freedom.
Sampling Distribution of S²
If S² is the variance of a random sample of size n
taken from a normal population having
variance σ², then the statistic
$$\chi^2 = \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}$$
has a chi-squared distribution with ν = n - 1
degrees of freedom.
Example
A manufacturer of car batteries guarantees that his
product will last, on average, 3 years with a
standard deviation of 1 year. If five batteries have
lifetimes of 1.9, 2.4, 3.0, 3.5 and 4.2 years, should the
manufacturer still be convinced that his batteries have
a standard deviation of 1 year? Assume that battery
lifetime follows a normal distribution.
Solution: We first find the sample variance:
$$s^2 = \frac{(5)(48.26) - (15)^2}{(5)(4)} = 0.815$$
Solution (Continued)
Then
$$\chi^2 = \frac{(4)(0.815)}{1} = 3.26$$
is a value from a chi-squared distribution with 4
degrees of freedom. Since 95% of the χ² values
with 4 degrees of freedom fall between 0.484 and 11.143,
the computed value with σ² = 1 is reasonable, and
therefore the manufacturer has no reason to suspect that
the standard deviation is other than 1 year.
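The battery calculation can be reproduced in a few lines. This is a verification sketch: the sample variance comes from `statistics.variance`, and the limits 0.484 and 11.143 are simply the table values quoted above, not computed here.

```python
import statistics

lifetimes = [1.9, 2.4, 3.0, 3.5, 4.2]
s2 = statistics.variance(lifetimes)           # sample variance, n - 1 divisor
chi2 = (len(lifetimes) - 1) * s2 / 1.0 ** 2   # claimed sigma^2 = 1
lower, upper = 0.484, 11.143                  # table values quoted in the text
print(round(s2, 3), round(chi2, 2), lower < chi2 < upper)  # 0.815 3.26 True
```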