Download Chi-Square Test

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
SIMULATION MODELING AND ANALYSIS
WITH ARENA
T. Altiok and B. Melamed
Chapter 7
Input Analysis
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
1
Input Analysis Activities
• Input Analysis activities consist of the following stages:
Stage 1:
Stage 2:
Stage 3:
Stage 4:
data collection
data analysis
modeling time series data
goodness-of-fit testing
• Random variables with negligible variability are simplified
and modeled as deterministic quantities.
• Unknown distributions are postulated to have a particular
functional form that incorporates any available partial
information.
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
2
Data Collection
•
To illustrate data collection activities, consider modeling a
painting station, where
•
•
•
•
•
jobs arrive at random, wait in the buffer until the sprayer is available
having been sprayed, they leave the station
suppose that the spray nozzle can get clogged – an event that
results in a stoppage during which the nozzle is cleaned or replaced.
suppose further that the measure of interest is the expected job delay in
the buffer.
The data collection activity in this simple case would consist of
the following tasks:
1.
2.
3.
4.
collection of job inter-arrival times
collection of painting times
collection of times between nozzle clogging
collection of nozzle cleaning/replacement times
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
3
Data Analysis
•
Data Analysis deals with statistics of empirical data:
•
•
•
•
statistics related to moments (mean, standard deviation, coefficient of variation, etc.)
statistics related to distributions (histograms)
statistics related to temporal dependence (autocorrelations within an empirical time
series, or cross-correlations among two or more distinct time series)
For example, consider the sample of 100 repair time observations
12.9
20.9
30.0
17.0
11.0
10.3
10.9
21.0
22.8
10.8
20.5
22.2
14.3
13.3
28.6
19.4
18.9
16.7
12.7
19.5
27.7
26.6
27.4
21.7
27.5
18.0
27.0
21.3
25.9
10.3
19.4
25.5
29.9
24.0
26.9
27.4
11.9
28.5
18.1
11.9
13.5
29.1
18.8
13.7
22.5
11.5
24.2
23.1
22.4
15.1
10.9
17.2
17.8
29.7
20.7
22.5
13.2
19.9
15.0
22.9
13.7
22.4
25.3
15.5
27.1
14.1
25.6
15.8
13.8
19.0
24.1
10.9
19.8
18.1
22.0
28.3
10.9
18.5
21.0
23.2
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
22.2
10.7
15.0
23.2
25.2
24.0
22.4
13.2
16.6
27.9
10.9
15.6
17.6
28.4
16.8
27.1
22.1
16.5
25.7
18.9
4
Data Analysis Example
•
Data Analysis of the repair time data produced the histogram and summary
statistics shown below
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
5
Modeling Time Series Data
•
Independent observations are modeled as a renewal time series, namely,
a sequence of iid random variables. In this case, the analyst’s task is to merely
identify (fit) a “good” distribution and its parameters to the empirical data.
•
•
Dependent observations are modeled as random processes with temporal
dependence. In this case, the analyst’s task is to identify (fit) a “good”
probability law to empirical data. This is a far more difficult task
than the previous one, and often requires advanced mathematics.
•
•
•
Arena provides built-in facilities for fitting distributions to empirical data.
Arena does not provide facilities for fitting dependent random processes
An advanced method is described, however, in Chapter 10
Examples:
•
Observed sequences of arrival times to a queue are often modeled as iid
exponential inter-arrival times (i.e., Poisson processes)
•
For observed sequence of times to failure and the corresponding repair times,
the associated uptimes may be modeled as a Poisson process, and the downtimes
as a renewal process or as a dependent process (e.g., Markov process)
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
6
Modeling Empirical Distributions
•
The simplest approach is to construct a histogram from the empirical data
(sample), and then normalize it to a step pdf or a pmf, depending on the
underlying state space. The obtained pdf or pmf is then declared to be the
fitted distribution.
The main advantage of this approach is that no assumptions are required on
the functional form (shape) of the fitted distribution.
•
The previous approach may reveal (by inspection) that the histogram pdf
has a particular functional form (e.g., decreasing, bell shape, etc.).
The analyst may then try to obtain a better fit, by postulating a particular
class of distributions having that shape, and then proceeding to estimate
(fit) its parameters from the sample, using such common techniques as the
method of moments and the maximum likelihood estimation (MLE) method.
This approach can be further generalized to multiple functional forms by
searching for the best fit among a number of postulated classes of
distributions.
•
The Arena Input Analyzer provides facilities for both fitting approaches.
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
7
Method of Moments
•
The method of moments fits the moments of a candidate model to sample
moments, using appropriate empirical statistics as constraints on the
candidate model parameters.
•
As an example, consider a random variable X and a data sample whose first
two moments, m 1 and m 2 are estimated as m̂ = 8.5 and m̂ = 125.3 .
1
•
2
Write the formulas for the mean and variance of a gamma distribution,
connecting the first two moments of a gamma distribution with its parameters,
and a , namely b
mˆ = a b
1
mˆ = a b (1 + b )
2
•
Substitute into the above the previous estimates
aˆ bˆ = 8.5
aˆ bˆ (1 + bˆ ) = 125.3
•
Solve the above equation to obtain â = 0.62, bˆ = 13.74
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
8
Maximal-likelihood Estimation (MLE)
•
•
The Maximal-likelihood Estimation (MLE) method postulates a particular
class of distributions (e.g., normal, uniform, exponential, etc.), and then
estimates their parameters from the sample, such that the resulting parameters
give rise to the maximal likelihood (highest probability or density) of
obtaining the sample. More precisely,
•
Let f (x ; q ) be the postulated pdf, as a function of its ordinary argument, x ,
as well as the unknown parameter q (possibly be a vector of parameters, but here
is assume a scalar for simplicity)
•
Let x , ¼ , x
1
N
be a sample of independent observations
The MLE method estimates L (x , ¼ , x ; q ) via the likelihood function
1
N
L ( x , ¼ , x ; q ) = f ( x ; q ) f ( x ; q )L f ( x ; q )
1
N
1
2
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
N
9
MLE Method Examples
•
For the exponential distribution Expo(l ) with parameter qˆ = lˆ ,
•
the corresponding maximal likelihood function is
L (x 1, ¼ , x N ; l ) = l
•
- l x1
e
l
- l x2
e
L
l
- l xN
e
= l
N
- l
e
N
å xi
i =1
the log-likelihood function is
N
ln L (x , ¼ , x ; l ) = N ln(l - l å x )
1
N
i =1 i
•
the value of l
that maximizes ln L (x , ¼ , x ; l ) is obtained by
1
differentiating it with respect to l
•
N
and setting the derivative to zero, that is
N
d
N
ln L (x , ¼ , x ; l ) =
- å x = 0
1
N
i=1 i
dl
l
solving the above in l yields the maximal likelihood estimate
N
1
lˆ = N
=
x
å x
i=1 i
•
For the uniform distribution Unif(a,b), a similar computation yields the MLE
estimates aˆ = min{x : 1£ i £ N }, bˆ = max{x : 1£ i £ N }
i
i
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
10
The Arena Input Analyzer
The Arena Input Analyzer is a tool that fits a distribution to
sample data.
Distribution
Arena Name
Arena Parameters
Exponential
EXPO
Mean
Normal
NORM
Mean, StdDev
Triangular
TRIA
Min, Mode, Max
Uniform
UNIF
Min, Max
Erlang
ERLA
ExpoMean, k
Beta
BETA
Beta, Alpha
Gamma
GAMM
Beta, Alpha
Johnson
JOHN
G, D, L, X
Log Normal
LOGN
LogMean, LogStdDev
Poisson
POIS
Mean
Weibull
WEIB
Beta, Alpha
Continuous
CONT
P1, V1, …
Discrete
DISC
P1, V1, …
Arena-supported distributions and their parameters
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
11
Best-fit uniform distribution for the repair time data
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
12
Best-fit beta distribution for the repair time data
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
13
Best-fit gamma distribution for a sample of lead time data
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
14
Fit All Summary for a sample of lead time data
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
15
Goodness-of-Fit Tests for Distributions
•
Tests of goodness-of-fit for distributions determine the
likelihood that an empirical sample is drawn from a given
distribution
•
•
•
•
•
a statistical hypothesis is formulated
a statistic is computed from the empirical data
the distribution of the statistic is assumed known under the null
hypothesis, allowing the computation of the probability that it exceeds
the observed value
rejection or acceptance decisions can be taken at a given significance
level, but these are subject to Type I and Type II statistical errors
Common goodness-of-fit tests for distributions:
1.
2.
Chi-Square test
Kolmogorov-Smirnov test
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
16
Chi-Square Test
•
The Chi-Square test compares the empirical histogram density,
constructed from sample data, to a candidate theoretical
density
•
assume that the empirical sample x , ¼ , x is a set of N iid
1
N
realizations from an underlying (unknown) random variable, X .
•
this sample is used to construct an empirical histogram with J cells,
where cell j corresponds to the interval [l , r )
j
•
j
The estimator of the probability p j = Pr{ X Î [l j , r j )} of cell j is
pˆ =
j
N
N
j
,
j = 1, K , J
is the number of observations in cell j
•
N
•
it is commonly suggested to take N > 5 for statistical reliability)
j
j
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
17
Chi-Square Test (Cont.)
•
Let FX (x ) be some theoretical candidate distribution of the
random variable X whose goodness-of-fit is to be assessed
•
Compute the corresponding theoretical probabilities
p = Pr{ X Î [l , r )} = F (r ) - F (l ), j = 1, K , J
j
•
j
j
X
j
X
j
for continuous data we have
rj
p j = FX (r j ) - FX (l j ) = òl f X (x ) dx , j = 1, K , J
j
where f (x ) is the density of X
X
•
The Chi-square test statistic is then given by
c
2
J
(N j - N p j )2
j =1
N pj
= å
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
18
Chi-Square Test Example
•
As an example, consider the repair time sample data of size N = 100,
given earlier, for which a histogram with J = 10 cells was constructed by the
Input Analyzer
•
The table below displays the elements of the Chi-Square test for the repair data
Cell
Number
Cell
Interval
j
[l , r )
1
2
3
4
5
6
7
8
9
10
[10,12)
[12,14)
[14.16)
[16,18)
[18,20)
[20,22)
[22,24)
[24,26)
[26,28)
[28,30)
j
j
Number of
Relative Theoretical
Observations Frequency Probability
N
j
13
9
8
9
12
8
13
10
10
8
pˆ
j
0.13
0.09
0.08
0.09
0.12
0.08
0.13
0.10
0.10
0.08
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
p
j
0.10
0.10
0.10
0.10
0.10
0.10
0.10
0.10
0.10
0.10
19
Chi-Square Test Example (Cont.)
•
The histogram of the repair data suggests that a uniform distribution
Unif(a,b) is an acceptably good fit to the sample repair data
•
The parameters of the uniform distribution are estimated as:
aˆ = min{x : 1£ i £ N } = 10, bˆ = max{x : 1£ i £ N } = 30
2
•
i
i
The Chi-Square statistic computation yields
10
e = å [pˆ j - p j ]2 = ( 0.13 - 0.10) 2 + L + ( 0.08 - 0.10) 2 = 0.0036
2
j=1
(13 - 10) 2
(8 - 10) 2
c =
+L +
= 3. 6
10
10
2
•
A Chi-Square table shows that for significance level a = 0.10
and d = 10 - 2 - 1 = 7 degrees of freedom, the critical value is c = 12.0
•
Since the test statistic computed above is c = 3.6 < 12.0,
we accept the null hypothesis that the uniform distribution Unif(10,30)
is an acceptably good fit to the sample repair data
2
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
20
Kolmogorov-Smirnov Test
•
The Kolmogorov-Smirnov (K-S) test compares the empirical cdf
to a theoretical counterpart
•
•
while, the Chi-Square test requires a considerable amount of data
(at least to set up a reasonably “smooth” histogram), the K-S test can get
away with smaller samples, since it does not require a histogram
The K-S test procedure proceeds as follows:
•
sort the sample x 1, ¼ , x N is ascending order as x , ¼ , x
( 1)
(N )
•
constructs the empirical cdf
Fˆ (x ) =
max{ j : x
X
•
(j )
< x}
N
construct the K-S test statistic
K S = max{x : | Fˆ X (x ) - FX (x ) |}
The smaller is the observed value of KS, the better is the fit
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
21
Multi-Modal Distributions
•
A mode of a distribution is that value of its associated pdf or
pmf at which the respective function attains a maximal value
•
A uni-modal distribution has exactly one mode
•
A multi-modal distribution is one whose associated pdf or pmf is
of the following form:
1.
2.
It has more than one mode
It has only one mode, but it is either not monotone increasing to the left of
its mode, or not monotone decreasing to the right of its mode
Thus, a multi-modal distribution has a pdf or pmf with multiple “humps”
•
One approach to Input Analysis of multi-modal samples is:
1.
2.
3.
Separate the sample into mutually exclusive uni-modal sub-samples
Fit a separate distribution to each sub-sample
The fitted models are then combined into a final model according to the
relative frequency of each sub-sample
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
22
Multi-Modal Distribution Example
•
Consider a sample of N observations such that
•
N 1 observations appear to form a uni-modal distribution in an interval I 1
•
N 2 observations appear to form a uni-modal distribution in an interval I 2
•
N1+ N2 = N
•
Suppose that the theoretical distributions, F1(x ) and F2 (x ) ,
are fitted separately to the respective sub-samples
•
The combined distribution to be fitted the entire sample
is defined by
FX (x ) =
•
N1
N
F1(x ) +
N2
N
F2 (x )
The distribution above is a legitimate distribution, formed as a
probabilistic mixture of the two distributions, F1 (x ) and F2 (x )
Altiok / Melamed Simulation Modeling and Analysis with Arena
Chapter 7
23