Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Computer simulation wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

SahysMod wikipedia , lookup

Corecursion wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Taylor's law wikipedia , lookup

Data assimilation wikipedia , lookup

Pattern recognition wikipedia , lookup

Generalized linear model wikipedia , lookup

Transcript
CPSC 531:Input Modeling
Instructor: Anirban Mahanti
Office: ICT 745
Email: [email protected]
Class Location: TRB 101
Lectures: TR 15:30 – 16:45 hours
Class web page:
http://pages.cpsc.ucalgary.ca/~mahanti/teaching/F05/CPSC531
Notes from “Discrete-event System Simulation” by
Banks, Carson, Nelson, and Nicol, Prentice Hall,
2005, and
“Simulation Modeling and Analysis” by Law and
Kelton, McGraw Hill, 2000.
CPSC 531:Input Modeling
1
Outline
 Quality of output depends on the input models
driving the simulation
 This module discusses the following:
Data collection from the real system
 Hypothesize probability distributions
 Choose parameters for the distributions
 Goodness of fit test – how well does the fitted
distribution model available data
 Selecting distributions in absence of data
 Models of arrival process (Poisson Process, Nonstationary Poisson Process, Batch Arrivals)

CPSC 531:Input Modeling
2
Data Collection
 Plan ahead: begin by a practice or pre-observing






session, watch for unusual circumstances
Analyze the data as it is being collected: check
adequacy
Combine homogeneous data sets, e.g. successive time
periods, during the same time period on successive
days
Be aware of data censoring: the quantity is not
observed in its entirety, danger of leaving out long
process times
Check for relationship between variables, e.g. build
scatter diagram
Check for autocorrelation
Collect input data, not performance data
CPSC 531:Input Modeling
3
Identifying Probability Distributions
 Several possible techniques (may use a
combination of these)

Prior knowledge of the random variable’s role
• Inter-arrival times are exponential if arrivals occur one at
a time, have constant mean rate, and are independent
• Service times are not normally distributed because service
time can not be negative
• Product of many independent pieces may imply Lognormal
Use the physical basis of the distribution as guide
 Summary statistics
 Histograms

CPSC 531:Input Modeling
4
Distribution Guide
 Use the physical basis of the distribution as a guide,
for example:








Binomial: # of successes in n trials
Poisson: # of independent events that occur in a fixed amount
of time or space
Normal: dist’n of a process that is the sum of a number of
component processes
Exponential: time between independent events, or a process
time that is memoryless
Weibull: time to failure for components
Discrete or continuous uniform: models complete uncertainty
Triangular: a process for which only the minimum, most likely,
and maximum values are known
Empirical: resamples from the actual data collected
CPSC 531:Input Modeling
5
Summary Statistics
Sample mean : X(n)
Symmetricdistribution if X(n) and samplemedian are close
^
For continuousdistributions, estimatecv   /  by cv  S (n) / X(n);
cv  1 : candidatesare gamma or Weibullwith shape parameter  1
cv  1 : candidateexponential
cv  1 : candidatesare gamma or Weibullwith shape parameter  1
For discretedistributions, estimatelexis ratio   2 /  by S 2 (n) / X(n);
  1 : candidatebinomial
  1 : candidatePoisson
  1 : candidatenegativebinomialor geometric
See T able 6.5 in LK00
CPSC 531:Input Modeling
6
Histograms
 A frequency distribution or histogram is useful in
determining the shape of a distribution
 The number of class intervals depends on:



The number of observations
The dispersion of the data
Suggested: the square root of the sample size
 For continuous data:

Corresponds to the probability density function of a
theoretical distribution
 For discrete data:

Corresponds to the probability mass function
 If few data points are available: combine adjacent
cells to eliminate the ragged appearance of the
histogram
CPSC 531:Input Modeling
7
Histogram for Continuous Data
Histogram
 Sample n = 100 interarrival




Request arrival approximately
stationary – # of requests
arriving in 10-second periods
approx. equal
Sample mean = 0.534 s;
median = 0.398; CV = 0.98
exponential distribution?
Right-hand side shows two
histograms: top with interval
or bin size of 0.1 s; bottom
with bin size of 0.25 s.
Frequency
0.15
0.1
0.05
0
0.1
0.6
1.1
1.6
2.1
2.6
Bin
Histogram
0.35
0.3
Frequency
times of requests to a Web
server during a 1-minute
period (see Web page)
0.2
0.25
0.2
0.15
0.1
0.05
0
0.25
0.75
1.25
1.75
2.25
2.75
Bin
CPSC 531:Input Modeling
8
Histogram for Discrete Data
 Sample n = 100 observations of the number of items
requested from a job shop per week over a long time
period


(# req., # observations): {(0,1), (1,3), (2,8), (3,14), (4, 18),
(5,17), (6,16), (7,10), (8,8), (9,4), (10, 1)}
Mean = 4.94, variance = 4.4, Lexis ratio = 0.9
Poisson distribution?Histogram
0.2
0.16
0.12
h(x)

0.08
0.04
0
0
1
2
3
4
5
6
7
8
9
10
x
CPSC 531:Input Modeling
9
Parameter Estimation
 Next step after selecting a family of distributions
 If observations in a sample of size n are X1, X2, …, Xn
(discrete or continuous), the sample mean and variance
n
n
are:
2
2

X
i 1
Xi
n
S
2


i 1
X i  nX
n 1
 If the data are discrete and have been grouped in a
frequency distribution:
 j 1 f j X j
n
X
n
2
2
f
X

n
X
 j 1 j j
n
S2 
n 1
where fj is the observed frequency of value Xj
CPSC 531:Input Modeling
10
Parameter Estimation
 When raw data are unavailable (data are grouped into
class intervals), the approximate sample mean and
variance are:
 j 1 f j X j
c
X
n


n
S2
2
2
f
m

n
X
j 1 j j
n 1
where fj is the observed frequency of in the jth class
interval
mj is the midpoint of the jth interval, and c is the number of class
intervals
 A parameter is an unknown constant, but an estimator
is a statistic.
CPSC 531:Input Modeling
11
How Representative are the fits?
| F ( x) |tt  b
over the histogram and
look for similarities
0.4
0.35
Fitted Dist
0.3
Frequency
 Continuous data – plot
Histogram
^
0.25
0.2
0.15
0.1
0.05
0
 Discrete data – compare
0.25
0.75
1.25
1.75
2.25
2.75
Bin
the observed frequency
with the expected
frequency
Histogram
0.2
h(x)
0.16
0.12
0.08
 Try a Quantile-Quantile
0.04
Plot
0
0
1
2
3
4
5
6
7
8
9
10
x
Observed
CPSC 531:Input Modeling
12
Quantile-Quantile Plots
 Q-Q plot is a useful tool for evaluating distribution fit
 If X is a random variable with cdf F, then the q-
quantile of X is the g such that
F( g )  P(X  g )  q,

for 0  q  1
When F has an inverse, g = F-1(q)
 Let {xi, i = 1,2, …., n} be a sample of data from X and {yj,
j = 1,2, …, n} be the observations in ascending order:
 j - 0.5 
y j is approximately F -1 

 n 
where j is the ranking or order number
CPSC 531:Input Modeling
13
Quantile-Quantile Plots
 The plot of yj versus F-1( (j-0.5)/n) is


Approximately a straight line if F is a member of an
appropriate family of distributions
The line has slope 1 if F is a member of an appropriate family
of distributions with appropriate parameter values
CPSC 531:Input Modeling
14
Quantile-Quantile Plots
 Example: Check whether the door installation times
follows a normal distribution [BCNN05]

The observations are now ordered from smallest to largest:
j
1
2
3
4
5

Value
99.55
99.56
99.62
99.65
99.79
j
6
7
8
9
10
Value
99.98
100.02
100.06
100.17
100.23
j
11
12
13
14
15
Value
100.26
100.27
100.33
100.41
100.47
yj are plotted versus F-1( (j-0.5)/n) where F has a normal
distribution with the sample mean (99.99 sec) and sample
variance (0.28322 sec2)
CPSC 531:Input Modeling
15
Quantile-Quantile Plots [BCNN05]
 Example (continued): Check whether the door
installation times follow a normal distribution.
Straight line,
supporting the
hypothesis of a
normal distribution
Superimposed
density function of
the normal
distribution
CPSC 531:Input Modeling
16
Quantile-Quantile Plots [BCNN05]
 Consider the following while evaluating the linearity of
a q-q plot:



The observed values never fall exactly on a straight line
The ordered values are ranked and hence not independent,
unlikely for the points to be scattered about the line
Variance of the extremes is higher than the middle. Linearity
of the points in the middle of the plot is more important.
 Q-Q plot can also be used to check homogeneity
 Check whether a single distribution can represent both
sample sets
 Plotting the order values of the two data samples against
each other
CPSC 531:Input Modeling
17
Goodness-of-Fit Tests [BCNN05]
 Conduct hypothesis testing on input data distribution
using:


Kolmogorov-Smirnov (KS) test
Chi-square test
 No single correct distribution in a real application
exists.


If very little data are available, it is unlikely to reject any
candidate distributions
If a lot of data are available, it is likely to reject all candidate
distributions
CPSC 531:Input Modeling
18
Chi-Square Test [BCNN05]
 Compare histogram of the data to the shape of the
candidate distribution function
 Valid for large sample sizes when parameters are
estimated by maximum likelihood
 Arrange the n observations into a set of k class
intervals or cells, the test statistics is:
 02
Observed
Frequency
(Oi  Ei ) 2

Ei
i 1
k
Expected Frequency
Ei = n*pi
where pi is the theoretical
prob. of the ith interval.
Suggested Minimum = 5
which approximately follows the chi-square distribution with k-s-1
degrees of freedom, where s = # of parameters of the hypothesized
distribution estimated by the sample statistics.
CPSC 531:Input Modeling
19
Chi-Square Test
 Null hypothesis – observations come from a specified
distribution cannot be rejected at a significance of α if:
02  [21 , k  s 1]
Obtained from a table
 Comments:



Errors in cells with small Ei’s affect the test statistics more
than cells with large Ei’s.
Minimum size of Ei debated: [BCNN05] recommends a value of
3 or more; if not combine adjacent cells.
Test designed for discrete distributions and large sample sizes
only. For continuous distributions, Chi-Square test is only an
approximation (i.e., level of significance holds only for n->∞).
CPSC 531:Input Modeling
20
Chi-Square Test
 Example 1: 500 random numbers generated using a random
number generator; observations categorized into cells at
intervals of 0.1, between 0 and 1. At level of significance of 0.1,
are these numbers IID U(0,1)?
Interval
1
2
3
4
5
6
7
8
9
10
Oi
Ei
50
48
49
42
52
45
63
54
50
47
500
50
50
50
50
50
50
50
50
50
50
[(Oi-Ei)^2]/Ei
0
0.08
0.02
1.28
0.08
0.5
3.38
0.32
0
0.18
5.84
 02  5.85; from the table [20.9,9]  14.68;
Hypothesisacceptedat significance levelof 0.10.
CPSC 531:Input Modeling
21
Chi-Square test [BCNN05]
 Example 2: Vehicle Arrival
H0: the random variable is Poisson distributed.
H1: the random variable is not Poisson distributed.

xi
Observed Frequency, Oi
Expected Frequency, Ei
0
1
2
3
4
5
6
7
8
9
10
> 11
12
10
19
17
19
6
7
5
5
3
3
1
100
2.6
9.6
17.4
21.1
19.2
14.0
8.5
4.4
2.0
0.8
0.3
0.1
100.0
(Oi - Ei )2/Ei
7.87
0.15
0.8
4.41
2.57
0.26
11.62
Ei  np ( x )
e   x
n
x!
Combined because
of min Ei
27.68
Degree of freedom is k-s-1 = 7-1-1 = 5, hence, the hypothesis
is rejected at the 0.05 level of significance.
 02  27.68   02.05,5  11.1
CPSC 531:Input Modeling
22
Chi-Square Test
 If the distribution tested is continuous:
pi 

ai
ai1
f ( x) dx  F (ai )  F (ai 1 )
where ai-1 and ai are the endpoints of the ith class interval
and f(x) is the assumed pdf, F(x) is the assumed cdf.


Recommended number of class intervals (k):
Sample Size, n
Number of Class Intervals, k
20
Do not use the chi-square test
50
5 to 10
100
10 to 20
> 100
n1/2 to n/5
Caution: Different grouping of data (i.e., k) can affect the
hypothesis testing result.
CPSC 531:Input Modeling
23
Kolmogorov-Smirnov (KS) Test
 Difference between observed CDF F0(x) and expected
CDF Fe(x) should be small; formalizes the idea behind
the Q-Q plot.
 Step 1: Rank observations from smallest to largest:
Y1 ≤ Y2 ≤ Y3 ≤ … ≤ Yn
 Step 2: Define Fe(x) = (#i: Yi ≤ x)/n
 Step 3: Compute K as follows:
K  max | Fe ( x)  Fo ( x) |
x
j
j 1
K  max {  Fe (Y j ), Fe (Y j ) 
}
n
1 j  n n
CPSC 531:Input Modeling
24
KS Test
 Example: Test if given population is exponential with parameter β
= 0.01; that is Fe(x) = 1 – e–βx; K[0.9,15] = 1.0298.
KS Test for exponential distribution
beta
0.01 N
15
Y_j
j
(j/n)-F(Yj) F(Yj)-(j-1)/n
5
1 0.017896 0.048771
6
2 0.075098 -0.00843
6
3 0.141765
-0.0751
17
4 0.110331 -0.04366
25
5 0.112134 -0.04547
39
6 0.077057 -0.01039
60
7 0.015478 0.051188
61
8 0.076684 -0.01002
72
9 0.086752 -0.02009
74
10 0.143781 -0.07711
104
11 0.086788 -0.02012
150
12 0.02313 0.043537
170
13 0.04935 0.017316
195
14 0.075607 -0.00894
229
15 0.101266
-0.0346
MAX
0.143781 0.051188
CPSC 531:Input Modeling
25
KS Test
 KS test suitable for small samples, continuous
as well as discrete distributions
 KS test, unlike the Chi-Square test, uses each
observation in the given sample without
grouping data into cells (intervals).
 KS test is exact provided all parameters of
the expected distribution are known.
CPSC 531:Input Modeling
26
Selecting Model without Data
 If data is not available, some possible sources to
obtain information about the process are:




Engineering data: often product or process has performance
ratings provided by the manufacturer or company rules
specify time or production standards.
Expert option: people who are experienced with the process or
similar processes, often, they can provide optimistic,
pessimistic and most-likely times, and they may know the
variability as well.
Physical or conventional limitations: physical limits on
performance, limits or bounds that narrow the range of the
input process.
The nature of the process.
 The uniform, triangular, and beta distributions are
often used as input models. [ see LK00 for details]
CPSC 531:Input Modeling
27
Models of Arrival Processes
 Poisson Process
 Non-stationary Poisson
 Batch Arrivals
CPSC 531:Input Modeling
28
Poisson Process
 Definition: Let N(t) denote the number of arrivals that occur
in time interval [0,t].
 The stochastic process {N(t), t>=0} is a Poisson process with
mean rate l if:





N(0) = 0
Arrivals occur one at a time
{N(t), t>=0} has stationary increments – number of arrivals in a
given interval depends only on the length of the interval, not its
location
{N(t), t>=0} has independent increments – number of arrivals in
disjoint time intervals are independent.
And …
P{N (h)  1}
lim
l
h
P{N (h)  2}
0
lim
h
h o
h o
CPSC 531:Input Modeling
29
Poisson Process: Interarrival Times
 Consider the interarrival times of a Possion process (A1, A2, …),
where Ai is the elapsed time between arrival i and arrival i+1


The 1st arrival occurs after time t iff there are no arrivals in the
interval [0,t], hence:
P{A1 > t} = P{N(t) = 0} = e-lt
P{A1 <= t} = 1 – e-lt
[cdf of exp(l)]
Interarrival times, A1, A2, …, are exponentially distributed and
independent with mean 1/l
Arrival counts
~ Poisson(l)
Stationary & Independent
Interarrival time
~ Exp(1/l)
Memoryless
CPSC 531:Input Modeling
30
Poisson Process: Splitting and Pooling
 Splitting:
 Suppose each event of a Poisson process can be classified as
Type I, with probability p and Type II, with probability 1-p.
 N(t) = N1(t) + N2(t), where N1(t) and N2(t) are both Poisson
processes with rates l p and l (1-p)
N(t) ~ Poisson(l)
lp
l
N1(t) ~ Poisson[lp]
N2(t) ~ Poisson[l(1-p)]
l(1-p)
 Pooling:
 Suppose two Poisson processes are pooled together
 N1(t) + N2(t) = N(t), where N(t) is a Poisson processes with
rates l1 + l2
N1(t) ~ Poisson[l1]
N2(t) ~ Poisson[l2]
l1
l 1  l2
N(t) ~ Poisson(l1  l2)
l2
CPSC 531:Input Modeling
31
Non-stationary Poisson
Process (NSPP)
 Poisson Process without the stationary increments, characterized
by l(t), the arrival rate at time t.
 The expected number of arrivals by time t, L(t):
Λ(t) 
t
 λ(s)ds
0
 Relating stationary Poisson process n(t) with rate
l1 and NSPP
N(t) with rate l(t):
 Let arrival times of a stationary process with rate l = 1 be t1,
t2, …, and arrival times of a NSPP with rate l(t) be T1, T2, …, we
know:
ti = L(Ti)
Ti = L1(ti)
CPSC 531:Input Modeling
32
Non-stationary Poisson
Process (NSPP)
 Example: Suppose arrivals to a Post Office have rates 2 per
minute from 8 am until 12 pm, and then 0.5 per minute until 4 pm.
 Let t = 0 correspond to 8 am, NSPP N(t) has rate function:
2, 0  t  4
l (t )  
0.5, 4  t  8
Expected number of arrivals by time t:
0t 4
2t ,
t
t
L(t )   4
2
ds

0
.
5
ds

 6, 4  t  8
4
0
2
 Hence, the probability distribution of the number of arrivals
between 11 am and 2 pm.
P[N(6) – N(3) = k]
= P[N(L(6)) – N(L(3)) = k]
= P[N(9) – N(6) = k]
= e(9-6)(9-6)k/k! = e3(3)k/k!
CPSC 531:Input Modeling
33
Batch Process
 Let N(t) be the number of batches that have arrived
by time t.
 If interarrival times of batches are IID exponential
random variables, {N(t), t≥0} can be modeled as a
Poisson process.
 Let X(t) = total number of individual customers to
arrive by time t; let Bi = number of customers in the
ith batch; then
N (t )
X (t )   Bi , t  0
i 1
 If Bi’s are IID random variables independent of {N(t) t≥0}, and
if {N(t), t≥0} is a Poisson process, then the stochastic process
{X(t), t≥0} is a Compound Poisson process
CPSC 531:Input Modeling
34