Statistical Distribution Fitting
Dr. Jason Merrick
Some Issues in Fitting Input Distributions
• Not an exact science: no “right” answer
• Consider physical or logical process that generates the data
• Consider range of distribution
– Infinite both ways (e.g., normal)
– Positive (e.g., exponential, gamma)
– Bounded (e.g., beta, uniform)
• Consider ease of parameter manipulation to affect means, variances - decision variables
• Outliers, multimodal data
– Maybe split data set (see textbook for details)
– Consider theoretical vs. empirical
Simulation with Arena — Statistical Distribution Fitting
C5/2
Eyeballing
• One way to see if a sample of data fits a distribution is to
– draw a frequency histogram
– estimate the parameters of the possible distribution
– draw the probability density function
– see if the two shapes are similar
[Figure: frequency histogram; horizontal axis: data values, vertical axis: frequency]
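The four eyeballing steps can be sketched numerically. The block below is a minimal illustration rather than the course's own procedure: it assumes an exponential candidate distribution and a simulated sample, and the helper names (`histogram_density`, `exponential_pdf`) are hypothetical. It prints the histogram height next to the fitted pdf at each bin centre so the two shapes can be compared.

```python
import math
import random


def histogram_density(data, num_bins):
    """Equal-width histogram scaled so that its total area is 1."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # Clamp the index so the largest value lands in the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(data)
    centers = [lo + (i + 0.5) * width for i in range(num_bins)]
    densities = [c / (n * width) for c in counts]
    return centers, densities


def exponential_pdf(x, rate):
    """Density of the exponential distribution with the given rate."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


rng = random.Random(1)
# Hypothetical sample: 500 exponential values with rate 0.5 (assumption).
sample = [rng.expovariate(0.5) for _ in range(500)]

# Step 2: estimate the parameter (rate = 1 / sample mean).
rate_hat = 1.0 / (sum(sample) / len(sample))

# Steps 1, 3 and 4: histogram, fitted pdf, side-by-side comparison.
centers, densities = histogram_density(sample, 10)
for c, d in zip(centers, densities):
    print(f"x={c:6.2f}  histogram={d:.3f}  fitted pdf={exponential_pdf(c, rate_hat):.3f}")
```

Scaling the histogram to unit area is a design choice: it puts the bar heights on the same scale as the pdf, so the two columns are directly comparable.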
Chi-Squared Test
• Formalizes this notion of distribution fit
– $O_i$ represents the number of observed data values in the i-th interval.
– $p_i$ is the probability of a data value falling in the i-th interval under the hypothesized distribution.
– So we would expect to observe $E_i = n p_i$, if we have n observations
[Figure: histogram of the data values with observed counts $O_i$, alongside the hypothesized pdf with interval probabilities $p_i$]
Chi-Squared Test
• So the chi-squared statistic is
$$\chi_0^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$
• By assuming that the $O_i - E_i$ terms are normally distributed,
– it can be shown that the distribution of the statistic is approximately chi-squared with k-s-1 degrees of freedom
– s is the number of parameters of the distribution
• Hint: consider
$$\chi_0^2 = n \sum_{i=1}^{k} \frac{\left(\frac{O_i}{n} - p_i\right)^2}{p_i}$$
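As a small check on the two forms of the statistic, the sketch below computes both for the same inputs. The counts and interval probabilities are made up for illustration, and the function names are hypothetical.

```python
def chi_squared_statistic(observed, probs):
    """Sum over intervals of (O_i - E_i)^2 / E_i with E_i = n * p_i."""
    n = sum(observed)
    expected = [n * p for p in probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_squared_statistic_hint(observed, probs):
    """The form from the hint: n * sum of (O_i / n - p_i)^2 / p_i."""
    n = sum(observed)
    return n * sum((o / n - p) ** 2 / p for o, p in zip(observed, probs))


# Hypothetical data: n = 50 observations in k = 5 equal-probability intervals.
observed = [12, 9, 14, 8, 7]
probs = [0.2] * 5

stat = chi_squared_statistic(observed, probs)
stat_hint = chi_squared_statistic_hint(observed, probs)
print(stat, stat_hint)
```

Both calls give the same value up to rounding, because dividing the numerator and denominator of each term by $n^2$ and $n$ respectively turns the first form into the second.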
Chi-Squared Test
• So the hypotheses are
– H0: the random variable, X, conforms to the distributional assumption with parameters given by the parameter estimates.
– H1: the random variable does not conform.
• The critical value is then $\chi^2_{\alpha, k-s-1}$, which is also the $100(1-\alpha)\%$-quantile of a gamma distribution with rate 1/2 (that is, scale 2) and shape (k-s-1)/2.
– Reject if $\chi_0^2 > \chi^2_{\alpha, k-s-1}$
• This gives a test with significance level $\alpha$.
– But what about the power of the test?
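The critical value can be computed without a statistics library by using the gamma relationship: a chi-squared variable with $\nu$ degrees of freedom has CDF $P(\nu/2, x/2)$, the regularized lower incomplete gamma function. The sketch below is one way to do this, with a series expansion and bisection; the function names are hypothetical, and a library routine would normally be preferred in practice.

```python
import math


def regularized_lower_gamma(a, x, max_terms=500):
    """P(a, x) from the series x^a e^(-x) / Gamma(a+1) * sum_n x^n / ((a+1)...(a+n))."""
    if x <= 0:
        return 0.0
    term = 1.0
    total = 1.0
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < 1e-15 * total:
            break
    log_prefactor = a * math.log(x) - x - math.lgamma(a + 1.0)
    return math.exp(log_prefactor) * total


def chi2_cdf(x, df):
    """CDF of chi-squared(df), i.e. gamma with shape df/2 and scale 2."""
    return regularized_lower_gamma(df / 2.0, x / 2.0)


def chi2_critical_value(alpha, df):
    """Upper-alpha point: the x with CDF(x) = 1 - alpha, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Example: k = 8 intervals, s = 1 estimated parameter, alpha = 0.05.
k, s, alpha = 8, 1, 0.05
critical = chi2_critical_value(alpha, k - s - 1)
print(f"Reject H0 if the statistic exceeds {critical:.3f}")
```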
Chi-Squared Test
• If the expected frequencies $E_i$ are too small, then the test statistic will not reflect the departure of the observed from the expected frequencies.
– The test can reject because of noise
– In practice a minimum of $E_i \geq 5$ is used
– If $E_i$ is too small for a given interval, then adjacent intervals can be combined
• For discrete distributions
– each possible discrete value can be a class interval
– combine adjacent values if the $E_i$’s are too small
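Combining adjacent intervals can be sketched as a single left-to-right pass. The merging rule below (accumulate until the expected count reaches 5, then fold any leftover tail into the last class) is one reasonable choice, not the only one, and the counts are hypothetical.

```python
def merge_small_intervals(observed, expected, min_expected=5.0):
    """Merge adjacent intervals until every merged E_i is at least min_expected."""
    merged_obs, merged_exp = [], []
    acc_obs, acc_exp = 0, 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs, acc_exp = 0, 0.0
    # Any leftover tail is too small on its own: fold it into the last class.
    if acc_obs or acc_exp:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


# Hypothetical discrete data: counts for the values 0, 1, ..., 7 (n = 100).
observed = [12, 22, 19, 17, 10, 8, 7, 5]
expected = [12.0, 25.4, 27.0, 19.1, 10.1, 4.3, 1.5, 0.6]

merged_obs, merged_exp = merge_small_intervals(observed, expected)
print(merged_obs)
print([round(e, 1) for e in merged_exp])
```

Here the last three values are pooled into one class, so k drops from 8 to 6 and the degrees of freedom k-s-1 must be computed from the merged count.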
Chi-Squared Test
• For continuous data
– intervals that give equal probabilities should be used, not equal length intervals
– this gives a better power for the test
• the power of test is the probability of rejecting a false hypothesis
– it is not known what probability gives the highest power, but we want
$$E_i \geq 5 \;\Rightarrow\; n p_i \geq 5, \qquad p_i = \frac{1}{k} \;\Rightarrow\; \frac{n}{k} \geq 5 \;\Rightarrow\; k \leq \frac{n}{5}$$
Chi-Squared Test
• Example: the exponential distribution
– Suppose that we have n observations, possibly exponential
– We estimate that $\hat{\lambda} = 1/\bar{X}$ using the data
– So we must use $k \leq 10$ intervals (by $k \leq n/5$, this corresponds to n = 50), so we choose 8 to get p = 0.125
– To find the endpoints of the i-th interval, $[a_{i-1}, a_i)$
$$F(a_i) = 1 - e^{-\hat{\lambda} a_i} = ip \;\Rightarrow\; e^{-\hat{\lambda} a_i} = 1 - ip \;\Rightarrow\; a_i = -\frac{1}{\hat{\lambda}} \ln(1 - ip)$$
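The endpoint formula for the exponential example can be evaluated directly. In the sketch below the estimated rate is a placeholder value (the slides do not give the sample mean), and the function names are hypothetical.

```python
import math


def max_intervals(n, min_expected=5):
    """Largest k that satisfies n / k >= min_expected."""
    return n // min_expected


def exponential_endpoints(rate_hat, k):
    """Endpoints a_0, ..., a_k of k equal-probability intervals: a_i = -ln(1 - i/k) / rate."""
    p = 1.0 / k
    inner = [-math.log(1.0 - i * p) / rate_hat for i in range(1, k)]
    return [0.0] + inner + [math.inf]


n = 50
k_max = max_intervals(n)   # 10, so k = 8 is allowed
k = 8
rate_hat = 1.0             # assumed value of the estimate 1 / sample mean

endpoints = exponential_endpoints(rate_hat, k)
for i in range(k):
    print(f"interval {i + 1}: [{endpoints[i]:.4f}, {endpoints[i + 1]:.4f})")
```

Each interval then has probability p = 0.125 under the fitted distribution, so every expected count equals n/k = 6.25.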
Eyeballing
• Another method of seeing if a distribution fits sample data is the q-q plot
– x is the q-quantile of a random variable X with cdf F if $F(x) = q$ or $x = F^{-1}(q)$
– Take a data sample $\{x_1, \ldots, x_n\}$ and order them to get $y_1 \leq y_2 \leq \ldots \leq y_n$
– $y_j$ is an estimate of the $(j - 0.5)/n$ quantile
– Plot $y_j$ versus $F^{-1}((j - 0.5)/n)$
– This should give a straight line
[Figure: q-q plot; vertical axis: Exponential Quantile (0 to 80), horizontal axis: Order Statistics (0 to 20)]
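The q-q construction above can be sketched as a function that returns the points to plot. The example assumes an exponential candidate with its mean estimated from the data, and it uses a simulated sample; both are illustrative assumptions.

```python
import math
import random


def qq_points_exponential(data):
    """Pairs (y_j, F^{-1}((j - 0.5) / n)) for an exponential fitted by its sample mean."""
    n = len(data)
    mean_hat = sum(data) / n
    points = []
    for j, y in enumerate(sorted(data), start=1):
        q = (j - 0.5) / n
        # Inverse CDF of the exponential: F^{-1}(q) = -mean * ln(1 - q).
        theoretical = -mean_hat * math.log(1.0 - q)
        points.append((y, theoretical))
    return points


rng = random.Random(3)
# Hypothetical sample of 20 exponential values with mean 10.
sample = [rng.expovariate(1.0 / 10.0) for _ in range(20)]

points = qq_points_exponential(sample)
for y, t in points:
    print(f"order statistic={y:7.3f}  exponential quantile={t:7.3f}")
```

If the candidate distribution fits, the printed pairs lie close to the line through the origin with slope 1.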
Eyeballing
• Note:
– Will never actually be a straight line
– Order statistics are not independent
– One point above line will likely be followed by another
– The variance at the extremes is larger
– So for exponential, you will likely see more discrepancy at larger values
[Figure: q-q plot; vertical axis: Exponential Quantile (0 to 15), horizontal axis: Order Statistics (0 to 15)]
Kolmogorov-Smirnov Test
• Formalizes the idea of a q-q plot
– The scales are changed by applying the CDF to each axis
– $D^+ = \max_j \{ (j - 0.5)/n - F(y_j) \}$
– $D^- = \max_j \{ F(y_j) - (j - 1 - 0.5)/n \}$
– Note that there is no $D^-$ for some observations (here j = 1)
– The test statistic is given by $D = \max\{D^+, D^-\}$

 j   Order Statistic   Exponential Quantile   F(Order Statistic)   (j-0.5)/n     D+      D-
 1       0.0307               0.14                  0.01              0.05      0.04
 2       0.4838               0.43                  0.17              0.15     -0.02    0.12
 3       1.3364               0.77                  0.39              0.25     -0.14    0.24
 4       2.0778               1.15                  0.54              0.35     -0.19    0.29
 5       2.1446               1.60                  0.55              0.45     -0.10    0.20
 6       2.7039               2.14                  0.64              0.55     -0.09    0.19
 7       3.0289               2.81                  0.68              0.65     -0.03    0.13
 8       4.4276               3.71                  0.81              0.75     -0.06    0.16
 9       4.9919               5.08                  0.85              0.85      0.00    0.10
10       5.5265               8.01                  0.87              0.95      0.08    0.02

[Figure: (j-0.5)/n plotted against F(Order Statistics), both axes from 0 to 1]
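The table values can be reproduced from the ten order statistics. The sketch below follows the definitions on the slide, which use $(j - 0.5)/n$; the textbook Kolmogorov-Smirnov statistic uses $j/n$ and $(j-1)/n$ instead. It assumes the hypothesized distribution is exponential with its mean estimated by the sample mean, which is consistent with the tabulated F values but is not stated on the slide.

```python
import math

# Order statistics y_1 <= ... <= y_10 from the table.
order_stats = [0.0307, 0.4838, 1.3364, 2.0778, 2.1446,
               2.7039, 3.0289, 4.4276, 4.9919, 5.5265]


def exponential_cdf(x, mean):
    """CDF of the exponential distribution with the given mean."""
    return 1.0 - math.exp(-x / mean)


def ks_statistics(ordered, cdf):
    """D+, D- and D using the slide's (j - 0.5)/n plotting positions."""
    n = len(ordered)
    d_plus = max((j - 0.5) / n - cdf(y)
                 for j, y in enumerate(ordered, start=1))
    # D- is skipped for j = 1, where (j - 1 - 0.5)/n would be negative.
    d_minus = max(cdf(y) - (j - 1 - 0.5) / n
                  for j, y in enumerate(ordered, start=1) if j >= 2)
    return d_plus, d_minus, max(d_plus, d_minus)


mean_hat = sum(order_stats) / len(order_stats)  # assumed estimator: sample mean
d_plus, d_minus, d = ks_statistics(order_stats, lambda y: exponential_cdf(y, mean_hat))
print(f"D+ = {d_plus:.2f}, D- = {d_minus:.2f}, D = {d:.2f}")
```

The largest deviation occurs at j = 4, matching the 0.29 entry in the D- column.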
Comparing the Two Tests
• The Chi-Squared Test
– Not just a maximum deviation, but a sum of squared
deviations
– Uses more of the information in the data
– So it needs more data to be accurate
– Is more accurate if it has enough data
• The Kolmogorov-Smirnov Test
– Just a maximum deviation
– Needs less data to be accurate
– Is less accurate with more data
Empirical Distribution
• “Fit” Empirical distribution (continuous or
discrete): Fit/Empirical
– Can interpret results as a Discrete or Continuous
distribution
• Discrete: get pairs (Cumulative Probability, Value)
• Continuous: Arena will linearly interpolate within the data range
according to these pairs (so you can never generate values outside
the range, which might be good or bad)
– Empirical distribution can be used when “theoretical”
distributions fit poorly, or intentionally
– When sampling from the empirical distribution, you are just
re-sampling from the data
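The two interpretations of an empirical distribution can be sketched as follows. This is an illustration of the idea only, not Arena's actual algorithm: the way cumulative probabilities are assigned to the sorted values is an assumption, and the function names and data are hypothetical.

```python
import bisect
import random


def empirical_pairs(data):
    """(cumulative probability, value) pairs running from 0 at the min to 1 at the max."""
    ordered = sorted(data)
    n = len(ordered)
    return [(i / (n - 1), v) for i, v in enumerate(ordered)]


def sample_continuous(pairs, u):
    """Linearly interpolate between the pairs for a uniform draw u in [0, 1)."""
    probs = [p for p, _ in pairs]
    idx = bisect.bisect_right(probs, u)
    if idx >= len(pairs):
        return pairs[-1][1]
    p0, v0 = pairs[idx - 1]
    p1, v1 = pairs[idx]
    return v0 + (u - p0) / (p1 - p0) * (v1 - v0)


def sample_discrete(data, rng):
    """Discrete interpretation: simply re-sample one of the observed values."""
    return rng.choice(data)


data = [1.0, 2.0, 4.0]          # hypothetical observations
pairs = empirical_pairs(data)
rng = random.Random(5)

continuous_draws = [sample_continuous(pairs, rng.random()) for _ in range(1000)]
discrete_draws = [sample_discrete(data, rng) for _ in range(1000)]
print(min(continuous_draws), max(continuous_draws))
```

The continuous draws never fall outside the observed range [1.0, 4.0], which is the limitation noted in the bullet above.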
No Data?
• Happens more often than you’d like
• No good solution; some (bad) options:
– Interview “experts”
• Min, Max: Uniform
• Avg., % error or absolute error: Uniform
• Min, Mode, Max: Triangular
– Mode can be different from Mean, which allows asymmetry
– Interarrivals - independent, stationary
• Exponential - still need some value for mean
– Number of “random” events in an interval: Poisson
– Sum of independent “pieces”: normal
– Product of independent “pieces”: lognormal
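The expert-based options can be turned into samplers with Python's standard `random` module. The numeric estimates below are invented for illustration, and the wrapper names are hypothetical.

```python
import random


def uniform_from_min_max(lo, hi, rng):
    """Expert gives min and max: sample uniformly between them."""
    return rng.uniform(lo, hi)


def uniform_from_avg_pct_error(avg, pct_error, rng):
    """Expert gives an average and a +/- percentage error: uniform around the average."""
    half_width = avg * pct_error / 100.0
    return rng.uniform(avg - half_width, avg + half_width)


def triangular_from_min_mode_max(lo, mode, hi, rng):
    """Expert gives min, mode and max: triangular (note the argument order)."""
    return rng.triangular(lo, hi, mode)


rng = random.Random(11)
# Hypothetical expert estimates for a service time in minutes.
tri = [triangular_from_min_mode_max(2.0, 3.0, 10.0, rng) for _ in range(20000)]
uni = [uniform_from_avg_pct_error(5.0, 20.0, rng) for _ in range(20000)]

tri_mean = sum(tri) / len(tri)
print(f"triangular mean = {tri_mean:.2f} (mode is 3.0)")
```

With min 2, mode 3 and max 10 the triangular mean is (2 + 3 + 10)/3 = 5, well above the mode, which shows the asymmetry that a uniform distribution cannot express.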
Multivariate and Correlated Input Data
• Usually we assume that all generated random
observations across a simulation are
independent (though from possibly different
distributions)
• Sometimes this isn’t true:
– If a clerk starts to get long jobs, they may get tired and slow
down
– A “difficult” part requires long processing in both the Prep
and Sealer operations
• Ignoring such relations can invalidate model
Checking for Auto-Correlation
• Suppose we have a series of inter-arrival times
– What is the relationship between the j-th observation and
the (j-1)st?
– What is the relationship between the j-th observation and
the (j-2)nd?
• We are talking about auto-correlation as the
series is correlated with itself
• How many steps back we are looking is called the
lag
Auto-Correlation

Lag                  1          2          3          4          5
Auto-Correlation     0.161689   -0.11597   -0.1101    0.020206   0.140538

Data series (lag 0), observations 1 to 20:

 j   Value        j   Value
 1   7.636883    11   3.772148
 2   0.62654     12   4.628748
 3   10.54177    13   7.916579
 4   4.25373     14   0.133024
 5   6.015199    15   0.264536
 6   0.879388    16   1.836931
 7   0.728755    17   7.046523
 8   1.144225    18   8.356191
 9   0.409323    19   6.451392
10   0.953624    20   3.768201

The lag 1 to lag 5 columns are the same series shifted by 1 to 5 observations (19, 18, 17, 16, and 15 values respectively).

[Figure: Autocorrelation (vertical axis, -1 to 1) against Lag (horizontal axis, 1 to 5)]

Standard deviation of auto-correlation estimate is
$$\frac{1}{\sqrt{\text{number of observations}}}$$
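The lagged-column layout suggests computing each auto-correlation as the correlation between the series and a shifted copy of itself. The sketch below does this with the 20 observations from the slide; the Pearson-correlation estimator is an assumption, so the results may differ from the tabulated values if a different estimator was used there.

```python
import math

series = [7.636883, 0.62654, 10.54177, 4.25373, 6.015199,
          0.879388, 0.728755, 1.144225, 0.409323, 0.953624,
          3.772148, 4.628748, 7.916579, 0.133024, 0.264536,
          1.836931, 7.046523, 8.356191, 6.451392, 3.768201]


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lag_autocorrelation(data, lag):
    """Correlation between observation j and observation j - lag."""
    return pearson(data[lag:], data[:-lag])


autocorrelations = {lag: lag_autocorrelation(series, lag) for lag in range(1, 6)}
std_dev = 1.0 / math.sqrt(len(series))   # approximate std. deviation of each estimate

for lag, r in autocorrelations.items():
    flag = "large" if abs(r) > 2 * std_dev else "within 2 std. dev."
    print(f"lag {lag}: {r:+.4f}  ({flag})")
```

Comparing each estimate with about two standard deviations (here 2/sqrt(20), roughly 0.45) is a common rule of thumb for deciding whether a lag shows real correlation.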
Time Series Models
• If the auto-correlation calculations show a correlation, then you may have to use a time-series model
• Such models include auto-regressive models and moving-average models
• Using the auto-correlation and another concept
called the partial auto-correlation, you can fit
these models
• The details are too much for this course
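For a flavour of what such a model looks like, the sketch below simulates a first-order auto-regressive series, in which each value is pulled toward the previous one. The parameter values are arbitrary, the function names are hypothetical, and fitting the model to real data is not shown. Note that this simple Gaussian form can produce negative values, so it would need adapting before use for times.

```python
import random


def simulate_ar1(n, mean, phi, noise_sd, rng):
    """AR(1): x_t = mean + phi * (x_{t-1} - mean) + Gaussian noise."""
    x = mean
    out = []
    for _ in range(n):
        x = mean + phi * (x - mean) + rng.gauss(0.0, noise_sd)
        out.append(x)
    return out


def lag1_autocorrelation(data):
    """Sample lag-1 auto-correlation around the overall mean."""
    n = len(data)
    m = sum(data) / n
    num = sum((data[t] - m) * (data[t - 1] - m) for t in range(1, n))
    den = sum((x - m) ** 2 for x in data)
    return num / den


rng = random.Random(7)
ar_series = simulate_ar1(5000, mean=10.0, phi=0.7, noise_sd=1.0, rng=rng)
r1 = lag1_autocorrelation(ar_series)
print(f"lag-1 auto-correlation of simulated series: {r1:.3f}")
```

For a long simulated series the lag-1 auto-correlation comes out close to phi, which is what makes the auto-correlation estimates useful for choosing model parameters.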
Multivariate Input Data
• A “difficult” part requires long processing in both
the Prep and Sealer operations
– The service times at the Prep and Sealer areas would be
correlated
– Some multivariate models are quite easy, for instance the
multivariate normal model
– You can also use the multiplication rule, to specify the
marginal distribution of one time and then specify the other
time conditional on the first time
$$f_{X,Y}(x, y) = f_{X|Y}(x \mid y)\, f_Y(y)$$
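The multiplication rule translates into a two-step sampler: draw the first time from its marginal distribution, then draw the second time from a distribution that depends on the first. The distributions and parameters below are invented for illustration and are not taken from the course model.

```python
import math
import random


def sample_prep_and_sealer(rng, prep_mean=5.0, coupling=0.8, extra_mean=1.0):
    """Draw (prep, sealer): prep from its marginal, sealer conditional on prep."""
    prep = rng.expovariate(1.0 / prep_mean)                       # Y ~ f_Y
    sealer = coupling * prep + rng.expovariate(1.0 / extra_mean)  # X | Y = y
    return prep, sealer


def correlation(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(13)
times = [sample_prep_and_sealer(rng) for _ in range(2000)]
rho = correlation(times)
print(f"correlation between Prep and Sealer times: {rho:.2f}")
```

A "difficult" part with a long Prep time then also tends to get a long Sealer time, which independent sampling of the two times would miss.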