Download Materials discussed during the two review sessions

Document related concepts

Regression analysis wikipedia , lookup

Inverse problem wikipedia , lookup

Pattern recognition wikipedia , lookup

Confidence interval wikipedia , lookup

Data assimilation wikipedia , lookup

Birthday problem wikipedia , lookup

Generalized linear model wikipedia , lookup

Transcript
Review Session
Jehan-François Pâris
Agenda
Statistical Analysis of Outputs
 Operational Analysis
 Case Studies
 Linear Regression

How to use this presentation

Most problems have
 One
slide stating the problem
 One slide explaining how to solve the problem
 One slide allowing you to check your answer
You will learn more by trying first to do the
problems on your own than by reading
their solutions
 Do not forget either to review the problems
in the original notes

Statistical Analysis
of Outputs
The big picture

The problems
 Constructing
confidence intervals
 Handling auto correlated data

The tools
 Central-Limit
Theorem
 Wilson’s formula
 Batch means (and regeneration)
 RNG tricks
Confidence Intervals

Distinguish between
 CIs
for means
 CSIM does it for you
 CIs for proportions
 We are on our own

Major issue is independence of data points
 CSIM
uses batch means
Central Limit Theorem

If the n mutually independent random
variables x1, x2, …, xn have the same
distribution, and if their mean m and their
variance s2 exist then …
Central Limit Theorem

The random variable
1 n
xi  m

n i 1
s
n
is distributed according to the standard normal
distribution (zero mean and unit variance).
CI for means (I)

For large values of n, the (1-)%
confidence interval for m is given by

z / 2s
z / 2s 
,x 
x 

n
n 


with
F ( z / 2 )  1 

2
CI for means (II)

F(z) is taken from a table of the normal
distribution
 F(0.025)

For smaller values of n, we have to use
Student’s t random variable
 Wider

= 1.96
CIs
We replace s by the sample standard
deviation s
Example

We have
 100
observations for the waiting time
 xbar = 4.25 minutes
 s2 = 25
Example

We have
 100
observations for the waiting time
 xbar = 4.25 minutes
 s2 = 25

Answer is
 4.25
± 1.96 sqrt(25/100) = 4.25 ± 0.98
CI for proportions

A proportion represents the probability
P(X) for some fixed threshold 
 97%
of our customers have to wait less than
one minute

Distributed according to a binomial law
 Use
Wilson’s formula
Wilson’s formula

When n > 29, we can use the Wilson’s
interval

z2 / 2
z2 / 2
qˆ (1  qˆ ) z2 / 2
qˆ (1  qˆ ) z2 / 2 
 qˆ 

 z / 2

qˆ 
 z / 2

2
2
2
n
n
2
n
n
4n  q 
4n   1  
P


z2 / 2
z2 / 2
1
1


n
n


where z/2 = 1.96 for a 95% C.I.
Example

We have want to estimate the proportion
of packets that wait more than four slots
 400
observations
 40 packets waited more than four slots
Answer

Divisor:
1

+ 1.962/400  1.01 (instead of 1.0096)
Central term
 0.1

+ 1.962/(2×400)  0.105 (instead of 1.048)
Half width
(0.1×0.9)/400 + 1.962/(2×4002) )
 sqrt (0.09/400 + (4/800)/400)
 1/20 sqrt (0.09 +0.0025)  0.3/20 = 0.015
 sqrt(

Result is

(0.105 ± 0.015)/ 1.01 = 0.104 ± 0.015
Batch means (I)

Simulation data are often autocorrelated
 Packet
delays in ALOHA
 Waiting times in queues
…

Batch means reduce (but do not
completely eliminate) that effect
Batch means (II)
Group measurements into fixed-size
batches of consecutive data
 Compute mean of each batch
 If batches are large enough, these means
will be independent

 Can

use standard-limit theorem, …
In case of doubt, compute autocorrelation
function for successive batch means
Regeneration (I)

The idea
 Partition
simulation data into intervals such
that
 Data measured inside the same interval
might be correlated
 Data measured in different intervals are
independent
Regeneration (II)

How?
 System
goes to a regeneration point each
time
 Its queues become empty
 All the disk drives are operational
…
 Criterion is system specific
Streams

When you want to evaluate two different
configurations of a system, it is often good
idea to use separate random number
streams for arrivals and service times
 Arrival
times remain unchanged when we
change other parameters of the system
Operational Analysis
Single server (I)

We can measure
T
the length of the observation period
 A the number of arrivals during the
observation period
 B the total amount of busy times during the
observation period
 C the number of completions during the
observation period
Single server (II)

We can compute
l
= A/T
 X = C/T
 U = B/T
 S = B/C

There are two ways to compute U
U

the arrival rate
the output rate
the utilization
the mean service time
= B/T = (C/T )(B/C) = XS
In general A  C and l  X
Little’s law


If W is the total time spent by all tasks
inside the system over the observation
period, then
N
= W/T
R
= W/C
Since W/T = (C/T)(W/C) = XR, N = XR
This is important
A problem

An ice-cream parlor
 Observed
during 6 hours
 Visited by 120 customers
 Spend an average of 24 minutes inside

What is the average number of customers
inside the parlor?
Answer

We compute X and apply Little’s Law
Answer

We compute X and apply Little’s Law
X
= 120/6 = 20 customers/hour
 R = 24 minutes = 0.4 hours
 N = XR = 8 customers
If you did not get it

The 120 customers sent a total of 120×24
customer×minutes or 48 customer×hours
in the parlor
 48

customer×hours/6 hours = 8 customers
Same as having 8 customers spending six
hours each inside the parlor
Network of servers (I)
Open network
Arrivals
Departures
Network of servers (II)
Closed network
Arrivals
Departures
Operational Quantities

Keep same quantities as before but add
indices
0
for whole system
 k for individual servers

Two changes
 We
never care about the utilization of the
whole system
 We add number of visits Vk of each server
Operational quantities

Over the observation period, we measure
C
= the number of job completions
 Ck = the number of tasks completed by
device k

We define
 X0
= C/T = the system throughput
 Xk = Ck/T = the output rate at server k
 Vk = Ck/C = the visit count at server k
Important relationships

Ck = VkC
 Since
each job requires Vk visits, there are Vk
more server completions than job completions

Xk = Vk X0
 Same
property applies to throughputs
ni
System response time (I)

We define

Nbar = average number of jobs in the
system
 nbari = average number of jobs at device i

Nbar = Σi nbari
System response time (II)


Applying Little’s law, we have
R = Nbar/X0
and
nbari = RiXi = RiViX0
Hence
R = Σi ViRi
Note

This result is trivial
 The
total time spent by a job in the system is
the sum of the times spent at each server
 This includes the time spent waiting in the
server queues
Problem 1

A job requires
 100
ms of CPU time
 9 disk accesses
Each disk access takes 7 ms
 We want

 VCPU
and SCPU
Answer

We now that jobs get CPU first and last
 VCPU

= 10
Then
 SCPU
= 100/10 =10s
Bottleneck analysis (I)
A system has one CPU and one disk drive
 It processes transactions such that


 VCPU =
12 and SCPU = 5ms
 VDisk =
11 and SDISK = 8ms
What is the maximum system throughput?
Bottleneck analysis (II)
We compute first the maximum device
throughputs
 Maximum XCPU = 1/0.005 = 200 requests/s


Maximum Xdisk = 1/0.008 = 125 requests/s

Since Xi = Vi X0
 Maximum
throughput compatible with CPU
workload is 200/12 = 16.7 transactions/s
 Maximum throughput compatible with disk
workload is 125/11 = 11.4 transactions/s
Bottleneck analysis (III)

The disk is this the bottleneck
 It
has highest ViSi product
 Identifying feature of any bottleneck device

Increasing the system throughput might
require
 Sharing
disk requests with a second disk
 Increasing the efficiency of the system I/O
buffer
Problem 2
In the previous example, which device was
the bottleneck?
 What would be the throughput of the
system if the bottleneck utilization was
80%?

Answer

We compare
 VCPUSCPU
 VdiskSdisk
Answer

We compare
 VCPUSCPU
= 100ms
 VdiskSdisk = 9×7 = 63 ms

The CPU is the bottleneck
Answer

If the bottleneck was operating at 100%
utilization,
 It
could process one job each VCPUSCPU time
units
 Or 1/(VCPUSCPU) job per time unit

At UCPU utilization,
 It
will process UCPU/(VCPUSCPU) job per time
unit
Answer

X0 = UCPU/(VCPUSCPU) = 0.80/0.10 seconds
8
jobs/second
Systems with terminals
Whole
system
M Terminals
Interactive response time
formula

We have

M terminals
 Think time Z between the completion of a
job and the submission of the next job

Applying Little’s law to the whole system
M = (R + Z ) X0
then
R = M/X0 – Z
Very
Important
Problem 3

We have
M
= 50 users
 Z = 20 s
 X0 = 2 transactions/s

What is the system response time?
Answer

We apply R = M/X0 – Z
Answer

We apply R = M/X0 – Z and obtain
R = 50/2 – 20 = 5 seconds
Problem 4

A system
 Processes
5 transactions/seconds
 Has 60 users
 Achieves a response time of 4 seconds

What is the think time?
Answer

We apply R = M/X0 – Z,

Z = M/X0 – R
Answer

We apply R = M/X0 – Z,

Z = M/X0 – R = 60/5 – 4 = 8 seconds
Problem 5

We have
M
= 50 users
 Z = 20 s
R = 4 s

What is the system throughput?
Answer

From R = M/X0 – Z, we have
X0 = (R + Z)/M
Hence X0 = (20 + 4)/50 = 0.48 tasks/s
Problem 6

A system
 Can
process up to 4 transactions/second
 Has 60 users
 User think time is 12 seconds

Can the system achieve a response time
of 2 seconds?
Answer

Applying R = M/X0 – Z, we compute a
lower bound for the response time

Rmin = M/X0,max – Z
Answer

Applying R = M/X0 – Z, we compute a
lower bound for the response time


Rmin = M/X0,max – Z = 60/4 – 12 = 3 seconds
Answer is no
Problem 7

Compute the response time of a system
knowing the following parameters
M
= 50 users
Z
= 15 s
 VCPU
SCPU = 200ms
 UCPU
= 50%
Answer

Since Xk = Uk /Sk and Xk = VkX0,
X0 = Uk /(VkSk)

The response time is then given by
R = M/X0 – Z
Answer

Let us compute first the throughput X0

Applying X0 = Uk/(VkSk)
X0 = 0.50/0.200 = 2.5 interactions/s

The response time is then
R = M/X0 – Z = 50/2.5 – 15 = 5 s
Simulation
Case Studies
A simple reminder

If interarrival times are
Independent identically distributed
(i. i. d.)
According to an exponential law
then the probability of having exactly n
arrivals during a fixed interval is distributed
according to a Poisson law
Explanation (II)

Assume that
 The
probability of one arrival during a small
interval Dt is lDt
 The probability of two arrivals during the same
small time interval is negligible
lDt
lDt
lDt
lDt
lDt
lDt
Explanation (I)

The probability of having exactly k arrivals
during n slots is
n
k
nk
 (lDt ) (1  lDt )
k 

What would happen if the number of time
intervals goes to infinity while their total
duration T = nDt remains constant
Explanation (III)

We rewrite the previous expression as
n!
lT k
lT n  k
( ) (1 
) 
k!(n  k )! n
n
n!
(lT )
lT n
lT  k
(
1

)
(
1

)
k
n (n  k )! k!
n
n
k
and compute separately the limits of its
four factors
Explanation (IV)
lim n
n!
n(n  1)...( n  k  1)

1
k
k
n (n  k )!
n
(lT )
remains unchanged
k!
lT n
 lT
lim n (1 
) e
n
lT  k
lim n (1 
) 1
n
k
Explanation (V)

We obtain the Poisson distribution
( l T )  lT
e
k!
k

The probability that there are no arrivals
in the same time interval T (or in any time
interval T) is
e
 lT
Explanation (VI)


This last expression is the probability that the
time interval between two consecutive arrivals
is greater than T
The probability that the time interval between
two consecutive arrivals is equal or lesser
than T is
1 e
 lT
which is the cdf of the exponential distribution
A final observation

Use the Poisson distribution to generate
number of arrivals during a time interval

Use the exponential distribution to
generate interarrival times
Linear Regression
Most important point

Compute a regression line

Compute regression coefficient
Example
x[i]
y[i]
1.5
3
2.4
4.8
3.5
9
4.4
11
5.1 13.7
Linear Regression

We have
 one
independent variable
 One dependent variable

We must find
Y =  + bX
 minimizing the sum of squares of errors
Si (yi -  - bxi)2
Formulas
b
 x  y 
n x   x 
ni xi yi 
i
i
2
2
i
  y  bx
i
i
i
i
i
Calculations (I)
Sum
Average
Squared Sum
x[i]
y[i]
x[i]^2 x[i]y[i]
1.5
3 2.25
4.5
2.4
4.8 5.76 11.52
3.5
9 12.25 31.5
4.4
11 19.36 48.4
5.1 13.7 26.01 69.87
16.9 41.5 65.63 165.8
3.38
8.3
285.6
Calculations (II)
Sum
Average
Squared Sum
n
beta
alpha
x[i]
y[i]
x[i]^2 x[i]y[i]
16.9 41.5 65.63 165.8
3.38
8.3
285.6
5
3
-1.84
Outcome
14
12
10
8
6
4
2
1
2
3
4
5
6
More notations
S xx


 (x  x)
1
 x  n  x 
 ( x  x ) ( y  y )
2
i
i
2
2
i
i
i
i
S xy

S yy
1
 i xi yi  i x i i y i
n
2
 i ( yi  y )
1
2
2
 i yi  i y i
n
i
i


i
i



More notations (II)

Solution can be rewritten
b
S xy
S xx
  y  bx
Coefficient of correlation
r


S xy
S xx S yy
bS xx
S xx
b
S yy
S xx S yy
r = 1 would indicate a perfect fit
 r = 0 would indicate no linear dependency

Calculations
Sum
Squared Sum
n
Sxx
Syy
Sxy
r
x[i]
y[i]
x[i]^2 y[i]^2 x[i]y[i]
16.9 41.5 65.63 421.7 165.8
285.6 1722
5
8.508
77.28
25.52
0.995