8  Continuous Random Variables and Random Walk
So far we discussed probability in discrete spaces, because it is easiest to introduce probabilistic concepts by relating them to enumerations in discrete sets. But many real-life experiments return a continuous quantity as the result of the observation, which cannot be adequately described in the discrete setup. Moreover, as the emergence of the normal distribution shows, the limit object of discrete models sometimes naturally leads to a continuous model.
A continuous random variable X is a function from a probability space Ω to R taking on more than finitely or countably many values.
For example the exact height of a randomly chosen person takes on any real value between two reasonable limits.
The formal definition of a discrete and a continuous random variable is the same, but a difference appears when we try to describe
its distribution. A discrete random variable takes on values from a discrete subset {x1 , x2 , . . .} of R (say, integers), thus it makes sense
to talk about the probability P(X = x_i) that the random variable takes the value x_i. It may happen that some of these probabilities are zero, but there will be finitely (or at most countably) many values where P(X = x_i) > 0, and their total probability adds up to 1:

    ∑_{i=1}^∞ P(X = x_i) = 1.
This was the basis to define the mass function f (xi ) = P (X = xi ). We cannot do the same for a continuous random variable, since
if the possible values x of X genuinely run through all real numbers (or just an interval), then the probability that X is exactly a fixed
given number x is zero:
    P(X = x) = 0,  for any x

(just think about it: the chance that a person is exactly 1.82 m tall is zero).
It is therefore meaningless to ask for an exact value, but it does make sense to ask what is the probability that X is below a certain threshold t ∈ R, i.e. we can define

    F(t) := P(X ≤ t),  t ∈ R,
which is called the distribution function of X. Clearly F(t) is monotonically increasing and

    lim_{t→−∞} F(t) = 0,   lim_{t→∞} F(t) = 1;

moreover, any function on R that increases monotonically from 0 to 1 gives rise to a continuous distribution.
The probability that X takes on values between two fixed numbers a, b is given by

    P(a < X ≤ b) = F(b) − F(a).
Very often the distribution can also be characterized by its density function. This is the case if there exists a non-negative function f : R → R₊ such that ∫_{−∞}^∞ f(x) dx = 1 and

    P(a < X ≤ b) = ∫_a^b f(x) dx.
More generally, for any set A ⊂ R we have

    P(X ∈ A) = ∫_A f(x) dx,    (15)
where the right hand side is the integral of f over the set A (if you have not seen the concept of integration on a general set, then think of A as consisting of many intervals). Clearly the density function and the distribution function are equivalent characterizations, since f(t) = F′(t) by the fundamental theorem of calculus. (We remark, however, that not every continuous distribution has a density function, as not every monotone function F is differentiable.)
While every random variable has a distribution function, which thus seems to be the more general concept, many formulas are much nicer and more intuitive when expressed via the density function. So whenever it is available, the density function is preferred over the distribution function.
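The relation f(t) = F′(t) is easy to test numerically. Here is a small Python sketch using a concrete pair (F, f); the exponential distribution function and the rate 1.5 are arbitrary choices for illustration:

```python
import math

# Sketch: check numerically that the density is the derivative of the
# distribution function, f(t) = F'(t), for one concrete example.
lam = 1.5                                   # rate parameter (arbitrary choice)
F = lambda t: 1 - math.exp(-lam * t)        # a distribution function
f = lambda t: lam * math.exp(-lam * t)      # its density

h = 1e-6
for t in [0.1, 0.5, 1.0, 2.0]:
    # central finite difference approximates F'(t)
    numeric_deriv = (F(t + h) - F(t - h)) / (2 * h)
    assert abs(numeric_deriv - f(t)) < 1e-5
```

Any differentiable distribution function would do here; only the concrete formulas are specific to this example.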
Expectation and variance for the continuous case  In the discrete case, the expectation was defined by

    E[X] = ∑_x x · P(X = x),

but this summation is meaningless for all real x's. The correct continuous analogue is integration, so we define

    E[X] := ∫_R x f(x) dx
for any random variable with a density f(x). If a nonnegative random variable X ≥ 0 has no density, only a distribution function F(t), then we can still compute its expectation as

    E[X] = ∫_0^∞ [1 − F(t)] dt = ∫_0^∞ P(X > t) dt.    (16)
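Formula (16) can be checked numerically for a concrete case. A Python sketch for the exponential distribution, whose tail is P(X > t) = e^{−λt} and whose expectation is 1/λ (both facts appear later in this section); the rate and the truncation point are arbitrary choices:

```python
import math

# Sketch: check (16) for an exponential random variable: integrating the
# tail probability P(X > t) = exp(-lam*t) should give E[X] = 1/lam.
lam = 2.0
dt = 1e-4
T = 20.0   # truncation point; the tail beyond T is negligible here
tail_integral = sum(math.exp(-lam * (i * dt)) * dt for i in range(int(T / dt)))
assert abs(tail_integral - 1 / lam) < 1e-3
```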
To see that these two definitions are equivalent, notice that

    1 − F(t) = ∫_t^∞ f(x) dx,

therefore

    ∫_0^∞ [1 − F(t)] dt = ∫_0^∞ ∫_t^∞ f(x) dx dt = ∫_0^∞ ( ∫_0^x dt ) f(x) dx = ∫_0^∞ x f(x) dx,

where we interchanged the two integrations (and adjusted the limits!).
Next, we compute the expectation of any function g(X) of a random variable X. We claim that

    E[g(X)] = ∫_R g(x) f(x) dx    (17)

if X has a density f(x). To see this, we may restrict the discussion to the case g ≥ 0 (otherwise split the function g into its positive and negative parts and repeat the argument for both separately). Using (16), we have

    E[g(X)] = ∫_0^∞ P(g(X) > t) dt = ∫_0^∞ ∫_{x : g(x) > t} f(x) dx dt = ∫_{x : g(x) > 0} ( ∫_0^{g(x)} dt ) f(x) dx = ∫_R g(x) f(x) dx,

where we used (15) in the second line.
Finally, the variance of a continuous random variable X is defined exactly in the same way as in the discrete setup:

    V(X) := E[X − µ]² = E[X²] − µ²,

where µ = E[X] is the expectation. Using (17), we can also write

    V(X) = ∫_R (x − µ)² f(x) dx,   µ = ∫_R x f(x) dx.
Some important continuous random variables  The standard Gaussian (= normal) distribution has already been introduced via its density function in (13). It is easy to check that it has zero expectation and variance one, i.e.

    ∫_R x · (1/√(2π)) e^{−x²/2} dx = 0,   ∫_R x² · (1/√(2π)) e^{−x²/2} dx = 1.

Check this calculation! [Hint: for the first integral, think before you compute; notice that the integrand is an antisymmetric function... For the second integral you might use an integral table, Mathematica or the internet, but the intrinsic proof uses the fact that (d/dx) e^{−x²/2} = −x e^{−x²/2} and a smart integration by parts.]
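If you prefer a numerical check before doing the calculation, a Riemann sum over a large interval already confirms both moments (a Python sketch; the interval [−10, 10] and the step size are arbitrary choices, the tails outside are negligible):

```python
import math

# Sketch: numerically verify that the standard normal density has
# expectation 0 and second moment (= variance) 1.
phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
dx = 1e-3
xs = [-10 + i * dx for i in range(int(20 / dx))]
mean = sum(x * phi(x) * dx for x in xs)
second_moment = sum(x * x * phi(x) * dx for x in xs)
assert abs(mean) < 1e-6
assert abs(second_moment - 1) < 1e-3
```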
By shift and scaling, we can easily construct a normal random variable with any given expectation µ and variance σ². Clearly, if X is a standard normal variable, then Y := σX + µ has expectation µ and variance σ² [WHY?]. The density function of Y is given by

    f_{σ,µ}(y) = (1/(σ√(2π))) e^{−(y−µ)²/(2σ²)},

since

    P(Y ≤ t) = P(X ≤ (t − µ)/σ) = ∫_{−∞}^{(t−µ)/σ} (1/√(2π)) e^{−x²/2} dx = ∫_{−∞}^t (1/(σ√(2π))) e^{−(y−µ)²/(2σ²)} dy

after the change of variables y = σx + µ. The more general form of the central limit theorem asserts that normalized sums of independent, identically distributed random variables with mean µ and variance σ² converge to the normal distribution f_{σ,µ}. This fact justifies the central role of the Gaussian distribution.
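The central limit theorem is easy to see in simulation. A Python sketch summing uniform [0, 1] variables (which have mean 1/2 and variance 1/12, as computed below for the uniform distribution); the number of summands, the number of trials, and the seed are arbitrary choices:

```python
import math, random

# Sketch: sums of independent uniform [0,1] variables, centered and
# scaled, should look approximately standard normal.
random.seed(0)
n, trials = 48, 20000
mu, var = 0.5, 1 / 12          # mean and variance of uniform [0,1]
samples = []
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    samples.append((s - n * mu) / math.sqrt(n * var))   # normalize

# For a standard normal, P(|Z| <= 1) is about 0.6827.
frac = sum(1 for z in samples if abs(z) <= 1) / trials
assert abs(frac - 0.6827) < 0.02
```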
In some sense, the simplest continuous distribution is the uniform distribution on some finite interval [a, b]. The density function is

    f(x) = 1/(b − a) if a ≤ x ≤ b, and 0 otherwise,

i.e. this distribution "assigns equal weight" to any number in [a, b]. As an exercise, prove that the expectation and the variance of a uniform random variable X on [a, b] are given by

    E[X] = (a + b)/2,   Var[X] = (b − a)²/12.

This clearly shows that the expectation expresses the "middle" of the distribution, while the variance gives its "width".
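Before doing the exercise analytically, you can convince yourself by Monte Carlo (a Python sketch; the interval [2, 7], sample size and seed are arbitrary choices):

```python
import random

# Sketch: Monte Carlo check of E[X] = (a+b)/2 and Var[X] = (b-a)^2/12
# for the uniform distribution on [a, b].
random.seed(1)
a, b, n = 2.0, 7.0, 200000
xs = [random.uniform(a, b) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
assert abs(mean - (a + b) / 2) < 0.02       # should be close to 4.5
assert abs(var - (b - a) ** 2 / 12) < 0.05  # should be close to 25/12
```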
The uniform distribution can be used to model quantities which are naturally confined between two values, but there is no specific
reason to favor one intermediate value over another one. For example, if your (trusted) friend tells you that he visits you between 7 and
8pm, then it is reasonable to model his arrival time by a uniform distribution on [7, 8].
The third important continuous distribution is the exponential distribution with parameter λ > 0. It takes on nonnegative values and its density function is given by

    f(x) = λ e^{−λx} if x ≥ 0, and 0 otherwise.    (18)
The computation of the expectation and variance is again left as an exercise; here is the result:

    E[X] = 1/λ,   V(X) = 1/λ²

for a random variable distributed by (18).
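Again, a quick Monte Carlo check of these two formulas (a Python sketch; the rate λ = 0.5, sample size and seed are arbitrary choices):

```python
import random

# Sketch: Monte Carlo check of E[X] = 1/lam and V(X) = 1/lam^2
# for the exponential distribution with rate lam.
random.seed(2)
lam, n = 0.5, 400000
xs = [random.expovariate(lam) for _ in range(n)]
mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
assert abs(mean - 1 / lam) < 0.02       # should be close to 2
assert abs(var - 1 / lam ** 2) < 0.1    # should be close to 4
```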
The exponential distribution often arises as the distribution of the amount of time from now until some specific event occurs, if we do not have a natural upper bound on this time. For example, the time until the next earthquake, the time until you receive your next spam email, etc. are well approximated by exponential distributions. The main reason is its intimate connection with the Poisson distribution, as we now explain.
We may assume that earthquakes happen independently with a certain intensity λ > 0. This means that up to time t, the expected number of earthquakes is λt (this is obviously proportional to t, and λ is the intensity parameter). Since earthquakes are rare events, their appearance can be modeled by a Poisson random variable. So if N(t) denotes the number of earthquakes up to time t, then it is distributed by a Poisson random variable with parameter λt (recall (14)), i.e.

    P(N(t) = k) = e^{−λt} (λt)^k / k!
is the probability that there are k earthquakes up to time t. In particular, the time T of the first earthquake (from now) has distribution function

    P(T ≤ t) = 1 − P(T > t) = 1 − P(N(t) = 0) = 1 − e^{−λt},

which is exactly the distribution function (integral) of (18).
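The "rare events" picture can be simulated directly: chop time into small steps of length dt and let an event occur in each step independently with probability λ·dt. The time of the first event should then follow (18). A Python sketch; λ, dt, the number of trials and the seed are arbitrary choices:

```python
import math, random

# Sketch: in each time step of length dt an event occurs with
# probability lam*dt; the waiting time until the first event should be
# approximately exponential, P(T <= t) = 1 - exp(-lam*t).
random.seed(3)
lam, dt, trials = 1.0, 0.01, 20000
first_times = []
for _ in range(trials):
    t = 0.0
    while random.random() >= lam * dt:   # no event in this step
        t += dt
    first_times.append(t)

t0 = 1.0   # compare the distribution functions at one sample point
empirical = sum(1 for t in first_times if t <= t0) / trials
assert abs(empirical - (1 - math.exp(-lam * t0))) < 0.025
```

The small discrepancy that remains comes from the discretization; it vanishes as dt → 0.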
Apart from learning a few basic distributions and their typical appearance, notice another important lesson from these arguments. If
we want to model a random event (like waiting times or number of earthquakes), it is very useful to have a good idea which family
of distributions is reasonable to consider as an Ansatz. The precise parameter of the distribution can then be estimated by statistical
tests. In the example above, we came up with a very explicit model for the number of earthquakes. We used theoretical arguments
(”rare events”) to justify that the distribution is Poisson. The only open issue is to identify the parameter, but this could be done just by
considering its expectation value (which can be done by looking at statistical data for the number of earthquakes in the past).
Estimating the parameter(s) of an unknown distribution when the family of the distribution is known is the realm of parametric statistics. The task of proposing a good model based upon statistics is much harder if we have no clue about the distribution. This is the goal of non-parametric statistics.
The Bernoulli trial process of repeatedly flipping a coin can be interpreted as a random walk on the 1-dimensional grid. We study
related questions and generalize the setting to two and higher dimensions.
Walking on the line.  The 1-dimensional integer grid consists of all integers, Z. For every i ≥ 1, we have a random variable that determines whether the i-th move is a step to the left or a step to the right:

    P(X_i = 1) = P(X_i = −1) = 1/2.
Starting at the origin, the position after n moves is
Sn = X1 + X2 + . . . + Xn ;
see Figure 16. For example, we could repeatedly flip a coin and count head as +1 and tail as −1. With this interpretation, we have S_n = #heads − #tails. The main questions we ask about the random walk are as follows:
1. What is the average distance of the random walker from the origin?
2. What is the probability that after n moves the walker is back at the origin?
3. More generally, what is the probability distribution of Sn ?
4. How often does the random walker return to the origin?
Figure 16: The 1-dimensional integer grid and a path encoding a sequence of ±1 steps of a random walk.
Average distance.  The expected value of X_i is zero, for every i. This implies

    E(S_n) = E(X_1) + E(X_2) + . . . + E(X_n) = 0,

for all n ≥ 0. It is more difficult to compute the expected distance from the origin, but it is easy to compute the expected squared distance:
    E(S_n²) = E(( ∑_{i=1}^n X_i )²) = E( ∑_{j=1}^n ∑_{k=1}^n X_j X_k ) = ∑_{j=1}^n ∑_{k=1}^n E(X_j X_k),

which evaluates to n because E(X_j X_k) = 0, unless j = k, in which case we have E(X_j X_k) = 1. In words, the average squared distance of S_n from the origin is n, suggesting that the average distance is about √n. There are a constant times √n integers at distance at most √n from the origin. We would therefore guess that the probability of the random walker to be back at the origin after n moves is some constant times 1/√n.
Note that E(S_n²) is the variance of S_n, and we have learned that for independent random variables, the variance is additive:

    V(S_n) = V(X_1) + V(X_2) + . . . + V(X_n).

The variance of X_i is 1, for each i, which gives the same result about the average squared distance from the origin.
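The identity E(S_n²) = n is easy to confirm by simulation (a Python sketch; n, the number of trials and the seed are arbitrary choices):

```python
import random

# Sketch: simulate the 1-dimensional random walk and check that the
# average squared distance from the origin after n steps is close to n.
random.seed(4)
n, trials = 100, 20000
total_sq = 0
for _ in range(trials):
    s = sum(random.choice((-1, 1)) for _ in range(n))   # position S_n
    total_sq += s * s
avg_sq = total_sq / trials
assert abs(avg_sq - n) < 5   # E(S_n^2) = n
```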
Probability distribution.  The position of the random walker is odd when n is odd. It follows that it can be at the origin only after an even number of moves. We therefore limit ourselves to even numbers. The probability for the random walker to be at position 2j after 2n moves is

    P(S_{2n} = 2j) = (2n choose n+j) · 2^{−2n} = (2n)! / ((n+j)! (n−j)!) · 2^{−2n}.    (19)
To better understand this expression, it is useful to have a good approximation. Here, we use

    STIRLING'S FORMULA.  n! ∼ √(2π) n^{n+1/2} e^{−n},

where '∼' means that the limit of the ratio of the left over the right hand side, as n goes to infinity, is 1. Proving this formula requires a bit more effort than we are willing to spend, but a weaker estimate is easy to get. Taking the logarithm of the factorial, we get ln n! = ∑_{i=1}^n ln i. Using ∫ ln x dx = x ln x − x, we get

    [x ln x − x]_1^n = n ln n − n + 1,   [x ln x − x]_1^{n+1} = (n + 1) ln(n + 1) − n

as lower and upper bounds of ln n!, and therefore

    n^n · e^{−n+1} ≤ n! ≤ (n + 1)^{n+1} · e^{−n}.
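These elementary bounds can be checked in a few lines (a Python sketch; the test values of n are arbitrary, and we work with logarithms to avoid overflow):

```python
import math

# Sketch: check n^n * e^(-n+1) <= n! <= (n+1)^(n+1) * e^(-n)
# on the logarithmic scale.
for n in [1, 5, 20, 100]:
    log_fact = math.lgamma(n + 1)           # ln(n!)
    lower = n * math.log(n) - n + 1         # ln(n^n e^{-n+1})
    upper = (n + 1) * math.log(n + 1) - n   # ln((n+1)^{n+1} e^{-n})
    assert lower <= log_fact <= upper
```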
Central limit theorem.  We are interested in the limit distribution, when n goes to infinity. Not surprisingly, this is the Gaussian normal distribution, defined as

    f(x) = (1/√(2π)) · e^{−x²/2}.
We can prove this starting from (19), in a sequence of transformations and approximations. To simplify the notation, we write P_{2j} = P(S_{2n} = 2j). Using Stirling's formula,

    P_{2j} ≈ (2n)^{2n+1/2} / (√(2π) (n+j)^{n+j+1/2} (n−j)^{n−j+1/2}) · 2^{−2n}
          = √(n / (π(n² − j²))) · (n² / ((n+j)(n−j)))^n · ((n−j)/(n+j))^j
          = √(n / (π(n² − j²))) · (1 − j²/n²)^{−n} · (1 + j/n)^{−j} · (1 − j/n)^j.

Setting j = 0, we get P(S_{2n} = 0) ∼ 1/√(πn), which is a refinement of the initial estimate we made. To further simplify the expression, we set j = r√n. Then

    P_{2j} = √(1 / (π(n − r²))) · (1 − r²/n)^{−n} · (1 + r/√n)^{−r√n} · (1 − r/√n)^{r√n}
          ∼ (1/√(πn)) · e^{r²} · e^{−r²} · e^{−r²}
          = (1/√(πn)) · e^{−j²/n}.
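Before taking the limit, it is worth comparing this local approximation with the exact binomial probability from (19). A Python sketch; n = 500 and the values of j are arbitrary choices:

```python
import math

# Sketch: compare the exact probability P(S_{2n} = 2j) from (19) with
# the approximation e^(-j^2/n) / sqrt(pi*n).
n = 500
for j in [0, 5, 10, 20]:
    exact = math.comb(2 * n, n + j) * 2 ** (-2 * n)
    approx = math.exp(-j * j / n) / math.sqrt(math.pi * n)
    assert abs(exact - approx) / approx < 0.01   # within 1 percent
```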
For a < b, write P_{[a,b]} for the limit, for n going to infinity, of the probability that 2j lies between a√(2n) and b√(2n). We have

    P_{[a,b]} = lim_{n→∞} ∑ (1/√(πn)) · e^{−j²/n},    (20)

where the sum is over all j with a ≤ j√(2/n) ≤ b. Noting that this sum is a Riemann sum for an integral with intervals of length √(2/n), we define x = j√(2/n) to get

    P_{[a,b]} = lim_{n→∞} ∑ (1/√(2π)) · e^{−x²/2} · √(2/n) = ∫_a^b (1/√(2π)) · e^{−x²/2} dx.
This proves that the probability distribution in the limit is f , as claimed above. What we proved is just a special case of the Central
Limit Theorem, which holds more generally for sums of independent random variables drawn from a common distribution with finite
average and variance.
Returning to the origin.  We now come back to the fourth question about 1-dimensional random walks. To count the number of times the random walker returns to the origin, we add up the indicator functions for that event:

    J_n = 1 if S_n = 0, and 0 otherwise,   #Returns = ∑_{n=0}^∞ J_{2n}.
The expected number of visits to the origin can now be computed by adding the expectations of the indicator functions:

    E(#Returns) = ∑_{n=0}^∞ E(J_{2n}) = ∑_{n=0}^∞ P(S_{2n} = 0) ∼ ∑_{n=0}^∞ 1/√(πn),
which is infinity. In words, the random walker is expected to visit the origin infinitely often. It is even true that with probability 1, the random walker returns to the origin infinitely often, but this does not follow from the infinite expectation. The gap in the argument results from the possibility of having a random variable with finite values but infinite expectation. An example is the St. Petersburg Paradox. You flip a coin until getting tail the first time. If you get k heads before the tail, your payoff is 2^{k+1} euros. The expected value is

    E(payoff) = (1/2) · 2¹ + (1/4) · 2² + (1/8) · 2³ + . . . ,

which is infinitely many euros. You should therefore be ready to pay any finite amount to play this game, which is apparently not true. Herein lies the paradox.
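The divergence is easy to make concrete: each term of the expectation equals 1, so truncating the sum after K terms gives exactly K, which grows without bound. A Python sketch of this truncated expectation:

```python
# Sketch: truncated expected payoff of the St. Petersburg game.
# The k-th outcome (k heads, then a tail) has probability 2^-(k+1)
# and payoff 2^(k+1), so every term contributes exactly 1.
def truncated_expectation(K):
    return sum((2 ** -(k + 1)) * (2 ** (k + 1)) for k in range(K))

assert truncated_expectation(10) == 10
assert truncated_expectation(1000) == 1000
```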
Beyond one dimension.  The four questions for a random walker in one dimension can also be asked for the d-dimensional integer grid, Z^d; see Figure 17. Its elements are the d-vectors with integer coordinates. As before, we define
Figure 17: The integer grid in the plane. Every point is an integer combination of the two unit coordinate vectors.
    S_n = X_1 + X_2 + . . . + X_n,

where each X_i is ±1 times one of the d unit coordinate vectors. We have E(X_i) = 0, for every i, and therefore E(S_n) = 0, for every n. We have X_i X_i = 1, for all i, where multiplication means the scalar product. This is a useful operation because X_i X_i = ‖X_i‖² is the squared distance of X_i from the origin. For i ≠ j, we have

    P(X_i X_j = 1) = P(X_i X_j = −1) = 1/(2d),   P(X_i X_j = 0) = 1 − 1/d.
The expected value of X_i X_j is therefore 1, if i = j, and 0, if i ≠ j. Hence,

    E(S_n S_n) = E(( ∑_{i=1}^n X_i ) · ( ∑_{j=1}^n X_j ))

evaluates to n, as in one dimension. This suggests that the average distance to the origin in Z^d is some constant times √n. In contrast to one dimension, there are many integer points at distance at most √n from the origin, namely some constant times n^{d/2}. This makes a difference when we calculate the expected number of times the random walker returns to the origin. Doing the computations rigorously would take too much time, but we get qualitatively the correct result even when we take short-cuts. Specifically, for S_{2n} to be 0, it must be zero in each coordinate. For this, it is necessary that the walker takes an even number of steps in each coordinate. Assuming this even number is 2n/d, for each coordinate, the probability that the walker is at 0 is

    (1/√(πn/d))^d = const · n^{−d/2}.
Taking the sum, for n from 0 to ∞, we get infinity, for d = 1, 2, and a finite value, for d ≥ 3. In words, there is a striking difference between 2 and 3 dimensions, namely that it is much easier to find a place in 2 dimensions, even if the search has neither the benefit of knowledge nor of a strategy beyond randomness.
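The dichotomy between d = 2 and d = 3 can be seen directly in the partial sums of n^{−d/2}, which (up to constants) govern the expected number of returns. A Python sketch; the cutoffs are arbitrary choices:

```python
# Sketch: partial sums of n^(-d/2). For d = 1, 2 the sums keep growing
# (divergence), while for d = 3 the tail contribution becomes tiny
# (convergence).
def partial_sum(d, N):
    return sum(n ** (-d / 2) for n in range(1, N + 1))

# d = 2 gives the harmonic series: going from N = 10^4 to 10^5 still
# adds roughly ln(10), so the sum diverges.
assert partial_sum(2, 100000) - partial_sum(2, 10000) > 2
# d = 3: the same extension adds almost nothing, so the sum converges.
assert partial_sum(3, 100000) - partial_sum(3, 10000) < 0.02
```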