EE603 Class Notes
12/05/13
John Stensby
Chapter 11: Sequences of Finite-Second-Moment Random Variables
The theory of sequences of finite-second-moment random variables is the topic of this chapter. We study their application to system theory, where they serve as the system's input and output. It is natural to ask if a given random sequence has a limit, and in what sense the limit is approached. Convergence of random variable sequences is discussed in this chapter. This chapter deals with discrete phenomena and mathematics.
Random sequences occur in applications where analog signals are sampled. They have
applications in the fields of signal and image processing, digital control and digital
communications. They have many applications outside of electrical engineering (for example, in
the world of games, stocks, money and finance).
Sequence of Random Variables – A Basic Definition
Let (S, F, P) be a probability space (see Chapter 1 of these notes). A random variable X(ζ) maps S into the extended real line. (See Chapter 2 for the definition of a random variable.)
A mapping from a sample space into a set of discrete-time sample functions is called a random, or stochastic, sequence X(n;ζ), also known as a discrete-time random process. Often, we suppress the argument ζ and write X(n). For each fixed ζ in some sample space S, the function of n denoted by X(n;ζ) is an "ordinary" deterministic sequence of numbers known as a sample function. Alternatively, also true is the fact that X(n;ζ) is a sequence of random variables that is indexed on n. That is, for a fixed index n0, X(n0;ζ) is a random variable.
Example 11-1: X(n;)  X()f(n), where X() is a random variable, and f(n) is a deterministic
sequence of real numbers, is a simple random sequence.
Example 11-2: X(n;)  A()sin(n/10 + ()), where A() and () are random variables, is a
random sequence.
These two elementary examples have the feature that their future values are predictable from
their present and past values.
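As a quick numerical sketch (an addition to these notes, not from the original text), the following Python lines generate one sample function of the sequence in Example 11-2. The Rayleigh amplitude and uniform phase are assumptions chosen only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
n = np.arange(0, 50)                     # time index
A = rng.rayleigh(scale=1.0)              # assumed amplitude distribution (illustrative only)
theta = rng.uniform(0.0, 2.0 * np.pi)    # assumed phase distribution (illustrative only)

x = A * np.sin(np.pi * n / 10 + theta)   # one sample function X(n; zeta) of Example 11-2
print(x[:5])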
Repeated Bernoulli Trials - the Quintessential Example of an Infinite Random Sequence
Consider the tossing of a fair coin. Here, we have the sample space S = [H, T], the set of events (i.e., σ-algebra) F = {[H], [T], [H,T], Ø}, and the probability measure P that is usually associated with the tossing of a fair coin (i.e., P[H] = P[T] = 1/2, etc.). (S, F, P) is the probability space for the coin tossing experiment. We define a random variable X: S → R as

X(H) ≡ 1
X(T) ≡ 0 .        (11-1)

We know that {X < 1/2} = [T], {X > 1/2} = [H], etc. In what follows, probability space (S, F, P) and random variable X will be used "to build" the Bernoulli trials random sequence.
This random sequence X(n) is defined easily. On the nth toss, assign X(n) = 1 (alternatively, X(n) = 0) if a heads (alternatively, tails) is obtained. We call X(n) the Bernoulli trials random sequence. This simple sequence must be described by using the methodology outlined above, a task that introduces some complexity. We need a probability space (S∞, F∞, P∞) so that X(n;ζ) can be defined as a mapping from S∞ into a set of binary functions. The space (S∞, F∞, P∞), the development of which is outlined below, is a product space.
Our product space (S , F, P) is developed by using ideas from Chapter 1 of these notes
(also see Chapter 8 of Stark and Woods, Probability and Random Processes, 4rd ed.). Instead of
considering individual heads and tails as elementary outcomes of separate experiments, our
product space has elementary outcomes that are infinite head/tail sequences. We define the
Bernoulli trials random sequence X(n;) as a mapping from sample space S into a set of binary
functions.
Our product space will be built as an infinite Cartesian product of (S, F, P) with itself
(recall that (S, F, P) describes the coin tossing experiment). But first, by (Sn, Fn, Pn), we denote
the “nth repetition” of (S, F, P); that is, Sn = {[Hn], [Tn]} and Fn = {[Hn], [Tn], [Hn, Tn], Ø}, where
Hn and Tn denotes “heads on the nth toss” and “tails on the nth toss”, respectively. Pn is the usual
probability measure that is used for the tossing of a fair coin (i.e., Pn[Hn] = Pn[Tn] = 1/2, etc.). Now, denoted as (S∞, F∞, P∞), our infinite-dimensional product space is determined from the (Sk, Fk, Pk), k ≥ 1, as outlined in what follows.
The sample space S∞ is the infinite Cartesian product

S∞ = ⨉_{k=1}^{∞} S_k = S_1 × S_2 × S_3 × ⋯ × S_n × ⋯ .        (11-2)
Elements of S consist of infinite sequences of heads and tales. Element   S has the form
   1,   ,   ,   ,
(11-3)
where k  Sk is either heads or tails, 1  k <  ( is a sequence of heads/tails outcomes, not the
outcome of a specific trial).
F denotes the set of events (i.e., the -algebra) for the product space. F includes all sets
of the form

 A k  A1  A2 
k 1
  An   ,
(11-4)
where Ak  Fk, 1  k <  (set (11-4) is called a generalized rectangle). Also, all countable
intersections and unions of such sets are included in F. For example, consider the event [the
first two tosses produce different outcomes]  F . This event in F is represented as
{H1}  {T2}  S3  S4 


 {T1}  {H 2 }  S3  S 4    ,
(11-5)
the union of two generalized rectangles each of the form (11-4). Also, the intersection of events
[{H1}{S2}{S3} … ] [{S1}{T2}{S3} … ] must mean the event [{H1}{ T2}{S3} …
]. As it turns out, F is the -algebra generated by the collection (i.e., set) of all generalized
rectangles of the form (11-4) (see Chapter 1 for details on how a -algebra can be generated by a
collection of sets).
To finish our product space, we must define P∞, a probability measure on the product space. To accomplish this, we use the fact that the successive trials are independent, and probabilities can be multiplied (without this assumption, it would not be possible to define P∞ without knowing the interdependence of each trial on the other trials). We start with events of the form given by (11-4), and we define

P∞[ ⨉_{n=1}^{∞} A_n ] = ∏_{n=1}^{∞} P_n(A_n) = P_1(A_1) P_2(A_2) P_3(A_3) ⋯ P_n(A_n) ⋯        (11-6)
(note the different interpretations of the product symbols: a Cartesian product of events on the left, an algebraic product of probabilities on the right). We realize that every event in F∞ can be represented as countable unions and/or intersections of events of the form (11-4). And, we use the Axioms of Probability (specifically, the Countable Additivity property - possessed by all valid probability measures) to extend definition (11-6) to all of F∞. For example, consider the event [the first two tosses are different] ∈ F∞ given by (11-5). The probability of this event is
P the first two tosses are different 
 {T1}  {H 2 }  S3  S 4   

 P {H1}  {T2 }  S3  S 4  


 P {H1}  {T2 }  S3  S 4  

 
 P[H1 ] P [T2 ] P[S3 ] P [S 4 ] 

P  {T1}  {H 2 }  S3  S 4    



P[T1 ] P [H 2 ] P[S3 ] P [S 4 ] 
(11-7)
 1/ 4  1/ 4  1/ 2 ,
where we have used the fact that the event [the first two tosses are different] can be represented as the union of two events of the form (11-4). This finishes the definitions of P∞ and our product space (S∞, F∞, P∞). Note that we have developed the same product space that is discussed in Chapter 8 of Stark and Woods, 4th edition (also in the 3rd edition, Ch. 6). Finally, using our infinite-dimensional product space (S∞, F∞, P∞), we are in a position to define the Bernoulli trials random sequence. Denote an elementary outcome in S∞ as ζ. That is, ζ = (ζ_1, ζ_2, …) ∈ S∞, where each ζ_k ∈ S_k, k ≥ 1, is either a head or tail (so that ζ is an infinite indexed sequence of heads and tails). We define the Bernoulli trials random sequence as
X (n ;  )  1,  n  H n
 0,
(11-8)
 n  Tn
a mapping from  S into the set of binary functions (remember that k is the kth component
of ).
The Limit of Nested Event Sequences
When dealing with an infinite sequence of random variables, we need to be able to define
the notion of a limit of an event sequence. In general, the limit of an event sequence is
somewhat complicated and abstract. Before considering the general case, we first consider the
important simple special case of nested sequences.
A nested decreasing sequence of events is a simple concept. The event sequence A_k, k ≥ 1, is nested and decreasing if for each integer n ≥ 1 we have

A_1 ⊃ A_2 ⊃ … ⊃ A_n .        (11-9)
A convenient feature of such a sequence is that

A_N = ∩_{k=1}^{N} A_k .        (11-10)
Like a bounded and monotone sequence of real numbers, all of which have a real-number limit, a nested decreasing sequence of events has a well-defined limit event. As N → ∞ in (11-10), we obtain A∞, a countable intersection of events. And, a countable intersection of events is an event (recall that the set of events, a σ-algebra, is closed under countable unions and intersections). So, the limit of (11-10) is well defined. Often, we write A_N ↓ A∞, where A∞ is the limit.
In some applications, an event can be expressed as the limit of a nested decreasing sequence of events, a sometimes-valuable representation. For example, let X(n), n ≥ 1, be a sequence of random variables and consider

A ≡ [X(n) < 5, n ≥ 1] = [X(1) < 5] ∩ [X(2) < 5] ∩ [X(3) < 5] ∩ ⋯ ∩ [X(N) < 5] ∩ ⋯ = limit_{N→∞} A_N ,        (11-11)

where

A_N = ∩_{n=1}^{N} {X(n) < 5} .        (11-12)

Note that A_1 ⊃ A_2 ⊃ … ⊃ A_N so that A_N is a nested decreasing sequence that has the limit A = {X(n) < 5, n ≥ 1}.
Similar results and statements can be made for a nested increasing sequence of events. The sequence B_N is a nested increasing event sequence if B_1 ⊂ B_2 ⊂ … ⊂ B_N for all N. Furthermore, we can write

B_N = ∪_{n=1}^{N} B_n .        (11-13)
A nested increasing event sequence always has a limit

B∞ ≡ limit_{n→∞} B_n        (11-14)

since B∞ can be written as a countable union of events. Often, we write B_N ↑ B∞ .
Nested sequences of events are special cases of general event sequences. In Appendix
11b, we define the limit, when it exists, of an arbitrary sequence of events (unlike the case of
nested sequences, the limit of an arbitrary event sequence may not exist!).
Concerning infinite intersections and unions, some standard notation needs to be
reviewed. For Bn, n  1, an arbitrary sequence of events, we utilize the standard notation

∩_{n=1}^{∞} B_n ≡ limit_{N→∞} ∩_{n=1}^{N} B_n        (11-15)

∪_{n=1}^{∞} B_n ≡ limit_{N→∞} ∪_{n=1}^{N} B_n .        (11-16)
Of course, the limits (11-15) and (11-16) may, or may not, exist when the Bn are non-nested.
Computing P[A], where A is the Limit of a Nested Event Sequence
We need to be able to compute probabilities like P[X(n) < 5, n ≥ 1]. This probability can,
we will argue, be computed as the limit of P[AN], where AN is represented by (11-12). That is,
we need to show the second equality in (the first equality is a definition)

P[ ∩_{n=1}^{∞} {X(n) < 5} ] = P[ limit_{N→∞} ∩_{n=1}^{N} {X(n) < 5} ] = limit_{N→∞} P[ ∩_{n=1}^{N} {X(n) < 5} ] .        (11-17)
To any specified accuracy, this limit can be approximated by using sufficiently large N.
The second equality in Equation (11-17) follows from the continuity of the probability
measure P, a fact that we will argue in what follows. The events

A_N = ∩_{n=1}^{N} {X(n) < 5}        (11-18)

form an indexed set of nested, decreasing events. The limit of the nested sequence is

A = limit_{N→∞} A_N = limit_{N→∞} ∩_{n=1}^{N} {X(n) < 5} = ∩_{n=1}^{∞} {X(n) < 5} .        (11-19)
As will be shown in a section that follows, for the nested sequence of decreasing events, we have
P  A   P  limit AN   limit P  AN  .
 N 
 N 
(11-20)
That is, we can interchange P and the limit operations. A similar statement will be made for a nested sequence of increasing events, B_N ↑ B∞.
Nested sequences are just special cases. In Appendix 11-B, we define what is meant by
the limit of an event sequence where the events are not generally nested. Also, we argue that
(11-20) is true for arbitrary convergent sequences of events.
Continuity of a Probability Measure
On a general probability space (S, F, P), the probability measure P has a continuity
property. This is satisfying from an intuitive sense; it allows us to use P as a metric, or “gauge”,
to “measure” the “size” of an event. Also, the continuity of P is used when we approximate the
probability of an event that is represented as the limit of an infinite sequence of nested events.
Figure 11-1: An increasing sequence of events, B_1 ⊂ B_2 ⊂ B_3 ⊂ B_4 ⊂ ⋯. (Figure not reproduced.)

There is an analog here to the theory of continuous functions. Let f(x) be any function with domain that includes x_0. Then f(x) is continuous at x_0 if and only if
limit f ( xn )  f (limit xn )  f ( x0 )
n 
n 
(11-21)
for all sequences {xn} that converge to x0. In words, Equation (11-21) states that one can
interchange limit and function computation. In the sense described by Theorem 11-1 (and the
more inclusive results given in Appendix 11B), this basic idea carries over to probability
measures.
Theorem 11-1: Consider an increasing sequence of events as shown by Figure 11-1. That is, the events are such that B_n ⊂ B_{n+1} for all n ≥ 1. Define the infinite union of these events as
B  limit BN  limit
N 
N
 Bn 
N  n=1

 Bn ,
(11-22)
n=1
a well-defined event (since a σ-algebra is closed under countable unions). Then, to any degree of accuracy that is required, P[B∞] can be approximated by P[B_n] for sufficiently large n. That is, we have

limit_{n→∞} P[B_n] = P[ limit_{n→∞} B_n ] = P[B∞] .        (11-23)
In words, (11-23) says that we can move the limit operation from “outside” to “inside” the
probability measure (interchange the limit and probability operations).
Proof: We define the sequence of events
A1  B1
A 2  B2  B


(11-24)
An  Bn  Bn 1


where the over-bar denotes set complement. The B_n are nested, and the disjoint A_n, 1 ≤ n ≤ N, "union up" to B_N so we can write

B_N = ∪_{n=1}^{N} B_n = ∪_{n=1}^{N} A_n ,   1 ≤ N ≤ ∞ (i.e., including N = ∞) .        (11-25)
As a result of this, we have

P[B_N] = P[ ∪_{n=1}^{N} B_n ] = P[ ∪_{n=1}^{N} A_n ] = Σ_{n=1}^{N} P[A_n]        (11-26)
for all finite N. Now take the limit of (11-26) to obtain
limit P  BN   limit
N 
N
 P[ An ] 
N  n 1

 P[ An ] .
(11-27)
n 1
Now, the most crucial step in the proof answers the question: does the sum on the right-hand
side of (11-27) converge? If yes, what does it converge to? Since the An are disjoint, we can use
the Countable Additivity Property of P (see Chapter 1) to write
limit P  BN  
N 



 P[ An ] P   An   1 .
n 1
 n =1 

(11-28)
In (11-27), the middle Nth partial sum is an increasing sequence of real numbers that is bounded
above by unity, as can be seen by (11-28). Hence, the limits in (11-27) and (11-28) converge.
To find out what they converge to, simply use

∪_{n=1}^{∞} A_n = ∪_{n=1}^{∞} B_n = B∞ ,        (11-29)
in (11-28) to obtain the desired result
limit P  Bn   P  B   P  limit Bn  . 
n 
 n  
(11-30)
Corollary 11-1: A version of Theorem 11-1 holds for a decreasing nested sequence of events. That is, suppose B_n ⊃ B_{n+1} for n ≥ 1. Then we can write

P[B∞] = P[ limit_{n→∞} B_n ] = limit_{n→∞} P[B_n] ,        (11-31)
where

B_N = ∩_{n=1}^{N} B_n ,   B∞ = ∩_{n=1}^{∞} B_n .        (11-32)
Proof: Similar to the proof given for Theorem 11-1.
Appendix 11B extends Theorem 11-1 to more general, non-nested sequences of events.
In the appendix, we define the limit, if it exists, of an event sequence, not necessarily nested. If
event A is the limit of an infinite event sequence An , we show that P[A] is the limit of P[An] as
index n approaches infinity. So, the probability measure P is continuous!! The analogy, drawn
in the paragraph preceding Theorem 11-1, to continuous functions is valid!
Example 11-3: Theorem 11-1 and its corollary are used to approximate the probability of an event that is represented as the limit of an infinite sequence of events. For example, for each n ≥ 0, let B_n = {X[k] < 2 for 0 ≤ k ≤ n}. This is a decreasing and nested sequence of events: B_{n+1} ⊂ B_n, n ≥ 0. Suppose we wanted to calculate P[B∞], where B∞ = {X[k] < 2 for 0 ≤ k}. We know that

B∞ = ∩_{n=0}^{∞} B_n .        (11-33)
We use the corollary to approximate (as closely as desired) P[B∞] as the probability of a finite intersection. That is, based on our accuracy requirements, we select N and approximate

P[B∞] = P[ ∩_{n=0}^{∞} B_n ] ≈ P[ ∩_{n=0}^{N} B_n ] = P[ X(0) < 2, X(1) < 2, …, X(N) < 2 ] .        (11-34)
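The approximation (11-34) can be checked numerically. The sketch below is an addition to these notes; it assumes an i.i.d. standard Gaussian sequence for X[k] purely for illustration, estimates P[B_N] by Monte Carlo, and shows it decreasing toward a limit as N grows.

import numpy as np

rng = np.random.default_rng(0)
trials, Nmax = 100_000, 50
X = rng.standard_normal((trials, Nmax + 1))      # assumed i.i.d. N(0,1) sequence

for N in (1, 5, 10, 25, 50):
    B_N = np.all(X[:, :N + 1] < 2.0, axis=1)     # event {X(0)<2, ..., X(N)<2}
    print(N, B_N.mean())                         # estimate of P[B_N]; decreases toward P[B_infinity]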
Example 11-4: Back in Chapter 2 of these class notes, we were told that probability distribution functions are right continuous. We were told that

F(x) = limit_{n→∞} F(x + 1/n)        (11-35)
for any distribution function F(x) and all x. However, Equation (11-35) follows directly from
Theorem 11-1 since

limit_{n→∞} F(x + 1/n) = limit_{n→∞} P[X ≤ x + 1/n]
                       = P[ limit_{n→∞} {X ≤ x + 1/n} ]
                       = P[{X ≤ x}]        (11-36)
                       = F(x).
Statistical Specification of a Random Sequence
In this chapter, we assume that all random variables are real-valued. This assumption
greatly simplifies the notation, definitions and theory. From a conceptual standpoint, little is lost
by assuming that everything is real valued (however, complex-valued random sequences are
important - and often used - in many applications where band-pass signals are represented by
their complex-valued, low-pass equivalents).
A random sequence is statistically specified by its distribution functions; all orders are required in general. That is, for each positive integer n, and for all positive integer sequences k_1 ≤ k_2 ≤ … ≤ k_n, we need knowledge of the nth-order distribution function

F(x_{k_1}, x_{k_2}, …, x_{k_n}; k_1, k_2, ..., k_n) ≡ P[ X[k_1] ≤ x_{k_1}, X[k_2] ≤ x_{k_2}, …, X[k_n] ≤ x_{k_n} ] .        (11-37)
Note that a complete statistical specification requires an infinite set of distribution functions. In (11-37), the algebraic variables x_{k_1}, x_{k_2}, …, x_{k_n} are called realization variables. The subscripts on these variables serve only to distinguish one variable from another; F(α, β, γ; k_1, k_2, k_3) is just as meaningful as F(x_{k_1}, x_{k_2}, x_{k_3}; k_1, k_2, k_3).
The probability density functions are obtained by differentiating distribution functions.
That is, the nth-order probability density function is defined as

f(x_{k_1}, x_{k_2}, …, x_{k_n}; k_1, k_2, ..., k_n) ≡ ∂^n F(x_{k_1}, x_{k_2}, …, x_{k_n}; k_1, k_2, ..., k_n) / (∂x_{k_1} ∂x_{k_2} ⋯ ∂x_{k_n}) .        (11-38)
The moments of a random sequence are important in applications. The mean (sometimes called the first-order average) is defined as

η[n] ≡ E[X(n)] = ∫_{−∞}^{∞} x f(x; n) dx        (11-39)

for a sequence of continuous random variables.
Second-order statistical averages appear often in practice. For example, the autocorrelation function is defined as

R_X(n, m) ≡ E[X(n)X(m)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x_n x_m f(x_n, x_m; n, m) dx_n dx_m .        (11-40)
In a similar manner, the autocovariance function is defined as

C_X(n, m) ≡ E[ {X(n) − η(n)}{X(m) − η(m)} ] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} {x_n − η(n)}{x_m − η(m)} f(x_n, x_m; n, m) dx_n dx_m .        (11-41)
Note that both RX and CX are symmetric

R_X(n, m) ≡ E[X(n)X(m)] = E[X(m)X(n)] = R_X(m, n)        (11-42)

C_X(n, m) ≡ E[ {X(n) − η(n)}{X(m) − η(m)} ] = E[ {X(m) − η(m)}{X(n) − η(n)} ] = C_X(m, n) .        (11-43)
Also, we can write
CX (n, m)  R X (n, m)  (n)(m) .
(11-44)
The sequence X(n) is said to have uncorrelated elements (or to be uncorrelated) if

R_X(n, m) ≡ E[X(n)X(m)] = E[X(n)] E[X(m)] = η(n)η(m) ,   n ≠ m .
For such a sequence, (11-44) leads to the conclusion that

C_X(n, m) = R_X(n, m) − η(n)η(m) = σ²(n),   n = m
          = 0,                              n ≠ m ,        (11-45)

where σ²(n) denotes the sequence variance.
Example 11-5: Many applications involve the arrival of objects. For example, we may be interested in the arrival of cars at an intersection, the arrival of electrons at the plate of a vacuum tube, etc. A commonly-used simplifying assumption is that the objects arrive independently of one another. Let τ(n) denote the interval of time (in seconds) between the arrival of the (n-1)th and nth objects (relative to a given initial time 0, τ(1) is the arrival time for the first object). The
time line is depicted by Fig. 11-2 below. For n ≥ 1, we assume that τ(n) is a sequence of independent, identically distributed random variables, each with the exponential density

f_τ(t; n) = λ exp[−λt] U(t) .        (11-46)
The mean of (n) is

  (n)  E[ (n)]   x  e  x dx  1/  ,
(11-47)
0
and its variance is

σ_τ²(n) ≡ E[τ(n)²] − (1/λ)² = ∫_{0}^{∞} x² λ e^{−λx} dx − (1/λ)² = 2/λ² − 1/λ² = 1/λ² .        (11-48)
Relative to a given initial time 0, the running sum of these intervals gives the arrival times of the objects. That is, the arrival time of the nth object is

T(n) = Σ_{k=1}^{n} τ(k) ,        (11-49)
(1)

0
(2)

(3)

(4)




T(1)
T(2)
T(3)
T(4)
Fig. 11-2: Random arrival times. (n) is the time between arrivals, and T(n) is the
actual arrival time (relative to origin 0).
a sequence of random variables indexed on n. Since the time intervals are independent, the density function f_T(t;n) for T(n) is an (n−1)-fold convolution of (11-46) with itself. We claim that this first-order density function is

f_T(t; n) = λ ( (λt)^{n−1} / (n−1)! ) exp(−λt) U(t) .        (11-50)
This result can be established by induction (by using a different approach, this same result was
derived in Appendix 9B). Clearly, the result is correct for n = 1; assume it is true for n-1. Now,
we convolve again to obtain
f T (t ; n)  f T (t ; n -1)   exp( t)U(t) 
 t

({t  }) n  2
    exp( )
exp( {t  })d U(t)
(n  2)!
 0



t  n-2
d  U(t)
  nexp( t) 
0 (n  2)!


(11-51)


t n 1
exp( t)  U(t)
  n
 (n  1)!

as claimed. Equation (11-51) is the Erlang density, and T(n) is an Erlang distributed random
variable (this same result was obtained in Appendix 9B). The expected value of random variable
T(n) is

η_T(n) = n η_τ(n) = n/λ .        (11-52)
Since the interval random variables are independent, the variance of T(n) is

σ_T²(n) = n σ_τ²(n) = n/λ² .        (11-53)
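A short simulation sketch (added here, not part of the original notes) checks (11-52) and (11-53): draw independent exponential interarrival times τ(k), form T(n) by (11-49), and compare the sample mean and variance of T(n) with n/λ and n/λ². The rate λ = 2 and n = 10 are assumptions chosen for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
lam, n, trials = 2.0, 10, 200_000

tau = rng.exponential(scale=1.0 / lam, size=(trials, n))   # tau(1), ..., tau(n)
T_n = tau.sum(axis=1)                                      # arrival time T(n), Eq. (11-49)

print(T_n.mean(), n / lam)        # sample mean     vs  n/lambda,   Eq. (11-52)
print(T_n.var(),  n / lam**2)     # sample variance vs  n/lambda^2, Eq. (11-53)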
Gaussian Random Sequences
A random sequence X(n) is called a Gaussian random sequence if all its nth-order
probability density functions are Gaussian. Such sequences are very popular. Because of the
Central Limit Theorem, Gaussian sequences occur in many applications.
Also, they are
completely described by only first- and second-order statistical averages (i.e., means and
covariances). Finally, use of Gaussian statistics simplifies many technical developments and
makes mathematically tractable many problems in the areas of filtering, estimation, detection and
control.
Example 11-6: Let X(n) be a zero-mean Gaussian sequence; that is, E[X(n)] = 0 for all n. Also, let X be delta correlated; that is, R_X(n,m) = E[X(n)X(m)] = σ²δ(n−m), where σ² is the variance and

δ(k) = 1,   k = 0
     = 0,   k ≠ 0 .        (11-54)
Often, delta-correlated sequences are said to be white; in many applications, delta-correlated Gaussian sequences are called white Gaussian noise. For n ≠ m, X(n) and X(m) are uncorrelated and, since they are Gaussian, independent. As a result, an Nth-order density function factors into a product of N first-order density functions.
Most computer-based math packages (such as MATLAB, Mathcad, etc.) generate periodic sequences that, for many purposes, can be used to approximate white Gaussian noise. For these sequences, the correlation between elements can be very low, and the sequence period is very long relative to the number of sequence values that are needed.
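As a sketch of the point just made (an addition to these notes), NumPy's Gaussian generator produces a sequence that behaves like white Gaussian noise for most purposes; the sample autocorrelation estimated below is approximately σ²δ(k). The record length and σ are assumptions for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
sigma, N = 1.5, 100_000
x = sigma * rng.standard_normal(N)            # approximate white Gaussian noise

def sample_autocorr(z, k):
    """Estimate R_X(k) = E[X(n+k)X(n)] from one long record."""
    return np.mean(z[k:] * z[:len(z) - k]) if k > 0 else np.mean(z * z)

for k in range(4):
    print(k, sample_autocorr(x, k))           # near sigma^2 at k = 0, near 0 elsewhere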
Independent Increments
Random sequence X(n) is said to have independent increments if for all N > 1 and n_1 < n_2 < ... < n_N the process increments X(n_1), X(n_2) − X(n_1), X(n_3) − X(n_2), ... , X(n_N) − X(n_{N−1}) are
jointly independent. Such processes have the nice feature that Nth-order density and distribution
functions can be “built up” as products of the densities of the individual increments. For
example, the second order distribution, for the case n2 > n1, can be written as

F(x_1, x_2; n_1, n_2) ≡ P[X(n_1) ≤ x_1, X(n_2) ≤ x_2]
                      = P[X(n_1) ≤ x_1, X(n_2) − X(n_1) ≤ x_2 − x_1]        (11-55)
                      = P[X(n_1) ≤ x_1] P[X(n_2) − X(n_1) ≤ x_2 − x_1] .
We have seen independent increment processes in previous chapters. For example, the
Random Walk process, introduced in Chapter 6, has independent increments.
Stationarity
Often, random sequences are generated by a mechanism that is not changing with time.
In these cases, the sequence moments are constant. More precisely, a random sequence is said to
be stationary if, for all positive integers N, the Nth-order density function of the sequence is
invariant to any shift of the index. That is, stationarity requires

f(x_{n_1}, x_{n_2}, …, x_{n_N}; n_1, n_2, ..., n_N) = f(x_{n_1}, x_{n_2}, …, x_{n_N}; n_1 + n_0, n_2 + n_0, ..., n_N + n_0)        (11-56)
for all orders N and index shift values n0.
Example 11-5 introduces a random sequence τ(n) of interval times. Since the interval times are independent, Nth-order densities can be built up as products of first-order densities, each of a form given by (11-46). Clearly, the random sequence τ(n) of interval times is described by an Nth-order density that satisfies (11-56); the sequence τ(n) is stationary. On the other hand, the total waiting time to the nth arrival, the sum T(n) given by (11-49), is not
stationary as is obvious by inspection of (11-52) and (11-53).
Wide-Sense Stationarity (WSS)
A weaker form of stationarity is adequate in some applications. Sometimes, all that is
required is “stationarity in all second-order statistics”. We say that a random sequence is wide
sense stationary (WSS) if its mean function is constant and its covariance depends only on the
time difference. That is, the sequence is WSS if
(n)  (0)
(11-57)
R X (n, m)  R X (n - m)  R X (k) ,
(11-58)
where k  n - m is the time difference between the two sequence values. Clearly, all stationary
sequences are WSS. However, the converse is not true. Gaussian sequences provide an interesting example for which there is no difference between the two forms of stationarity.
Two distinct sequences can have "mutual stationarity" properties. Wide sense stationary sequences X(n) and Y(m) are said to be jointly wide sense stationary if

R_xy[n, m] ≡ E[X(n)Y(m)] = R_xy[n − m] .        (11-59)

That is, the cross correlation depends only on the time difference n−m.
Suppose X(n) and Y(m) are jointly WSS so that R_xy[n−m] = E[X(n)Y(m)]. We define k ≡ n−m and write

R_xy[k] = E[X(m+k)Y(m)] .        (11-60)

That is, for R_xy[k], k denotes the shift applied to the first indexed sequence (i.e., X(m)). Note that R_xy[k] ≠ R_yx[k], in general. However, note that R_xy[k] = E[X(m+k)Y(m)] = E[X(m)Y(m−k)] = R_yx[−k].
Power Spectral Density
Let X(n) be a real-valued, wide-sense-stationary sequence with finite average power E[X(n)²] < ∞. Denote the autocorrelation of X as R_x(k) = E[X(n+k)X(n)]. The power spectrum (or power spectral density) is denoted as S_x(ω). The celebrated Wiener-Khinchine theorem states that the power spectrum and autocorrelation comprise a discrete-time Fourier transform (DTFT) pair. That is, we write
S x ()  F  R x (k)  
R x (k) 


k 
R x (k) e jk ,
    
(11-61)
1 
S x () e jk d .


2
Actually, Sx is 2 periodic in  and only need be specified on – <   . The average power in
X(n) can be expressed as
1 
S ()d
Avg. Pwr  E  X 2 (n)  

 2   x
(11-62)
watts.
White Noise Sequence
A zero-mean X(n) is said to be a white noise sequence if

R_x(k) = E[X(n+k)X(n)] = σ²δ(k) .        (11-63)
Note that 2 is the finite variance of X(n). The power spectral density of X is
S x ()  F  R x (k)  


2 (k) e  jk   2
     .
(11-64)
k 
The average power in X is

Avg. Pwr = E[X²(n)] = (1/2π) ∫_{−π}^{π} σ² dω = σ²   watts.        (11-65)
Note that a discrete-time white sequence has a finite average power (contrast this with the
continuous-time case discussed in Chapter 8).
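A numerical sketch (an addition, not from the original notes): averaging periodograms of independent segments of an approximately white sequence gives an estimate of S_x(ω) that is roughly flat at the level σ², consistent with (11-64). Segment length and the number of segments are assumptions for the demonstration.

import numpy as np

rng = np.random.default_rng(0)
sigma, seg_len, n_seg = 1.5, 256, 400
x = sigma * rng.standard_normal(seg_len * n_seg)

segs = x.reshape(n_seg, seg_len)
periodograms = np.abs(np.fft.fft(segs, axis=1))**2 / seg_len    # per-segment periodograms
S_est = periodograms.mean(axis=0)                               # averaged estimate of S_x(omega)

print(S_est.mean(), sigma**2)     # approximately flat spectrum near sigma^2 = 2.25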
Systems
We are interested in systems with random sequence inputs. First, we review some basic
definitions involving systems. Then, we focus on determining the mean and autocorrelation of
the output of a linear system given descriptions of the input process and system impulse
response.
Given input sequence X(n,), we denote the system output as
Y(n,  )  L  X(n, ) ,
(11-66)
where operator L[·] maps input X into output Y. Often, we do not explicitly write the  variable
in the notation; we write Y(n) = L[X(n)] with the  implied.
The system is said to be linear if

L[αX_1(n) + βX_2(n)] = αL[X_1(n)] + βL[X_2(n)]        (11-67)

for all inputs X_1, X_2 and all constants α, β. A linear system can be described by a unit sample response, denoted as h(n,m), assumed to be real-valued in what follows. This function is the response at time n to a unit sample function applied at time m. In general, impulse response h(n,m) may, or may not, depend on the absolute values of indices n and m, and h(n,m) may, or may not, be nonzero for values of n less than m. Given input X(n) and impulse response
h(n,m), we can express the output as
Y(n) = L  X(n)  =


h(n, )X( ) .
(11-68)
 = 
A linear system is said to be shift invariant (or time invariant) if a simple delay in the
input sequence produces a corresponding delay in the output sequence. More formally, we say
that linear system L[·] is shift invariant if
Y(n) = L  X(n)   Y(n - n 0 ) = L  X(n - n 0 ) 
(11-69)
for all input/output pairs (X, Y) and all index shifts n0. Shift invariant systems depend only on
the difference of n and , not their absolute values. In this case, we can write h(n,) = h(n - ).
Also, for shift invariant systems, Equation (11-68) becomes
Y(n) = L  X(n)  =


h(n - )X( ) = h * X ,
(11-70)
 = 
the convolution of input X with impulse response h.
A system is said to be bounded input - bounded output (BIBO) stable if bounded input
sequences produce bounded output sequences. A linear, shift-invariant system is BIBO stable if,
and only if, its impulse response is absolutely summable; that is, BIBO stability is equivalent to

Σ_{n=−∞}^{∞} |h(n)| < ∞ .        (11-71)
A linear, shift-invariant system is said to be causal if it does not respond before it is
excited. More explicitly, for a causal system, if two inputs X1(n) and X2(n) are equal up to some
index n0, then the corresponding outputs Y1(n) = L[X1(n)] and Y2(n) = L[X2(n)] must be equal up
to index n0; what happens to the inputs after index n0 in no way influences the outputs before
index n0. One can show that a linear, shift-invariant system is causal if, and only if, h(n) = 0 for
n < 0. For a linear, shift-invariant and causal system, the input-output relationship becomes
Y(n) = L  X(n) =
n

h(n  )X( ) .
(11-72)
 = 
One should consider the differences between (11-68), the most general I/O formula, (11-70) for
the shift-invariant case and (11-72) which describes the most restrictive case.
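To make the contrast concrete, here is a small Python sketch (added for illustration, not part of the original notes) of the shift-invariant sum (11-70) and the causal sum (11-72), truncated to finite index ranges. The impulse response values are assumptions chosen only for the demo; for a causal h the two forms agree.

import numpy as np

def lsi_output(h, x):
    """Shift-invariant I/O sum (11-70), truncated to the length of x."""
    return np.convolve(h, x)[:len(x)]

def causal_lsi_output(h, x):
    """Causal form (11-72): only past and present inputs contribute (h(n) = 0 for n < 0 here)."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        for l in range(n + 1):
            if n - l < len(h):
                y[n] += h[n - l] * x[l]
    return y

h = 0.6 ** np.arange(10)      # an example causal impulse response (assumed values)
x = np.ones(20)
print(np.allclose(lsi_output(h, x), causal_lsi_output(h, x)))    # True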
Linear, shift-invariant systems can be analyzed in the frequency domain. For this purpose, we describe the Fourier Transform of signal X(k) as

X_F(e^{jω}) ≡ Σ_{k=−∞}^{∞} X(k) e^{−jωk}        (11-73)
(we will use a subscript of F to denote a Fourier transform). If (11-73) converges, X_F is a continuous, 2π-periodic function of frequency variable ω. The inverse Fourier transform is

X(n) = (1/2π) ∫_{−π}^{π} X_F(e^{jω}) e^{jωn} dω .        (11-74)
The Fourier transform of a linear, shift-invariant system's output can be found easily.
Simply use the convolution theorem with (11-70) to obtain
YF (e j )  F [h(n)  X(n)]  H F (e j )X F (e j ) .
Updates at http://www.ece.uah.edu/courses/ee385/
(11-75)
11-24
EE603 Class Notes
12/05/13
John Stensby
Systems With Random Inputs
Given a system with a random input, we determine below the mean and autocorrelation
of the output. A more general, difficult problem is to find the Nth-order density function that
describes the system output. A linear system with a Gaussian input will have a Gaussian output.
Unfortunately, a general statement of this scope cannot be made for nonlinear systems or
systems driven by non-Gaussian inputs.
Theorem 11-2: Consider the linear system with input X(n) and output Y(n) = L[X(n)] (we do not require shift-invariance or causality). Suppose that both η_X(n) = E[X(n)] and η_Y(n) = E[Y(n)] exist. For this case, we can write

η_Y(n) = E[Y(n)] = E[ L[X(n)] ] = L[ E[X(n)] ] = L[ η_X(n) ] .        (11-76)
That is, it is possible to interchange the operations of L[·] and E[·]. We write

E[Y(n)] = E[ Σ_{m=−∞}^{∞} h(n, m) X(m) ] .        (11-77)
Then, we formally interchange the summation and expectation to obtain

η_Y(n) = E[Y(n)] = E[ Σ_{m=−∞}^{∞} h(n, m) X(m) ] = Σ_{m=−∞}^{∞} h(n, m) E[X(m)] = Σ_{m=−∞}^{∞} h(n, m) η_X(m) ,        (11-78)
and (11-76) is established.
Note that our derivation of (11-76) is not rigorous. A potential problem with (11-78) is
the formal interchange of expectation and summation. In cases where the mean of Y(n) does not
exist, this interchange is not valid (can you construct a simple example where the mean of Y(n)
does not exist, i.e., the interchange in (11-78) is not valid?). We will consider this “interchange
problem” again once we have studied some convergence concepts.
Let's consider a special case of (11-78); suppose that input X(n) is wide-sense stationary and the system is shift invariant. Then, h(n, m) = h(n − m) and η_X(n) = η_X is a constant so that

η_Y(n) = Σ_{m=−∞}^{∞} h(n − m) η_X = [ Σ_{m=−∞}^{∞} h(m) ] η_X = H(e^{j0}) η_X ,        (11-79)

so η_Y(n) = η_Y is a constant as well. The bracketed quantity on the right-hand side of (11-79) is the DC gain of the system (which we assume to be finite in the development of (11-79)).
Example 11-7: Consider a low-pass filter with impulse response h(n) = αⁿU(n), where 0 < α < 1 to ensure stability. The Fourier transform of h is

H(e^{jω}) = 1 / (1 − α e^{−jω}) .        (11-80)

According to (11-79), the mean of the filter output is η_X/(1 − α).
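A quick numerical check (added as a sketch, assuming a white input with nonzero mean purely for illustration): filtering a WSS input through h(n) = αⁿU(n), truncated for computation, gives an output whose sample mean is close to η_X/(1 − α), the DC-gain result of (11-79).

import numpy as np

rng = np.random.default_rng(0)
alpha, eta_X, N = 0.6, 2.0, 200_000

x = eta_X + rng.standard_normal(N)            # WSS input with mean eta_X (assumed model)
h = alpha ** np.arange(60)                    # truncated impulse response h(n) = alpha^n U(n)
y = np.convolve(x, h)[:N]                     # filter output; start-up transient is negligible here

print(y[1000:].mean(), eta_X / (1 - alpha))   # sample output mean vs eta_X/(1 - alpha) = 5.0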
Next, we determine the cross correlation between a system input X(n) and its output Y(n), both input and output assumed to be real valued. This quantity is defined as

R_XY(n, m) ≡ E[X(n)Y(m)] .        (11-81)
Then, we use this result to find the autocorrelation RY of the system output.
Theorem 11-3: Let X(n) and Y(n) denote the input and output, respectively, of a linear operator
L[·]; that is, Y(n) = L[X(n)]. The cross-correlation between the input X and output Y can be
calculated by the formula

R_XY(n, m) = L_2[ R_X(n, m) ] ,        (11-82)

where L_2 signifies that L operates with respect to the second variable (i.e., "m" is the independent variable in the operation), treating the first variable (i.e., "n") as a constant. In a similar manner,
the autocorrelation of the output can be calculated by the formula
R Y (n, m) = L1  R XY (n, m) ,
(11-83)
where L1 signifies that L operate with respect to the first variable only (i.e., “n” is the
independent variable in the operation).
Proof (see Theorems 7-1 and 7-2 for continuous-time version of this result): First, we write

X(n)Y(m) = X(n)L[X(m)] = L_2[X(n)X(m)] ,        (11-84)

where L_2 operates on X(m). Now, take the expected value of this result to obtain

E[X(n)Y(m)] = E[ L_2[X(n)X(m)] ] = L_2[ E[X(n)X(m)] ] = L_2[ R_X(n, m) ] ,        (11-85)
and this establishes (11-82). The formula for the autocorrelation of the output can be developed
by taking the expectation of the product Y(n)Y(m) to obtain
R Y (n, m) = E  Y(n)Y(m) = E  L[X(n)]Y(m) = E  L1[X(n)Y(m)]
= L1  E[X(n)Y(m)]
(11-86)
= L1  R XY (n, m) ,
(L1 operates on functions of n) and this establishes (11-83) so that the theorem is established. 
Note that Theorem 11-3 does not require that operator L (i.e., linear system) be time invariant or
that the input be wide-sense stationary.
Let us consider Theorem 11-3 specialized to the case of a WSS input sequence X(n) and
a shift-invariant, linear system described by unit sample function h(n). For this case, formula
(11-82) yields

R_XY(n, m) = Σ_{ℓ=−∞}^{∞} R_X(n, m − ℓ) h(ℓ)
           = Σ_{ℓ=−∞}^{∞} R_X([n − m] + ℓ) h(ℓ) = Σ_{ℓ=−∞}^{∞} R_X([n − m] − ℓ) h(−ℓ) .        (11-87)

Observe that the right-hand side of (11-87) depends on n, m only through the difference k ≡ n−m. Hence, X and Y are jointly wide sense stationary, and we can write

R_XY(k) = R_X(k) ∗ h(−k) .        (11-88)
For the WSS case, the output correlation formula (11-86) becomes

R_Y(n, m) = Σ_{ℓ=−∞}^{∞} R_XY({n − ℓ} − m) h(ℓ) = Σ_{ℓ=−∞}^{∞} R_XY({n − m} − ℓ) h(ℓ) ,        (11-89)

a formula depending on k ≡ n−m. Hence, we write

R_Y(k) = R_XY(k) ∗ h(k) .        (11-90)
Finally, combining (11-88) and (11-90) yields

R_Y(k) = R_X(k) ∗ h(−k) ∗ h(k) = R_X(k) ∗ {h(−k) ∗ h(k)} ,        (11-91)

and we see that a WSS input produces a WSS output.
Example 11-8: Suppose output Y is related to input X by the simple relationship
Y(n) = L[X(n)] = X(n) - X(n -1) ,
(11-92)
the first-order, backwards difference. For example, sequence Y(n) might be subjected to a
threshold to implement a “pulse detector” function. The mean of the output is

E[Y(n)] = E[X(n)] − E[X(n−1)] = η_X(n) − η_X(n−1) .        (11-93)
The cross-correlation between input and output is
R XY (n, m) = L2  R X (n, m) = R X (n, m) - R X (n, m -1) .
(11-94)
Finally, the autocorrelation of the output is
R Y (n, m) = L1  R XY (n, m)  = R XY (n, m) - R XY (n -1, m)
= R X (n, m) - R X (n, m -1) - { R X (n -1, m) - R X (n -1, m -1) }
(11-95)
= R X (n, m) - R X (n -1, m) - R X (n, m -1) + R X (n -1, m -1) .
Suppose the input is WSS with autocorrelation

R_X(n, m) = a^{|n−m|} ,   0 < a < 1.        (11-96)
Then Equations (11-93) and (11-94) yield

η_Y = 0
R_XY(n, m) = a^{|n−m|} − a^{|n−m+1|} ,        (11-97)

and (11-95) yields

R_Y(n, m) = 2a^{|n−m|} − a^{|n−1−m|} − a^{|n−m+1|} .        (11-98)
The output sequence Y(n) is WSS; if k ≡ n − m, then (11-96) and (11-98) become

R_X(k) = a^{|k|}        (11-99)

R_Y(k) = 2a^{|k|} − a^{|k−1|} − a^{|k+1|} ,        (11-100)
respectively. A comparison of Fig. 11-3 and Fig. 11-4 (both correlations were computed and plotted for a = .6) reveals that the "pulse detector" (11-92) "decorrelates" the input data X(n), at least to some extent.

Fig. 11-3: R_x(k), Eqn. (11-99) with a = .6. (Figure not reproduced.)
Fig. 11-4: R_y(k), Eqn. (11-100) with a = .6. (Figure not reproduced.)
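The decorrelation can be checked numerically. The sketch below is an addition to the notes: it builds an input with R_X(k) = a^|k| as a first-order autoregression driven by white Gaussian noise (one standard construction, assumed here), differences it as in (11-92), and compares the sample autocorrelation of Y with (11-100).

import numpy as np

rng = np.random.default_rng(0)
a, N = 0.6, 200_000

# AR(1) construction with R_X(k) = a^|k| (unit variance): X(n) = a X(n-1) + w(n)
w = np.sqrt(1 - a**2) * rng.standard_normal(N)
x = np.zeros(N)
for n in range(1, N):
    x[n] = a * x[n - 1] + w[n]

y = x[1:] - x[:-1]                                   # Y(n) = X(n) - X(n-1), Eq. (11-92)

def sample_autocorr(z, k):
    return np.mean(z[k:] * z[:len(z) - k]) if k > 0 else np.mean(z * z)

for k in range(4):
    theory = 2 * a**abs(k) - a**abs(k - 1) - a**abs(k + 1)    # Eq. (11-100)
    print(k, sample_autocorr(y, k), theory)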
Vector Space of Random Variables
All real-valued random variables with finite second moments (i.e., finite average power)
comprise a vector space over the field of real numbers. We define vector space L2 as

L2 ≡ { X : E[X²] < ∞ } ,        (11-101)

all real-valued, finite-second-moment random variables. We take the real number field, denoted here by R, as our scalar field. To show that L2 is a valid vector space, we must show, among other things, that L2 is closed under vector addition (i.e., X ∈ L2 and Y ∈ L2 implies that X + Y ∈ L2) and scalar multiplication (i.e., X ∈ L2 and c ∈ R implies that cX ∈ L2).
The fact that L2 is closed under scalar multiplication follows easily. Clearly if X ∈ L2 and c ∈ R then E[(cX)²] = c²E[X²] < ∞ so cX ∈ L2.
The fact that L2 is closed under vector addition follows from use of the Schwarz inequality (sometimes called the Cauchy-Schwarz inequality).
Theorem 11-4 (Schwarz): Let X ∈ L2 and Y ∈ L2. Then

( E[XY] )² ≤ E[X²] E[Y²] .        (11-102)
Proof: Let  be any real-valued number and consider
2
2
2
2
E[ X  Y ]  E[ Y ]   E[XY]   E[XY]   E[ X ]  0 .
(11-103)
Now, Equation (11-103) is a quadratic equation in , and the roots are either complex-valued or
real and equal (see Fig. 5). Hence, in the quadratic equation, the discriminant must be nonpositive, or
Updates at http://www.ece.uah.edu/courses/ee385/
11-31
EE603 Class Notes
12/05/13
John Stensby
E[ Y ] 2  2E[XY]   E[ X ]
2
2
 -axis
Fig. 5: Graph of quadratic equation
2
2
2
4 E[XY]  4E[ X ]E[ Y ]  0 .
(11-104)
The Schwarz inequality follows directly from (11-104). In (11-102), equality results when Y is a
scalar multiple of X.
Now, we show that L2 is closed under vector addition. Let X ∈ L2 and Y ∈ L2 and consider the sum X + Y. The second moment of the sum satisfies

E[ (X + Y)² ] = E[X²] + 2E[XY] + E[Y²]
             ≤ E[X²] + 2√(E[X²]) √(E[Y²]) + E[Y²] .        (11-105)
However, all quantities on the right-hand-side of (11-105) are finite since X ∈ L2 and Y ∈ L2. Hence, the sum X + Y ∈ L2, and L2 is closed under vector addition. Closure under vector addition and scalar multiplication is necessary for L2 to be a valid vector space. The remaining requirements (found in any elementary text on linear algebra) that L2 must satisfy are shown easily. Hence, we can consider the set of all real-valued random variables with finite second moments to be a valid vector space.
Equality of Random Variables
Let X and Y be random variables. The statement X = Y can be interpreted in different
ways. Everything said about statement X = Y can be said about the equivalent statement X - Y =
0, and vice-versa. Hence, without loss of generality, we discuss the meaning of the statement
random variable X = 0.
X  0 Identically
The statement X  0 identically means that the numerical value of X() = 0 for all   S.
This is a very restrictive form of equality, one that is rarely needed in applications. Hence, we
seek a “looser” interpretation of statement X = 0.
X = 0 Almost Surely (a.s.) Means P[X = 0] = 1
The statement X = 0 almost surely (a.s.) means P[X = 0] = P[{ζ ∈ S : X(ζ) = 0}] = 1. Often, this condition is stated as
1) X = 0 almost everywhere (a.e.)
2) X = 0 with probability one,
both equivalent phrases (used by different authors). It should be noted that X = 0 (a.s.) is NOT equivalent to X ≡ 0 (i.e., X(ζ) = 0 for all ζ ∈ S, or everywhere). If X = 0 (a.s.), the event B = {ζ ∈ S : X(ζ) ≠ 0} has probability zero; however, it can be nonempty.
E[X2] = 0 is Equivalent to P[X = 0] = 1 (Same as X = 0 (a.s.))
In words, E[X ] = 0 is stated as X = 0 in mean square, or more simply, X = 0 (m.s.).
E[X ] = 0 is equivalent to P[X  0] = 0. To prove this, we show E[X = 0 if, and only if,
P[X  0] = 0. First, we show the “if” part: assume P[X  0] = 0. Then, X is a discrete random
variable with all probably concentrated at the origin; that is, its distribution function is F(x) =
U(x), a unit step. Observe that

E[X²] = ∫_{−∞}^{∞} x² (dF/dx) dx = ∫_{−∞}^{∞} x² δ(x) dx = 0 .        (11-106)
Second, we show the "only if" part: assume E[X²] = 0. With this, use the Generalized Tchebycheff Inequality (see Chapter 2 of these class notes) to write

P[X² ≥ 1/N] ≤ E[X²] / (1/N) = N E[X²] = 0        (11-107)
for each integer N > 0. Now, note that

P[X ≠ 0] = P[X² > 0] = P[ ∪_{n=1}^{∞} {X² ≥ 1/n} ] = P[ limit_{N→∞} ∪_{n=1}^{N} {X² ≥ 1/n} ] .        (11-108)
But, as indexed on n, the sequence of events {X² ≥ 1/n} is nested increasing. Use continuity of probability and (11-107) to write

P[X ≠ 0] = P[ limit_{N→∞} ∪_{n=1}^{N} {X² ≥ 1/n} ] = limit_{N→∞} P[ ∪_{n=1}^{N} {X² ≥ 1/n} ] = limit_{N→∞} P[X² ≥ 1/N] = 0 .        (11-109)
Equations (11-106) and (11-109) lead to the conclusion that

E[X²] = 0 if, and only if, P[X ≠ 0] = 0 (same as X = 0 (a.s.)).        (11-110)
It is worth repeating that P[X = 0] = 1 is not equivalent to the statement X(ζ) = 0 for all ζ ∈ S.
Subspace of L2
M is said to be a subspace of L2 if it is a valid vector space (i.e., closed under scalar multiplication and vector addition, in addition to the other requirements given in any linear algebra text) and M ⊂ L2. Subspaces play a crucial role in many applications that involve
optimization problems.
Inner Product and Norm
It is natural to define an inner product on L2 as the expected value of a product. That is,
for any X ∈ L2 and Y ∈ L2, we denote the inner product (dot product or scalar product) as ⟨X,Y⟩, and we define

⟨X, Y⟩ ≡ E[XY] .        (11-111)
The Cauchy-Schwarz inequality (11-102) implies that

|⟨X, Y⟩| ≤ √⟨X, X⟩ √⟨Y, Y⟩ < ∞ .        (11-112)

That is, the inner product exists as a real number for every vector X and Y in L2. It can be shown that inner product ⟨X,Y⟩ = E[XY] satisfies the properties
1. X, X  0, and X, X  0 if and only if X = 0 almost surely (i.e., P[X = 0] = 1),
2. X,Y  Y,X and
(11-113)
3. cX,Y  c X,Y , where c  R.
If E[X] = 0, then second moment X, X  E[X 2 ] is the variance of random variable X. Random
variables X and Y are said to be orthogonal if X, Y  E[XY]  0 .
In Part 1) of (11-113), the statement "X = 0 almost surely" is not equivalent to X ≡ 0 (i.e., X identically zero); so, ⟨X, X⟩ = 0 is not equivalent to X ≡ 0. However, the equivalence of ⟨X, X⟩ = 0 and X ≡ 0 is a general requirement of an inner product, as defined in almost all linear algebra books. However, in the applications literature, this subtle "issue" is overlooked, and (11-111) is declared a valid inner product.
Some authors change how random variables are defined/interpreted in an attempt to
remove the phrase “almost surely” from Part 1) of (11-113) and “fix” the above-mentioned
“issue”. They interpret a given X as a class of all random variables that are equal to X almost
surely. Two members of the same class can differ on some set B as long as P[B] = 0. All class
members will have the same expected value; sets of probability zero do not influence
expectations. When computing ⟨X, Y⟩ = E[XY], X represents any member from its class, as
does Y; the expectation will be the same regardless of which class members are used.
Interpreting a random variable as a class of equivalent random variables allows us to “fix” Part 1
of (11-113), removing the phrase “almost surely”. In terms of equivalent classes, the statement
“X = 0” refers to a class of random variables, all of which are zero almost surely.
On a vector space, a vector norm maps vectors into real numbers in a manner that adapts
the concept of length to vectors. Almost universally, the norm of vector X is denoted as ‖X‖. On vector space L2, we define the norm of X as

‖X‖ ≡ √⟨X, X⟩ = √(E[X²])        (11-114)

(we say that the inner product induces the norm). From (11-113) it follows directly that (11-114) satisfies
1. X  0. X  0 if, and only if, X = 0 almost surely (i.e., P[X = 0] = 1),
2. cX  c X , for any c  R and
(11-115)
3. X+Y  X  Y (the triangle inequality).
If E[X] = 0, then X is the standard deviation of X.
In Part 1) of (11-115), P[X = 0] = 1 is not equivalent to X (i.e., X identically zero);
so, X  0 is not equivalent to X  However, the equivalence of X  0 and X  0 is a
general requirement of a vector norm (see any text on linear analysis). However, in the applications literature, this subtle "issue" is overlooked, and (11-114) is declared to be a valid vector norm. This "problem" can be "fixed" by interpreting each random variable as a class, as discussed above.
Often, norm (11-114) is called the mean-square norm since it involves the mean of the square of a random variable. In terms of (11-114), we can restate the Schwarz inequality as

|⟨X, Y⟩| ≤ ‖X‖ ‖Y‖ .        (11-116)
Equation (11-116) is how the Schwarz inequality is usually stated in the analysis literature where
the notions of inner product and norm play central roles.
The triangle inequality (Part 3 of (11-115)) has a form similar to the well-known triangle inequality for real numbers (which states that |r_1 + r_2| ≤ |r_1| + |r_2| for any real numbers r_1 and r_2). This inequality follows from the observation

‖X + Y‖² = ⟨X + Y, X + Y⟩ = ⟨X, X⟩ + 2⟨X, Y⟩ + ⟨Y, Y⟩ ≤ ‖X‖² + 2‖X‖ ‖Y‖ + ‖Y‖²        (11-117)
         = ( ‖X‖ + ‖Y‖ )² ,

which leads to the triangular inequality ‖X + Y‖ ≤ ‖X‖ + ‖Y‖.
The norm (11-114) gives us a way to define the equality of two vectors (random variables). If X and Y are random variables for which

‖X − Y‖ = 0        (11-118)

we say that X = Y in the mean-square sense, or we say X = Y (m.s.). From (11-110), we see that (11-118) is equivalent to P[X = Y] = 1 and P[X ≠ Y] = 0.
Convergence of Random Sequences
Often, one has to deal with sequences of random variables that converge to a random
variable. We say that the random sequence X(n;ζ) converges to random variable X_0(ζ) if for every fixed ζ_0 ∈ S the sequence of numbers X(n; ζ_0) converges to the number X_0(ζ_0). This is
"ordinary", sometimes called point-wise, sequence convergence (a topic that is usually covered
in a Calculus course) that has nothing to do with the fact that we are dealing with random
variables. Also, it is very restrictive. In applications, we can get by with much "weaker" modes
of convergence; we discuss three alternative convergence modes. In what follows, we discuss
almost sure (a.s.) convergence, convergence in probability (i.p.) and mean-square (m.s.)
convergence. Mean square convergence is convergence in the mean-square norm (11-114). We
discuss m.s. convergence first.
Mean-Square Convergence (m.s. Convergence)
As n goes to infinity, a sequence of random variables X(n) ∈ L2 converges in m.s. to a random variable X_0 ∈ L2 if

limit_{n→∞} ‖X_0 − X(n)‖ = 0   ( same as limit_{n→∞} ( E[{X_0 − X(n)}²] )^{½} = 0 ) .        (11-119)
The norm used in (11-119) is the mean-square norm given by (11-114). Often, this type of convergence is denoted as

X(n) →(m.s.) X_0 ,        (11-120)

or

l.i.m_{n→∞} X(n) = X_0 ,        (11-121)
where l.i.m denotes limit in the mean.
Example 11-9: Let Z be a random variable with E[Z²] < ∞ (i.e., Z ∈ L2). Let c_n, n ≥ 0, be a sequence of deterministic real numbers converging to real number c. Then, c_nZ, n ≥ 0, is a sequence of random variables. We show that

l.i.m_{n→∞} c_n Z = cZ .        (11-122)
To see this, consider

E[ |c_n Z − cZ|² ] = E[ |c_n − c|² Z² ] = |c_n − c|² E[Z²] .

Now, c_n → c and E[Z²] < ∞ implies E[|c_nZ − cZ|²] → 0, and this proves (11-122).
Example 11-10: Consider the probability space (S, B, P), where S = [0, 1], B the Borel sets (B is the σ-algebra generated by the open intervals on S; see Chapter 1 of class notes), and

P[B] = ∫_B dζ ,   B ∈ B        (11-123)

(if B is an interval, then P[B] is the interval length. P can be thought of as a "generalized length" of event B). Consider the sequence of random variables defined by

Fig. 6: Sequence of random variables. (Figure not reproduced.)
0 1
X(n; )  1,
n
 0,
1
n
(11-124)
   1,
as illustrated by Fig. 6. This sequence has a point-wise limit given by

limit_{n→∞} X(n; ζ) = 1,   ζ = 0
                    = 0,   ζ ≠ 0 .        (11-125)
This sequence has zero as its mean-square limit since

limit_{n→∞} ‖X(n; ζ) − 0‖² = limit_{n→∞} E[X²(n; ζ)] = limit_{n→∞} [ 1²·(1/n) ] = 0 .        (11-126)
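A Monte Carlo check of (11-126) (added as a sketch): drawing ζ uniformly on [0,1] realizes the probability measure (11-123), and the sample average of X²(n;ζ) is close to 1/n, which goes to zero, confirming the mean-square limit of zero.

import numpy as np

rng = np.random.default_rng(0)
zeta = rng.uniform(0.0, 1.0, size=500_000)     # outcomes drawn according to (11-123)

for n in (1, 2, 10, 100, 1000):
    X_n = (zeta <= 1.0 / n).astype(float)      # X(n; zeta) of Eq. (11-124)
    print(n, np.mean(X_n**2), 1.0 / n)         # estimate of E[X^2(n)] vs 1/n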
Theorem 11-5: Mean-square convergence is additive. That is, if
X 0  l.i.m X(n)
n 
(11-127)
Y0  l.i.m Y(n) ,
n 
then for any real-valued constants a and b we have
aX 0  bY0  l.i.m  aX(n)  bY(n)  .
n 
(11-128)
Proof: Note that

‖{aX(n) + bY(n)} − {aX_0 + bY_0}‖ = ‖a{X(n) − X_0} + b{Y(n) − Y_0}‖
                                  ≤ ‖a{X(n) − X_0}‖ + ‖b{Y(n) − Y_0}‖        (11-129)
                                  = |a| ‖X(n) − X_0‖ + |b| ‖Y(n) − Y_0‖ .
However, Equation (11-127) ensures that the right-hand-side of (11-129) approaches zero as n
approaches infinity, and this proves (11-128).
Not every sequence of random variables has a mean square limit. We need tools and
techniques for determining if a sequence has a mean-square limit. Fortunately, our intuition is
helpful in this regard. Also helpful is some knowledge of real number sequences. Recall that
real number sequences have the Cauchy property. This property states that a real number sequence {r_n} converges if, and only if, |r_n − r_m| → 0 as both n and m approach infinity. When equipped with the Euclidean norm, the set of real numbers is complete, we say. Similarly, sequences in L2 have the Cauchy property. This property states that a sequence X(n) ∈ L2 converges (in the mean square norm) if, and only if, ‖X(n) − X(m)‖ → 0 as both n and m approach infinity. When equipped with the mean square norm, the set of L2 random variables is complete, we say. Stated again, a random sequence X(n) ∈ L2 has a mean-square limit X_0 if, and only if, it is Cauchy (that is, ‖X(n) − X(m)‖ → 0 as both n and m approach infinity).
Mean-Square Cauchy Sequences and Completeness
Let X(n), n  0, be a sequence in L2. The sequence is said to be a mean-square Cauchy
sequence if
limit X(n) - X(m) = 0 .
n,m 
(11-130)
More tersely, we say that the sequence is m.s. Cauchy if (11-130) is true. For a m.s. Cauchy
sequence, the quantity ‖X(n) − X(m)‖ approaches zero as n and m approach infinity, in any manner whatever. Basically, the further you "go out" in a mean-square Cauchy sequence the "closer" (in the mean-square sense) the elements become.
It is easy to show that mean-square convergence implies the mean-square Cauchy property (i.e., (11-120) implies (11-130)). Actually, this is true for arbitrary normed vector spaces (i.e., all convergent sequences are Cauchy sequences, regardless of the normed vector space under consideration). However, for the general normed vector space, Cauchy sequences are not necessarily convergent. But, for L2 space equipped with the mean-square norm, the
mean-square Cauchy property implies mean square convergence. This is stated by the following
theorem.
Theorem 11-6 (Special Case of Riesz-Fischer Theorem)
Vector space L2 is complete in the sense that a mean-square Cauchy sequence has a
unique limit in L2. That is, for sequence X(n) in L2, there exists a unique element X_0 ∈ L2 such that

limit_{n→∞} ‖X_0 − X(n)‖ = 0   ( denoted symbolically as l.i.m_{n→∞} X(n) = X_0 )        (11-131)

if

limit_{n,m→∞} ‖X(n) − X(m)‖ = 0   ( denoted symbolically as l.i.m_{n,m→∞} [X(n) − X(m)] = 0 ) .        (11-132)
Since the converse is true (see paragraph before the theorem statement), (11-131) and (11-132)
are equivalent for vector space L2. In (11-132), one must remember that the double limit is zero
regardless of how n and m approach infinity.
The value of Theorem 11-6 is this: we do not have to know/find the m.s. limit of a sequence to know that the sequence is m.s. convergent. To show that L2 sequence X(n) converges to some m.s. limit X_0, we need not know/find X_0. Instead, to show convergence, it is
sufficient to show that X(n) has elements that come arbitrarily close to one another as you “go
out” in the sequence. In some cases, establishing (11-132) is much easier than finding X0
described by (11-131).
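For readers who like to experiment, the Cauchy criterion (11-132) translates directly into a numerical check. The following Python sketch is illustrative only; the particular sequence X(n) = Σ_{k=1}^{n} Z_k/2^k (with Z_k i.i.d. standard normal) is a hypothetical choice, made so that the m.s. limit never has to be exhibited.

```python
# A minimal Monte Carlo sketch of the mean-square Cauchy check in (11-132).
# The sequence X(n) = sum_{k=1}^{n} Z_k / 2^k (Z_k i.i.d. standard normal) is
# hypothetical -- chosen only because its m.s. limit need not be exhibited.
import numpy as np

rng = np.random.default_rng(0)
trials, n_max = 20000, 60
Z = rng.standard_normal((trials, n_max))
weights = 0.5 ** np.arange(1, n_max + 1)          # 1/2^k
X = np.cumsum(Z * weights, axis=1)                # X(n) for n = 1..n_max, per trial

def ms_distance(n, m):
    """Monte Carlo estimate of E[(X(n) - X(m))^2]."""
    return np.mean((X[:, n - 1] - X[:, m - 1]) ** 2)

for n, m in [(5, 10), (20, 40), (40, 60)]:
    print(n, m, ms_distance(n, m))                # estimates shrink toward zero
```

The printed estimates of E[(X(n) - X(m))²] shrink as n and m grow, which is exactly the behavior (11-132) asks for.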
With the introduction of Theorem 11-6, we have established L2 as a complete vector space with the norm (11-114) that is induced by the inner product (11-111). In the literature, such vector spaces are referred to as Hilbert spaces. They are the natural setting for many significant problems in Fourier series, communication theory, optimal filtering, etc.
Mean-square convergence has a number of useful properties. We discuss the ability to interchange l.i.m. and expectation. Also, we show that a mean-square limit is unique (with equality in the mean-square sense). To develop these results, we must mention some (almost) obvious facts. Note that

$$\operatorname*{l.i.m.}_{n\to\infty} X(n)$$   (11-133)

is a random variable, but

$$\lim_{n\to\infty} E[X(n)]$$   (11-134)

is an "ordinary" limit of an "ordinary" sequence. Also, for any random variable X in L2, we have

$$|E[X]| \le E[|X|] = E[\,|X|\cdot 1\,] \le \bigl(E[X^2]\bigr)^{1/2}\bigl(E[1^2]\bigr)^{1/2} = \|X\|.$$   (11-135)

The first inequality results from the fact that the absolute value of an integral is less than, or equal to, the integral of the absolute value. The second inequality comes from the Cauchy-Schwarz inequality (11-102) with Y = 1. Now, we show that we can interchange expectation and l.i.m.
Theorem 11-7: Let X(n) be a sequence in L2. Suppose X(n) has a m.s. limit X_0 ∈ L2; that is,

$$X(n) \xrightarrow{\;\mathrm{m.s.}\;} X_0 \quad\Longleftrightarrow\quad \lim_{n\to\infty} \|X(n) - X_0\| = 0.$$   (11-136)

Then it follows that

$$E[X_0] = E\Bigl[\operatorname*{l.i.m.}_{n\to\infty} X(n)\Bigr] = \lim_{n\to\infty} E[X(n)].$$   (11-137)

That is, expectation and l.i.m. are interchangeable.

Proof: Since L2 is complete, the mean-square limit X_0 is in L2 (X_0 has a finite second moment), so E[X_0] exists (i.e., the mean is finite). Now, from (11-135), we have

$$\bigl|E[X(n)] - E[X_0]\bigr| = \bigl|E[X(n) - X_0]\bigr| \le E\bigl[\,|X(n) - X_0|\,\bigr] \le \|X(n) - X_0\|.$$   (11-138)

However, from (11-136) we know that the norm on the right-hand side of (11-138) goes to zero as n approaches infinity. Hence, we have the desired result (11-137).
An important use of Theorem 11-7 deals with interchanging expectations and summations. For k = 1, 2, … , let X_k ∈ L2 be a sequence of random variables with finite second moments. Define the nth partial sum

$$Y_n = \sum_{k=1}^{n} X_k.$$   (11-139)

Suppose that

$$Y = \operatorname*{l.i.m.}_{n\to\infty} Y_n = \operatorname*{l.i.m.}_{n\to\infty} \sum_{k=1}^{n} X_k.$$   (11-140)

We say that the partial sum (11-139) converges in mean square to Y. By Theorem 11-7, we can write

$$E[Y] = E\Bigl[\operatorname*{l.i.m.}_{n\to\infty} Y_n\Bigr] = E\Bigl[\operatorname*{l.i.m.}_{n\to\infty} \sum_{k=1}^{n} X_k\Bigr] = \lim_{n\to\infty} \sum_{k=1}^{n} E[X_k] = \sum_{k=1}^{\infty} E[X_k].$$   (11-141)
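As a quick illustration of (11-141), the following sketch compares a Monte Carlo estimate of E[Y_n] with the sum of the individual means. The specific choice X_k = Z_k/2^k with E[Z_k] = 1 is a hypothetical one, used only because Σ_k E[X_k] = Σ_k 2^{-k} = 1 is easy to verify.

```python
# A minimal sketch of (11-141): for a m.s. convergent partial sum, E[Y] equals
# the sum of the individual means.  The choice X_k = Z_k / 2^k with Z_k i.i.d.,
# E[Z_k] = 1, is hypothetical; then sum_k E[X_k] = sum_k 2^{-k} = 1.
import numpy as np

rng = np.random.default_rng(1)
trials, n = 50000, 40
Z = rng.exponential(scale=1.0, size=(trials, n))   # E[Z_k] = 1
X = Z * (0.5 ** np.arange(1, n + 1))               # X_k = Z_k / 2^k
Y_n = X.sum(axis=1)                                # partial sum Y_n, per trial

print(np.mean(Y_n))                                # Monte Carlo estimate of E[Y_n], close to 1
print(sum(0.5 ** k for k in range(1, n + 1)))      # sum of the means, also close to 1
```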
Theorem 11-8: The mean-square limit of a sequence is unique. That is, if

$$X_0 = \operatorname*{l.i.m.}_{n\to\infty} X(n) \;\Longleftrightarrow\; \lim_{n\to\infty}\|X_0 - X(n)\| = 0, \qquad Y_0 = \operatorname*{l.i.m.}_{m\to\infty} X(m) \;\Longleftrightarrow\; \lim_{m\to\infty}\|Y_0 - X(m)\| = 0,$$   (11-142)

then ||X_0 - Y_0|| = 0 and P[X_0 = Y_0] = 1.

Proof: Observe that

$$\|X_0 - Y_0\| = \bigl\|\{X_0 - X(n)\} + \{X(n) - Y_0\}\bigr\| \le \|X_0 - X(n)\| + \|X(n) - Y_0\|$$   (11-143)

from the triangle inequality. Now, on the right-hand side of (11-143), both norms go to zero as a consequence of (11-142). Hence, we have ||X_0 - Y_0|| = 0 as claimed. The fact that P[X_0 = Y_0] = 1 follows immediately from (11-110).
Example 11-11: We are trying to sample a DC voltage (for example, the output of a strain gauge, water tank level detector, etc.). However, our samples contain additive noise; the kth sample is Y(k) = m_dc + ε(k), where m_dc is the DC voltage we are trying to measure, and ε(k) is a real-valued sample of stationary, zero-mean noise with variance σ². We assume that ε(k) is uncorrelated from sample to sample (any two different-indexed samples are uncorrelated). We try the "time-honored" technique of averaging out the noise. That is, we form the average

$$X(n) = \frac{1}{n}\sum_{k=1}^{n} Y(k).$$   (11-144)
Note that X(n) has m_dc as its mean and σ²/n as its variance (indeed, with increasing n, we are "averaging out" the noise). However, the question remains: as n → ∞, does the random sequence X(n) ∈ L2 converge in mean square to a random variable? Let's see if the sequence is mean-square Cauchy; consider

$$\begin{aligned}
\|X(m) - X(n)\|^2 &= E\bigl[\,[\{X(m) - m_{dc}\} - \{X(n) - m_{dc}\}]^2\,\bigr] \\
&= E\bigl[\{X(m) - m_{dc}\}^2\bigr] - 2E\bigl[\{X(m) - m_{dc}\}\{X(n) - m_{dc}\}\bigr] + E\bigl[\{X(n) - m_{dc}\}^2\bigr] \\
&= \frac{\sigma^2}{m} + \frac{\sigma^2}{n} - 2E\bigl[\{X(m) - m_{dc}\}\{X(n) - m_{dc}\}\bigr].
\end{aligned}$$   (11-145)

Consider the case n > m and use the fact that the noise is uncorrelated from sample to sample to evaluate the middle term. Writing X(m) - m_dc = (1/m)Σ_{k=1}^{m} ε(k) and X(n) - m_dc = (1/n)Σ_{j=1}^{n} ε(j), only the m "matching" index pairs k = j contribute, so

$$E\bigl[\{X(m) - m_{dc}\}\{X(n) - m_{dc}\}\bigr] = \frac{1}{mn}\sum_{k=1}^{m}\sum_{j=1}^{n} E[\varepsilon(k)\varepsilon(j)] = \frac{m\sigma^2}{mn} = \frac{\sigma^2}{n}, \qquad n > m.$$   (11-146)

Similarly, E[{X(m) - m_dc}{X(n) - m_dc}] = σ²/m for the case m > n; in either case, the cross term equals σ²/max{n, m}. Therefore, we can write (11-145) as

$$\|X(m) - X(n)\|^2 = \sigma^2\left[\frac{1}{m} + \frac{1}{n} - \frac{2}{\max\{n, m\}}\right] = \sigma^2\left|\frac{1}{m} - \frac{1}{n}\right|.$$   (11-147)
As m and n approach infinity (in any order), (11-147) approaches zero, so the sequence is mean-square Cauchy. By Theorem 11-6, the sequence is mean-square convergent. But what is its limit? The obvious "candidate" is m_dc. To see that this is the limit, consider

$$\lim_{n\to\infty} \|X(n) - m_{dc}\| = \lim_{n\to\infty} \left\|\frac{1}{n}\sum_{k=1}^{n} \varepsilon(k)\right\| = \lim_{n\to\infty} \frac{\sigma}{\sqrt{n}} = 0.$$   (11-148)

So, we see that X(n) converges in mean square to m_dc (and we can expect to get "better" results the more samples are included in the average).
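A short simulation of this example is easy to set up. The numbers below (m_dc = 5 volts, σ = 2, Gaussian noise) are hypothetical choices; the example itself requires only zero-mean, uncorrelated noise with finite variance.

```python
# A minimal simulation sketch of Example 11-11, assuming (hypothetically)
# m_dc = 5.0 volts and Gaussian noise with sigma = 2.0.
import numpy as np

rng = np.random.default_rng(2)
m_dc, sigma = 5.0, 2.0
trials, n_max = 5000, 1000
Y = m_dc + rng.normal(0.0, sigma, size=(trials, n_max))        # noisy samples Y(k)
X = np.cumsum(Y, axis=1) / np.arange(1, n_max + 1)              # running average X(n), per trial

for n in (10, 100, 1000):
    ms_error = np.mean((X[:, n - 1] - m_dc) ** 2)               # estimate of ||X(n) - m_dc||^2
    print(n, ms_error, sigma**2 / n)                            # compare with sigma^2 / n
```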
With Example 11-11, we have established a mean-square Law of Large Numbers for sequences of uncorrelated random variables. More generally, let Y_k, k = 1, 2, … , be a sequence of uncorrelated random variables with common mean E[Y_k] = m and common variance VAR[Y_k] = σ². Then the sample mean

$$X(n) = \frac{1}{n}\sum_{k=1}^{n} Y(k)$$   (11-149)

converges in mean square to m.

In a subsequent section, we will show that (11-149) converges to m in probability, a yet-to-be-defined mode of convergence that is weaker than mean-square convergence. That the sample mean (11-149) converges in probability to m is just the well-known Law of Large Numbers (weak version) that is cited often in the popular press.
Example 11-12: Let X(k), k ≥ 1, be a sequence of independent random variables, each of which is either 1 or 0. Furthermore, suppose that

$$P[X(k) = 1] = 1/k, \qquad P[X(k) = 0] = 1 - 1/k.$$   (11-150)

As k → ∞, does X(k) converge in mean square? Let's check the obvious candidate X_0 = 0; consider

$$\lim_{n\to\infty} \|X(n) - 0\| = \lim_{n\to\infty} \frac{1}{\sqrt{n}} = 0.$$   (11-151)

So, we see that X(n) converges in mean square to the random variable X_0 = 0. However, in Example 11-16, we will see that X(n) does not converge (to zero) in a point-wise manner.
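Here E[X(k)²] = 1·(1/k) + 0·(1 − 1/k) = 1/k, which is where the 1/√k in (11-151) comes from. The sketch below simply checks this second moment by simulation; the trial count is arbitrary.

```python
# A minimal sketch for Example 11-12: E[(X(k) - 0)^2] = 1/k, so the m.s. norm
# ||X(k) - 0|| = 1/sqrt(k) -> 0.  The Monte Carlo below checks the second moment.
import numpy as np

rng = np.random.default_rng(3)
trials = 200000
for k in (10, 100, 1000):
    X_k = (rng.random(trials) < 1.0 / k).astype(float)   # 1 with probability 1/k, else 0
    print(k, np.mean(X_k ** 2), 1.0 / k)                 # estimate versus the exact value 1/k
```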
Example 11-13: Let X(k), k ≥ 1, be a sequence of independent random variables similar to the previous example. However, suppose that X(k) is either k or 0 with

$$P[X(k) = k] = 1/k^2, \qquad P[X(k) = 0] = 1 - 1/k^2.$$   (11-152)

So, as k becomes large, X(k) takes on a larger value with a smaller probability. Is X(k) mean-square convergent? To find out, note that E[X(k)²] = k²(1/k²) = 1 and consider, for m ≠ n,

$$\|X(m) - X(n)\|^2 = E\bigl[X(m)^2 - 2X(m)X(n) + X(n)^2\bigr] = 1 - \frac{2}{mn} + 1 = 2\Bigl[1 - \frac{1}{nm}\Bigr],$$   (11-153)

a result that converges to 2 as m and n approach infinity. Hence, X(n) is not mean-square Cauchy; hence, it is not mean-square convergent. The last two examples illustrate the fact that mean-square convergence depends on both the numerical values a sequence takes on and the probabilities of taking on those values.
Theorem 11-7 tells us that expectation and l.i.m. are interchangeable for m.s. convergent sequences. A similar result holds for the inner product operation defined by (11-111).

Theorem 11-9 (Continuity of the Inner Product): Let X(n) and Y(m) be m.s. convergent sequences with m.s. limits X_0 and Y_0, respectively, so that

$$X_0 = \operatorname*{l.i.m.}_{n\to\infty} X(n) \;\Longleftrightarrow\; \lim_{n\to\infty}\|X_0 - X(n)\| = 0, \qquad Y_0 = \operatorname*{l.i.m.}_{m\to\infty} Y(m) \;\Longleftrightarrow\; \lim_{m\to\infty}\|Y_0 - Y(m)\| = 0.$$   (11-154)

Under these conditions, we claim that

$$\langle X_0, Y_0\rangle = \Bigl\langle \operatorname*{l.i.m.}_{n\to\infty} X(n),\; \operatorname*{l.i.m.}_{m\to\infty} Y(m) \Bigr\rangle = \lim_{n,m\to\infty} \langle X(n), Y(m)\rangle.$$   (11-155)
Proof: First, consider the simple algebra

$$\begin{aligned}
\bigl|\langle X(n), Y(m)\rangle - \langle X_0, Y_0\rangle\bigr|
&= \bigl|\langle X(n), Y(m)\rangle - \langle X(n), Y_0\rangle + \langle X(n), Y_0\rangle - \langle X_0, Y_0\rangle\bigr| \\
&= \bigl|\langle X(n), Y(m) - Y_0\rangle + \langle X(n) - X_0, Y_0\rangle\bigr| \\
&\le \bigl|\langle X(n), Y(m) - Y_0\rangle\bigr| + \bigl|\langle X(n) - X_0, Y_0\rangle\bigr| \\
&\le \|X(n)\|\,\|Y(m) - Y_0\| + \|X(n) - X_0\|\,\|Y_0\|.
\end{aligned}$$   (11-156)

Now, since X(n) → X_0 in mean square as n → ∞, the sequence of norms ||X(n)|| is bounded (can you show this?), say ||X(n)|| < M. Use this fact, (11-154) and (11-156) to conclude

$$\lim_{n,m\to\infty} \bigl|\langle X(n), Y(m)\rangle - \langle X_0, Y_0\rangle\bigr| \le \lim_{n,m\to\infty} \bigl[\, M\,\|Y(m) - Y_0\| + \|X(n) - X_0\|\,\|Y_0\| \,\bigr] = 0,$$   (11-157)

a result that proves (11-155) and the continuity of the inner product.
Theorem 11-9 establishes continuity of the inner product ⟨X, Y⟩ ≡ E[XY]. What we mean by this is simple. Suppose we are given sequences X(n) and Y(m) with m.s. limits X_0 and Y_0, respectively, as described by (11-154). For "large" n and m, X(n) and Y(m) "get close" to X_0 and Y_0, respectively, and ⟨X(n), Y(m)⟩ ≡ E[X(n)Y(m)] "gets close" to ⟨X_0, Y_0⟩ ≡ E[X_0 Y_0]. This intuitive idea is known as continuity of the inner product.
Convergence in Probability (i.p. Convergence)

Some results that involve mean-square convergence of random sequences can be generalized to a "weaker" convergence mode. This new mode is called convergence in probability. It is "weaker" (i.e., more general) than m.s. convergence; m.s. convergent sequences also converge in probability, but the converse is not true.

As n → ∞, a random sequence X(n) converges in probability (i.p.) to a random variable X_0 if, for every ε > 0, we have

$$\lim_{n\to\infty} P\bigl[\,|X(n) - X_0| > \varepsilon\,\bigr] = 0.$$   (11-158)

Often, this type of convergence is denoted by either of

$$X(n) \xrightarrow{\;\mathrm{i.p.}\;} X_0$$   (11-159)

$$\operatorname*{l.i.p.}_{n\to\infty} X(n) = X_0.$$   (11-160)
For convergence in probability, many of the results parallel those given above for m.s. convergence. First, as we "go out" in a sequence (i.e., as the index becomes large), it may become increasingly likely that the terms are close together (this does not mean that the terms must be closer together in the m.s. sense). We say that a random sequence X(n) is Cauchy in probability if, for every ε > 0, we have

$$\lim_{n,m\to\infty} P\bigl[\,|X(m) - X(n)| > \varepsilon\,\bigr] = 0.$$   (11-161)

Cauchy in probability is a "weaker" condition than Cauchy in the mean-square sense: a sequence that is mean-square Cauchy is also Cauchy in probability, but the converse is not true. That is, condition (11-130) implies condition (11-161), but not conversely. Next, we provide a theorem that does for convergence in probability what Theorem 11-6 did for convergence in mean square.
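Definition (11-158) also suggests a direct numerical test: simulate many realizations of X(n) and count how often |X(n) - X_0| exceeds ε. The sketch below is generic; the particular sampler shown (the Example 11-12 sequence) is only a placeholder.

```python
# A minimal sketch of estimating the defining probability in (11-158) by Monte
# Carlo.  The sampler below (the Example 11-12 sequence, equal to 1 with
# probability 1/n and 0 otherwise) is a placeholder; any simulable sequence works.
import numpy as np

rng = np.random.default_rng(7)

def sample_X(n, trials):
    """Draw `trials` independent realizations of X(n) for the placeholder sequence."""
    return (rng.random(trials) < 1.0 / n).astype(float)

def prob_far_from(x0, n, eps, trials=200000):
    """Monte Carlo estimate of P[|X(n) - x0| > eps]."""
    return np.mean(np.abs(sample_X(n, trials) - x0) > eps)

for n in (10, 100, 1000):
    print(n, prob_far_from(0.0, n, eps=0.5))   # about 1/n, consistent with i.p. convergence to 0
```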
Theorem 11-10: As n → ∞, a sequence X(n) converges in probability to a random variable X_0 if, and only if, the sequence is Cauchy in probability.

Proof: First, we show that if X(n) converges in probability to X_0 then it is Cauchy in probability. Suppose that the sequence converges in probability. Then note the event (i.e., set) relationship

$$\bigl\{\,|X(m) - X(n)| > \varepsilon\,\bigr\} \subset \bigl\{\,|X(m) - X_0| > \varepsilon/2\,\bigr\} \cup \bigl\{\,|X(n) - X_0| > \varepsilon/2\,\bigr\},$$   (11-162)

as depicted by Figure 11-7. From (11-162), we see that

$$P\bigl[\,|X(m) - X(n)| > \varepsilon\,\bigr] \le P\bigl[\,|X(m) - X_0| > \varepsilon/2\,\bigr] + P\bigl[\,|X(n) - X_0| > \varepsilon/2\,\bigr].$$   (11-163)

[Figure 11-7: If |X(n) - X(m)| > ε then either |X(n) - X_0| > ε/2 or |X(m) - X_0| > ε/2.]

Now, since X(n) converges to X_0 in probability, both terms on the right-hand side of (11-163) approach zero as n and m approach infinity. Hence, the sequence is Cauchy in probability as claimed. The converse (if X(n) is Cauchy in probability then it converges in probability) is harder to prove and is not given here (see M. Loève, Probability Theory I, 4th Edition, pp. 117-118).
Theorem 11-11: If a sequence converges in probability, then the limit is unique. That is, suppose X(n) converges in probability to both X_0 and Y_0. Then it necessarily follows that P[X_0 ≠ Y_0] = 0.

Proof: Using the same reasoning that led to (11-163), we can write

$$\bigl\{\,|X_0 - Y_0| > \varepsilon\,\bigr\} \subset \bigl\{\,|X_0 - X(n)| > \varepsilon/2\,\bigr\} \cup \bigl\{\,|Y_0 - X(n)| > \varepsilon/2\,\bigr\}$$   (11-164)

$$P\bigl[\,|X_0 - Y_0| > \varepsilon\,\bigr] \le P\bigl[\,|X_0 - X(n)| > \varepsilon/2\,\bigr] + P\bigl[\,|Y_0 - X(n)| > \varepsilon/2\,\bigr].$$   (11-165)

However, both terms on the right-hand side of (11-165) approach zero as n approaches infinity. Hence, for every ε > 0 we have

$$P\bigl[\,|X_0 - Y_0| > \varepsilon\,\bigr] = 0,$$   (11-166)

so that

$$\lim_{\varepsilon\to 0^{+}} P\bigl[\,|X_0 - Y_0| > \varepsilon\,\bigr] = 0.$$   (11-167)

Continuity of the probability measure (see Appendix 11B) and (11-167) lead to the conclusion

$$P\bigl[\,|X_0 - Y_0| > 0\,\bigr] = 0,$$   (11-168)

and this establishes the claim that P[X_0 ≠ Y_0] = 0. ∎
As claimed previously, convergence in mean square implies convergence in probability. This claim is substantiated by the following theorem (which is a nice application of the Tchebycheff inequality).

Theorem 11-12: Convergence in mean square implies convergence in probability.

Proof: Let X(n) be a sequence that converges in mean square to the random variable X_0. For each n, apply the generalized Tchebycheff inequality (see Chapter 2 of these notes) to X(n) - X_0 and obtain

$$P\bigl[\,|X(n) - X_0| > \varepsilon\,\bigr] \le \frac{E\bigl[\,|X(n) - X_0|^2\,\bigr]}{\varepsilon^2} = \frac{\|X(n) - X_0\|^2}{\varepsilon^2}$$   (11-169)

for every ε > 0. However, we know that X(n) → X_0 in mean square, so that ||X(n) - X_0|| → 0 as n → ∞. Hence, with (11-169), we have

$$\lim_{n\to\infty} P\bigl[\,|X(n) - X_0| > \varepsilon\,\bigr] = 0,$$   (11-170)

so that X(n) → X_0 in probability, as claimed. ∎
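The Tchebycheff bound in (11-169) is easy to visualize numerically. The sketch below applies it to the sample-mean sequence of Example 11-11 with hypothetical values m_dc = 0, σ = 1, ε = 0.1; the estimated exceedance probability always sits below the bound σ²/(nε²).

```python
# A minimal numerical illustration of the Tchebycheff bound (11-169) for the
# sample-mean sequence of Example 11-11, assuming (hypothetically) m_dc = 0,
# sigma = 1, eps = 0.1 and Gaussian noise; for Gaussian noise the sample mean
# X(n) - m_dc is exactly N(0, sigma^2/n), so it can be sampled directly.
import numpy as np

rng = np.random.default_rng(4)
trials, eps, sigma = 200000, 0.1, 1.0
for n in (10, 100, 1000):
    X_n = rng.normal(0.0, sigma / np.sqrt(n), size=trials)   # realizations of X(n) - X_0
    p_hat = np.mean(np.abs(X_n) > eps)                       # estimate of P[|X(n) - X_0| > eps]
    bound = (sigma**2 / n) / eps**2                          # ||X(n) - X_0||^2 / eps^2
    print(n, p_hat, bound)                                   # the estimate never exceeds the bound
```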
Let's reconsider Examples 11-11 and 11-12, both of which provided sequences that converge in the mean-square sense. Now, we know that these sequences also converge in probability, as implied by Theorem 11-12. Actually, that the sequence in Example 11-11 converges in probability is just a statement of the Law of Large Numbers (weak version).

Theorem 11-13 (The Weak Law of Large Numbers): Let X(n) be a sequence of independent, identically distributed (i.i.d.) random variables with mean μ_X and variance σ_X². Then, the sample mean

$$\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} X(k)$$   (11-171)

converges in probability to the "real" mean μ_X as n approaches infinity.

Proof: The proof of this theorem follows from Example 11-11 and Theorem 11-12.

The Law of Large Numbers is the basis for estimating μ_X from measurements. In applications, it is common to take the sample mean (11-171) as an estimate of the "real" mean μ_X.
Example 11-14: In Example 11-13, we considered a sequence of independent random variables X(k), k ≥ 1, with

P[X(k) = k] = 1/k²
P[X(k) = 0] = 1 - 1/k².

We found that this sequence does not converge in the mean-square sense (a "sufficient number" of the sample-function sequences contain a "sufficient number" of instances where X(k) = k, so that m.s. convergence is not achieved). Now, we show that it does converge in probability to X_0 = 0. For every ε > 0 (and k > ε, so that the event {X(k) > ε} is just {X(k) = k}), we have

$$\lim_{k\to\infty} P\bigl[\,|X(k) - X_0| > \varepsilon\,\bigr] = \lim_{k\to\infty} P\bigl[\,X(k) > \varepsilon\,\bigr] = \lim_{k\to\infty} P\bigl[\,X(k) = k\,\bigr] = \lim_{k\to\infty} \frac{1}{k^2} = 0,$$   (11-172)

and we see that the sequence converges in probability to zero (in (11-172), only the probabilities that X(k) = k are involved; for k ≥ 1, the actual numerical values of X(k) do not enter into the computation).
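The contrast between the two modes can be seen numerically: the exceedance probability P[|X(k)| > ε] = 1/k² vanishes, while the second moment E[X(k)²] = k²(1/k²) = 1 does not. The sketch below checks both by simulation; the values of k and the trial count are arbitrary.

```python
# A minimal sketch for Example 11-14: P[|X(k)| > eps] = 1/k^2 -> 0, yet
# E[X(k)^2] = k^2 * (1/k^2) = 1 for every k, which is why the sequence converges
# in probability but not in mean square.
import numpy as np

rng = np.random.default_rng(5)
trials, eps = 1_000_000, 0.5
for k in (5, 20, 50):
    X_k = np.where(rng.random(trials) < 1.0 / k**2, float(k), 0.0)
    print(k, np.mean(np.abs(X_k) > eps), np.mean(X_k ** 2))   # about 1/k^2 and about 1
```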
The converse of Theorem 11-12 is not true (convergence in probability does not imply convergence in mean square), and Example 11-14 is a counterexample that establishes this fact. Basically, convergence in mean square depends upon both the numerical values of the sequence elements and the probabilities associated with those values. On the other hand, convergence in probability is concerned only with the probabilities.
Example 11-15: For convergence in probability, this example shows that one cannot interchange the limit and expectation operations. For n ≥ 1, consider the sequence X(n), where X(n) is either -1 or n. Also, suppose that

$$P[X(n) = n] = 1/n, \qquad P[X(n) = -1] = 1 - 1/n.$$   (11-173)

The sequence converges in probability to X_0 = -1 since

$$\lim_{k\to\infty} P\bigl[\,|X(k) - X_0| > \varepsilon\,\bigr] = \lim_{k\to\infty} P\bigl[\,|X(k) - (-1)| > \varepsilon\,\bigr] = \lim_{k\to\infty} P\bigl[\,X(k) = k\,\bigr] = \lim_{k\to\infty} \frac{1}{k} = 0.$$   (11-174)

Now, we look at mean values. Clearly, E[X_0] = -1, and

$$E[X(n)] = n\left(\frac{1}{n}\right) + (-1)\left(1 - \frac{1}{n}\right) = \frac{1}{n},$$   (11-175)

which has a zero limit as n approaches infinity. Hence, we have shown that

$$\lim_{n\to\infty} E[X(n)] = 0 \ne E\Bigl[\operatorname*{l.i.p.}_{n\to\infty} X(n)\Bigr] = E[X_0] = -1.$$   (11-176)
Example 11-15 serves as a counterexample showing that you cannot, in general, interchange the operations of limit in probability and expectation. That is, it is not generally true that

$$E\Bigl[\operatorname*{l.i.p.}_{n\to\infty} X(n)\Bigr] \quad\text{and}\quad \lim_{n\to\infty} E\bigl[X(n)\bigr]$$

produce the same value. (This differs from mean-square convergence; recall that Theorem 11-7 proved that expectation and l.i.m. are interchangeable.) So, while convergence in probability is very general (and weak), there are limitations on what you can do with it.
Convergence Almost Surely (a.s. Convergence)

The last form of convergence we will study is called almost-sure (a.s.) convergence. The random sequence X(n) converges almost surely to the random variable X_0 if the sequence of functions X(n;ζ) converges to X_0(ζ) for all ζ ∈ S except possibly on a set of probability zero (recall that S denotes the sample space). Almost-sure convergence requires that

$$P\Bigl[\lim_{n\to\infty} X(n) = X_0\Bigr] = P\Bigl[\,\bigl\{\zeta \in S : \lim_{n\to\infty} X(n;\zeta) = X_0(\zeta)\bigr\}\,\Bigr] = 1.$$   (11-177)

In other words, X(n) converges almost surely to random variable X_0 if there exists an event A, with P(A) = 1 (and P(Ā) = 0), for which X(n;ζ) → X_0(ζ) for all ζ ∈ A. Often we write

$$X(n) \xrightarrow{\;\mathrm{a.s.}\;} X_0.$$   (11-178)
Obviously, this type of convergence is "weaker" than pointwise (p.w.) convergence (p.w. convergence requires that X(n;ζ) → X_0(ζ) for all ζ ∈ S). However, as shown below, almost-sure (a.s.) convergence implies convergence in probability (i.p.). And, it neither implies, nor is implied by, convergence in mean square (m.s.). In the literature, a.s. convergence also goes by the names convergence almost everywhere and convergence with probability one (other names are used as well).
Like convergence in mean square and in probability, in the context of almost-sure convergence it is possible to examine the separation, or distance, between sequence elements as we "go farther out" in a sequence. We say that X(n) is an almost-surely Cauchy sequence if

$$P\Bigl[\lim_{n,m\to\infty} |X(n) - X(m)| = 0\Bigr] = P\Bigl[\,\bigl\{\zeta \in S : \lim_{n,m\to\infty} |X(n;\zeta) - X(m;\zeta)| = 0\bigr\}\,\Bigr] = 1.$$   (11-179)

In other words, there exists an event A, P(A) = 1, for which

$$\lim_{n,m\to\infty} |X(n;\zeta) - X(m;\zeta)| = 0$$   (11-180)

for all ζ ∈ A. To establish that X(n) is an almost-surely Cauchy sequence, we do not require knowledge of a sequence limit.
With regard to necessary and sufficient conditions involving the Cauchy criterion, almost-sure convergence parallels m.s. and i.p. convergence. To show almost-sure convergence of a sequence, it is not necessary to produce a limit (in the almost-sure sense) for the sequence. Instead, as shown by the following theorem, we can use the Cauchy criterion.

Theorem 11-14: A sequence X(n) is almost-surely convergent if, and only if, it is an almost-surely Cauchy sequence.

Proof: This theorem follows from the fact that, in the real number system, sequences of real numbers converge if, and only if, they are Cauchy sequences.
A practical and useful test for almost-sure convergence is given by the following theorem.

Theorem 11-15: Let X(n) denote a sequence of random variables. Suppose that X(n) converges to random variable X_0 almost surely; that is, we suppose that

$$X(n) \xrightarrow{\;\mathrm{a.s.}\;} X_0.$$   (11-181)

Then, for every ε > 0 we have

$$\lim_{m\to\infty} P\bigl[\,|X(n) - X_0| \le \varepsilon \text{ for all } n \ge m\,\bigr] = \lim_{m\to\infty} P\Bigl[\,\bigcap_{n=m}^{\infty} \bigl\{|X(n) - X_0| \le \varepsilon\bigr\}\,\Bigr] = 1,$$   (11-182)

which we write as

$$\lim_{m\to\infty} P\bigl[A_m\bigr] = 1,$$   (11-183)

where A_m is defined as

$$A_m \equiv \bigl\{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| \le \varepsilon \text{ for all } n \ge m\bigr\} = \bigcap_{n \ge m} \bigl\{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| \le \varepsilon\bigr\},$$   (11-184)

an event that depends on m and ε. The converse is true as well; hence, (11-182) and (11-181) are equivalent (i.e., one implies the other).
Note: The sequence A_m, m ≥ 0, is nested increasing with m; that is, A_m ⊂ A_{m+1} for all m and all ε > 0. Also, the complement of (11-184) is (DeMorgan's laws come in handy here)

$$\overline{A}_m = \bigl\{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| > \varepsilon \text{ for some } n \ge m\bigr\} = \bigcup_{n \ge m} \bigl\{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| > \varepsilon\bigr\}.$$   (11-185)

So, Theorem 11-15 is sometimes stated as: X(n) → X_0 almost surely if, and only if, for all ε > 0 we have

$$\lim_{m\to\infty} P\bigl[\,|X(n) - X_0| > \varepsilon \text{ for some } n \ge m\,\bigr] = \lim_{m\to\infty} P\bigl[\overline{A}_m\bigr] = 0.$$   (11-186)
Proof: First, suppose that X(n) → X_0 almost surely. Then, there exists an event Ω_1 for which

$$P[\Omega_1] = 1, \qquad P[\{S - \Omega_1\}] = P[\overline{\Omega}_1] = 0, \qquad \lim_{n\to\infty} X(n;\zeta) = X_0(\zeta) \text{ for each } \zeta \in \Omega_1.$$   (11-187)

Now, we show that Ω_1 ⊂ ∪_{k=1}^{∞} A_k. Take any ζ_0 ∈ Ω_1. As shown by (11-187), X(n;ζ_0) converges in an "ordinary" sense to X_0(ζ_0); this means that, given any ε > 0, there exists an integer m(ζ_0,ε) (integer m depends on ζ_0 and ε) with the property

$$|X(n;\zeta_0) - X_0(\zeta_0)| \le \varepsilon$$   (11-188)

for n ≥ m(ζ_0,ε). Hence, we see that ζ_0 ∈ A_k for all k ≥ m(ζ_0,ε); that is, we can write

$$\zeta_0 \in \Omega_1 \;\Rightarrow\; \zeta_0 \in A_k = \bigcap_{n=k}^{\infty} \bigl\{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| \le \varepsilon\bigr\}, \qquad k \ge m(\zeta_0,\varepsilon).$$   (11-189)

Since the A_k are nested increasing, we have

$$\Omega_1 \subset \bigcup_{k=1}^{\infty} A_k.$$   (11-190)

Since P(Ω_1) = 1, Equation (11-190) yields

$$P\Bigl[\,\bigcup_{k=1}^{\infty} A_k\,\Bigr] = 1.$$   (11-191)
This leads to the conclusion

$$1 = P\Bigl[\,\bigcup_{k=1}^{\infty} A_k\,\Bigr] = P\Bigl[\,\lim_{n\to\infty} \bigcup_{k=1}^{n} A_k\,\Bigr] = \lim_{n\to\infty} P\Bigl[\,\bigcup_{k=1}^{n} A_k\,\Bigr] = \lim_{n\to\infty} P(A_n)$$   (11-192)

(the last equality holds because the A_k are nested increasing, so that ∪_{k=1}^{n} A_k = A_n),
and we have proven that (11-181), which states X(n) → X_0 almost surely, implies (11-182), which states P[A_n] → 1 as n → ∞. Now, we show the converse; we show that (11-182) implies (11-181). We do this by showing that a false (11-181) implies a false (11-182) (this is the contrapositive of the statement "(11-182) implies (11-181)"). Hence, assume that (11-181) is false and show that P[A_m] does not approach unity as m → ∞ (i.e., (11-182) is false). If (11-181) is false, there exists an event Λ, P(Λ) > 0, such that X(n;ζ) does not converge to X_0(ζ) for ζ ∈ Λ (i.e., convergence does not occur for ζ ∈ Λ). Now, consider the random variable

$$Z(\zeta) \equiv \limsup_{n\to\infty} |X(n;\zeta) - X_0(\zeta)|, \qquad \zeta \in S.$$   (11-193)

The event {ζ ∈ S : Z(ζ) > 0} can be expressed as

$$\{\zeta \in S : Z(\zeta) > 0\} = \bigcup_{n \ge 1} \{\zeta \in S : Z(\zeta) > 1/n\}.$$   (11-194)

For each ζ_0 ∈ Λ, we have Z(ζ_0) > 0, so ζ_0 ∈ {ζ ∈ S : Z(ζ) > 0}; this fact implies that

$$\Lambda \subset \{\zeta \in S : Z(\zeta) > 0\}.$$   (11-195)

Now, P(Λ) > 0 implies P({ζ ∈ S : Z(ζ) > 0}) > 0 and the existence of some integer n_0 for which the event {ζ ∈ S : Z(ζ) > 1/n_0} has a strictly positive probability (to see this, equate the probabilities of both sides of (11-194) and use the continuity of P). That is, we have

$$P\bigl[\,\{\zeta \in S : Z(\zeta) > 1/n_0\}\,\bigr] > 0.$$   (11-196)

But this positive-probability event is contained in the complement of A_m, m ≥ 1, defined using ε = 1/n_0. This observation is written as

$$\{\zeta \in S : Z(\zeta) > 1/n_0\} \subset \overline{A}_m = \bigcup_{n=m}^{\infty} \{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| > 1/n_0\},$$   (11-197)

for every integer m (apply DeMorgan's law to (11-184) to get this complement). Hence, for every integer m, we have

$$P(\overline{A}_m) \ge P\bigl(\,\{\zeta \in S : Z(\zeta) > 1/n_0\}\,\bigr) > 0,$$   (11-198)

so that P(Ā_m) is bounded away from zero, and P(A_m) is bounded away from unity, as m → ∞. Hence, Equation (11-183) (equivalently, Equation (11-182)) cannot be true; we have shown that a false (11-181) implies a false (11-182) (equivalently, we have shown that (11-182) implies (11-181)). ∎
Theorem 11-16: Almost-sure (a.s.) convergence implies convergence in probability (i.p.).

Proof: This is easy to show. Suppose that X(n) → X_0 almost surely (a.s.), so that P(Ā_m) → 0 as m → ∞ for any fixed (but arbitrary) ε > 0 used in the definition of A_m. Note that

$$\{\zeta \in S : |X(m;\zeta) - X_0(\zeta)| > \varepsilon\} \subset \overline{A}_m = \bigcup_{n=m}^{\infty} \{\zeta \in S : |X(n;\zeta) - X_0(\zeta)| > \varepsilon\}.$$   (11-199)

Hence, P(Ā_m) → 0 as m → ∞ implies that P[{|X(m) - X_0| > ε}] → 0 as m → ∞, and we have X(m) → X_0 in probability (i.p.).

Theorem 11-16 shows that a.s. convergence implies convergence in probability; however, the converse is not true, as shown by the next example.
Example 11-16: This example shows that convergence in mean square (m.s.) does not imply convergence almost surely (a.s.). Recall that Example 11-12 discussed a binary random sequence X(k) of independent random variables with

$$P[X(k) = 1] = 1/k, \qquad P[X(k) = 0] = 1 - 1/k.$$   (11-200)

In Example 11-12, we saw that X(k) converges in mean square (m.s.) to X_0 = 0 (hence, it also converges in probability (i.p.) to X_0 = 0). Now, we show that this sequence does not converge almost surely (a.s.). In terms of A_m given by (11-184), observe that, for 0 < ε < 1,

$$\begin{aligned}
\lim_{n\to\infty} P[A_n] &= \lim_{n\to\infty} P\Bigl[\,\bigcap_{m=n}^{\infty} \{|X(m) - X_0| \le \varepsilon\}\,\Bigr] = \lim_{n\to\infty} P\Bigl[\,\bigcap_{m=n}^{\infty} \{X(m) = 0\}\,\Bigr] \\
&= \lim_{n\to\infty} \Bigl(1 - \tfrac{1}{n}\Bigr)\Bigl(1 - \tfrac{1}{n+1}\Bigr)\cdots = \lim_{n\to\infty} \prod_{m=0}^{\infty}\Bigl(1 - \tfrac{1}{n+m}\Bigr) \\
&\le \lim_{n\to\infty} \exp\Bigl(-\sum_{m=0}^{\infty} \tfrac{1}{n+m}\Bigr) = 0,
\end{aligned}$$   (11-201)

where we used independence of the X(m), the bound 1 - x ≤ e^{-x}, and the divergence of the harmonic series. Since this limit is not unity, X(m) cannot converge almost surely to X_0 = 0 (study again Equation (11-182)). What we have provided here is a counterexample showing that mean-square (m.s.) convergence does not imply almost-sure (a.s.) convergence. Also, the example shows that convergence in probability (i.p.) does not imply convergence almost surely (a.s.). See Stark and Woods (3rd Edition), Example 6.7-3, p. 381 for a similar example.
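The product appearing in (11-201) can also be evaluated directly. For fixed n, Π_{m=n}^{N}(1 - 1/m) telescopes to (n - 1)/N, which tends to zero as N grows; the sketch below confirms this numerically (the cutoffs are arbitrary).

```python
# A minimal numerical check of (11-201): for fixed n, the probability that
# X(m) = 0 for every n <= m <= N is prod_{m=n}^{N} (1 - 1/m), which tends to 0
# as N grows because the harmonic series diverges.
import numpy as np

n = 100
for N in (10**3, 10**5, 10**6):
    m = np.arange(n, N + 1, dtype=float)
    prob_all_zero = np.exp(np.sum(np.log1p(-1.0 / m)))   # prod (1 - 1/m), computed in the log domain
    print(N, prob_all_zero)                              # equals (n - 1)/N exactly, by telescoping
```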
Example 11-17: This example shows that convergence almost surely (a.s.) does not imply convergence in mean square (m.s.). Recall that Example 11-13 presented a binary random sequence X(k) of independent random variables with

$$P[X(k) = k] = 1/k^2, \qquad P[X(k) = 0] = 1 - 1/k^2.$$   (11-202)

As shown by Example 11-13, this sequence is not mean-square (m.s.) convergent. We show that X(k) converges almost surely (a.s.) to X_0 = 0. In terms of Ā_m defined by (11-199), observe that (for ε > 0, using the union bound)

$$\lim_{n\to\infty} P[\overline{A}_n] = \lim_{n\to\infty} P\Bigl[\,\bigcup_{m=n}^{\infty} \{|X(m) - X_0| > \varepsilon\}\,\Bigr] = \lim_{n\to\infty} P\Bigl[\,\bigcup_{m=n}^{\infty} \{X(m) = m\}\,\Bigr] \le \lim_{n\to\infty} \sum_{m=n}^{\infty} \frac{1}{m^2} = 0.$$   (11-203)

Equivalently, in terms of A_n given by (11-184), this last result implies that

$$\lim_{n\to\infty} P[A_n] = \lim_{n\to\infty} P\Bigl[\,\bigcap_{m=n}^{\infty} \{|X(m) - X_0| \le \varepsilon\}\,\Bigr] = 1.$$   (11-204)

From Theorem 11-15 (see Equation (11-182)), we can conclude that X(n) converges almost surely (a.s.) to X_0 = 0. Together with Example 11-13, this example shows that convergence almost surely (a.s.) does not imply convergence in mean square (m.s.). Also, this example shows that convergence in probability (i.p.) does not imply convergence in mean square (m.s.).
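The tail sum in (11-203) is easy to evaluate numerically; for large n it behaves roughly like 1/n, so P[Ā_n] → 0. In the sketch below, the finite cutoff standing in for the infinite upper limit is arbitrary.

```python
# A minimal check of the bound in (11-203): sum_{m >= n} 1/m^2 is small for
# large n, so the probability that X(m) = m for some m >= n goes to zero.
import numpy as np

for n in (10, 100, 1000):
    m = np.arange(n, 10**6, dtype=float)     # finite cutoff stands in for infinity
    print(n, np.sum(1.0 / m**2))             # tail sum, roughly 1/n
```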
The next example is somewhat counterintuitive. It demonstrates that pointwise convergence does not, in general, imply convergence in mean square. Even though X(n;ζ) → X_0(ζ) for all ζ ∈ S (i.e., the random variable converges pointwise), the integral in the computation of E[|X(n;ζ) - X_0|²] may diverge, so that X(n) does not converge to X_0 in mean square.

Example 11-18: Consider the probability space (S, B, P), where S = [0, 1], B is the collection of Borel sets (B is the σ-algebra generated by the open intervals on S; see Chapter 1 of these class notes), and

$$P[B] = \int_{B} d\zeta, \qquad B \in \mathcal{B}$$   (11-205)

(if B is an interval, then P[B] is the interval length; P can be thought of as a "generalized length" of event B). For ζ ∈ S, define the random variable sequence

$$X(n;\zeta) = n\, I_{[\frac{1}{n},\frac{2}{n}]}(\zeta) = \begin{cases} n, & \tfrac{1}{n} \le \zeta \le \tfrac{2}{n} \\ 0, & \text{otherwise} \end{cases}$$   (11-206)

(note that I_B(ζ) is called the indicator function). On S, X(n) converges to zero in a pointwise manner; we say that X(n) → 0 pointwise (p.w.). Sometimes, we say that X(n) converges everywhere, or surely. However, the sequence X(n) does not converge to zero in the mean-square sense, since

$$\|X(n) - 0\|^2 = E\bigl[X(n)^2\bigr] = n^2\left(\frac{2}{n} - \frac{1}{n}\right) = n.$$   (11-207)
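A quick simulation makes the point vivid: every simulated sample path of X(n;ζ) is eventually zero, yet the estimated second moment grows like n. The trial count below is arbitrary.

```python
# A minimal sketch of Example 11-18: sample zeta uniformly on [0,1]; then
# X(n;zeta) = n on [1/n, 2/n] and 0 elsewhere.  Every sample path goes to zero,
# yet E[X(n)^2] = n grows without bound, so there is no mean-square convergence.
import numpy as np

rng = np.random.default_rng(6)
zeta = rng.random(2_000_000)
for n in (10, 100, 1000):
    X_n = np.where((zeta >= 1.0 / n) & (zeta <= 2.0 / n), float(n), 0.0)
    print(n, np.mean(X_n ** 2))      # Monte Carlo estimate of E[X(n)^2], close to n
```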
Venn Diagram Describing Convergence Modes

[Figure 11-8: Relationship between modes of convergence.]

Figure 11-8 shows a Venn diagram that depicts the interrelationships between i.p., m.s., a.s., and p.w. convergence. The diagram follows directly from the definitions, theorems and counterexamples given in this chapter. Mean-square convergence neither implies, nor is implied by, a.s. convergence; see Examples 11-16 and 11-17 for the relevant counterexamples. The fact that p.w. convergence does not imply m.s. convergence is established by Example 11-18. Theorem 11-12 (alternatively, Theorem 11-16) establishes that m.s. (alternatively, a.s.) convergence implies i.p. convergence.