Johns Hopkins University
What is Engineering?
Michael Karweit
Uncertainty
Many physical processes, like the falling of a ball under the influence of gravity,
or the motion of the sun across the sky, or the acceleration of air across the surface of an
airplane wing, can be modeled explicitly as a set of equations. Then, if we want to
predict, for example, the distance a ball falls in one second under gravity when it starts
from rest, we simply plug in the value for gravity $g = 9.8\ \mathrm{m/s^2}$ and t = 1 s, and use the
predictive equation $s = \tfrac{1}{2} g t^2$ to obtain s = 4.9 m. At this level of sophistication, this is a
deterministic process, i.e., we can specify the parameters and initial conditions and,
then, calculate the results.
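As a minimal sketch of what "deterministic" means here, the falling-ball formula can be evaluated directly: the same inputs always give the same output. The function name `fall_distance` is only an illustrative choice, and air resistance is ignored.

```python
def fall_distance(t, g=9.8):
    """Distance in metres fallen from rest after t seconds (no air resistance)."""
    return 0.5 * g * t**2


# Same parameters and initial conditions, same answer every time.
print(fall_distance(1.0))  # 4.9
```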
Let’s contrast this with another physical process: that of rolling a die. Now, what
is the predictive equation for the outcome? In principle, we could come up with a set of
equations but they would be horrendously complicated and not at all practical. So, in effect,
we have a process whose outcome is unpredictable. When a process produces an
outcome that is almost completely unpredictable, we say it is a random process or
stochastic process.
The distinction between a deterministic process and a stochastic one is not
absolute. In our rolling die example, we could produce a set of equations to model its
motion. But the equations would depend critically on a number of parameters, whose
specifications might be almost impossible to prescribe. For example, the initial
orientation of the die, its initial release velocity and angular momentum, its height above
the table, the characteristics of the table’s surface would all play critical roles in how the
die would come to rest. Even the geometry would be important—whether the edges were
sharp or rounded. Not only rounded, but how much rounded. In practice you could
never prescribe these parameters with sufficient accuracy to ensure that the equations
would predict the outcome. So, in practice, the process is random.
(But sometimes, an apparently-perfect random process is not so perfect. The
roulette wheel in gambling casinos is supposed to be perfectly random. In this apparatus
a small ball is supposed to fall with equal probability in one of the 38 numbered slots in
the wheel. But some years ago, a graduate student, Albert Hibbs, at the University of
Chicago and visiting Las Vegas, began recording the slots into which the ball fell on one
roulette wheel. After several days, he realized that the wheel favored some slots over
others, i.e., it was imperfectly made. So with some strategic wagering he was able to
“beat the odds”, and, additionally, made a name for himself.)
How completely we model a physical process will determine how well we can
predict behavior. Usually we try to include those parameters that have large and
systematic influences. That is, we try to include parameters that affect the process in a
significant and expected way. In our first example of a falling ball, our model was the
simple relationship $s = \tfrac{1}{2} g t^2$. One of the parameters neglected in this model is air
resistance. If we were to use this simple equation as a predictor of an actual experiment,
we would always overestimate distance, because the neglected ingredient always acts to
slow the ball. That is, the element that we neglected had a systematic effect.
At some point we stop trying to include additional parameters in our model
because the model becomes too cumbersome. And we hope that the remaining
parameters will have a relatively small influence on prediction. Further, we hope that
these unmodeled parameters will partially cancel one another out, i.e., they won’t all act
to affect the process in one direction. (Eventually, there is a limit beyond which
deterministic modeling is not possible. You’ve probably heard of the Heisenberg
uncertainty principle.) Often a process is considered mostly deterministic or mostly
random based on practicality. Is it worth it to expand the model to include additional
details?
Let’s take the example of water pressure in a municipal water system. If, at some
point within the system, you were to measure pressure as a function of time you would
discover it fluctuates, sometimes smoothly, sometimes wildly. In principle, one could
model the system and make predictions from it, but the system is so complex that it
would not be practical to do so in detail. Pump pressure, pipe diameter, and pipe
roughness could be relatively easily included in a model. Those parameters are relatively
constant. But water pressure also depends on flow rate. So, every time someone turns on
a faucet or flushes a toilet, it has an effect. And it would be impossible to monitor all
faucets in the system to be able to include their individual impacts on the system. So,
what to do?
One answer is to predict “in the average”. That is, make some assumptions about
the distribution of faucets turned on and toilets flushed by time of day, and use a
measure or statistic to characterize a “typical” situation, perhaps by time of day or day of
week. Then use this statistic to approximate the actual conditions in the model.
Predictions resulting from an averaged input will be imperfect, but they could be
reasonably useful. (Actually one of the more interesting problems for water-works
engineers and one that defies getting usable predictions from “averaged” inputs is the
Super Bowl problem. In a period of about twenty minutes during Super Bowl half-time, 50
million toilets are flushed throughout the U.S.—a once a year occurrence. This wreaks
havoc with municipal water systems.)
Uncertainty and randomness enter the engineering world in yet another way:
measurement. If fifty people were given meter sticks and you asked them to measure the
length of a soccer field, you would get fifty different answers. They might be closely
clustered, but they would be different. Why? The soccer field isn’t changing in size.
The answer is that errors are introduced in taking the measurement. Maybe the meter
sticks are not all exactly one meter long. Maybe the tick marks on the meter sticks were
read incorrectly. Maybe the number of meter stick lengths along the field was
miscounted. Maybe the meter sticks were not laid in a straight line. There are a lot of
“maybes”.
So, with fifty different answers, how long is the soccer field? Really, we can’t
tell. But we can estimate its most likely length by taking the average value of all the
measurements. So, let’s say that average value is 112 m. How confident are we that the
actual length is 112m? If all fifty measurements lie between 111.5m and 112.5m, we’re
pretty confident. But, if the fifty measurements lie between 100m and 120m, we would
be far less sure. So, the spread or distribution of values makes a difference in the
confidence of our estimate. Can we actually quantify measurement confidence? How
can we deal with non-deterministic quantities? How can we characterize random
processes or distributions of outcomes? These are all questions vital to engineering. And
they are addressed in terms of probability and statistics.
DISTRIBUTIONS
With random processes we can never predict a specific outcome—that’s what
makes it random. But we might be able to deduce the likelihood that a particular
outcome will occur. That can be very helpful. But determining this likelihood requires
knowledge of the distribution of possible outcomes. Sometimes we can infer what the
distribution is; other times we cannot. In the case of a “perfect” die, we presume that
each side of the die is equally likely to land face up. So, 1/6th of the time we would
expect to find a “2” face up, for example. And the same would be true for each of the
other numbers. Another way of saying it is that the probability of getting a “2” on any
one roll of the die would be 1/6 or 0.16667.
Probability is the likelihood that an event will occur, or a particular outcome will
occur. Probabilities always lie between 0.0 and 1.0. If the probability of an event is 0.0,
that means it will never occur. If the probability of an outcome is 1.0, that means it is
certain to occur.
What is the probability that a “1” or a “2” or a “3” or a “4” or a “5” or a “6” will
occur in one roll of the die? Since this event encompasses every possible outcome, its
probability must be 1.0 (presuming that the die cannot end up on an edge or corner).
That is, the sum of probabilities of all possible outcomes is 1.0. This fact allows us to
define a probability distribution function f(n), where n is a particular outcome. For a
rolled die f(1) = f(2) = f(3) = f(4) = f(5) = f(6) = 0.16667. A plot of this function looks
like this:
[Figure: bar plot of f(n) versus n for n = 1 through 6; every bar has height 0.16667.]
This is a uniform distribution or flat distribution, i.e., each outcome is equally
likely to occur. And, since we have scaled the values so that the sum of the heights of the
rectangles: 6 * 0.16667 = 1.0, this plot can be thought of as a probability distribution
n N
function. Mathematically it can be written as
 f ( n )  1 .0 .
This is a property of
n 1
probability distribution functions.
An event can also be more complicated—for example the rolling of two dice.
Then we might define the outcome as the sum of the spots on the two dice. In this case
there are 6 x 6 = 36 possible ways the dice can land, each equally likely. But in those 36
ways, there are only 11 possible outcomes: the values 2 through 12. But each of these
values is not equally likely to occur. There is only one way to obtain a 2—when both
dice show a “1”. But there are four ways of obtaining a 5: (1,4), (4,1), (2,3), (3,2). So, if
every combination of faces is equally likely to occur, one would expect a 5 to occur
4/36 of the time and a 2 to occur 1/36 of the time. Again, it’s useful to plot the
probability distribution function:
[Figure: bar plot of f(n) versus n for n = 2 through 12; the tallest bar has height 0.16667.]
And, again, because this plot includes every possible outcome, the sum of the heights of
the rectangles is one.
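The two-dice distribution can be built by brute-force enumeration of the 36 equally likely combinations. The sketch below uses exact fractions so the probabilities can be checked against the counts in the text; the function name `two_dice_distribution` is an illustrative choice.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Map each possible sum (2..12) to its exact probability."""
    # Count how many of the 36 (die1, die2) combinations give each sum.
    ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
    return {total: Fraction(count, 36) for total, count in sorted(ways.items())}


dist = two_dice_distribution()
for total, prob in dist.items():
    print(total, prob)

# The heights of all the rectangles add up to one.
print("sum of probabilities:", sum(dist.values()))
```

A sum of 7 is the most likely outcome because six of the 36 combinations produce it.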
The rolling of dice is an example of a random process which produces discrete
outcomes, i.e., outcomes which one can enumerate. There are also random processes
which produce non-denumerable or continuous outcomes, e.g., the angle at which a
spinner stops turning. Here, the probability that the spinner will stop at, say, exactly
30.0123456° is essentially zero. The reason is that 30.0123456° is only one of an infinite
number of possible values. So, how can this be useful? The answer is that if we specify
a range of values over which the spinner may stop, then the probability becomes nonzero.
For example, the probability that the spinner will stop between 30° and 31° is 1/360.
With continuous outcomes, we no longer speak of probability distribution
functions, but rather of probability density functions. And, we can no longer scale the
sum of all possible outcomes f(n) as $\sum_{n=1}^{N} f(n) = 1.0$ because we cannot enumerate
individual outcomes. But we can write the equivalent expression using calculus. Let x
be a continuum of outcomes and f(x) be the probability density function of the occurrence
of x; then, over all possible values of x, the integral $\int_{-\infty}^{\infty} f(x)\,dx = 1.0$. The probability
density function for the spinner is uniform and would be plotted as follows:
[Figure: f(x) versus x, a flat line at height 1/360 from x = 0 to x = 360.]
In the case of discrete outcomes, the sum of heights of the rectangles must add to
1.0. In the case of continuous outcomes, the area under the curve must equal 1.0. Then,
the probability of obtaining a value between x = a and x = b is the area under the curve
between a and b.
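For the uniform spinner the area under the curve between a and b is just a rectangle, so the probability can be computed without any calculus. This is a small sketch under that assumption; `spinner_probability` is a hypothetical helper name, and angles are in degrees.

```python
def spinner_probability(a, b, full_circle=360.0):
    """P(a <= x <= b) for a uniform density of height 1/full_circle on [0, full_circle]."""
    lo = max(0.0, min(a, b))
    hi = min(full_circle, max(a, b))
    width = max(0.0, hi - lo)
    # Area of a rectangle: width times height (1 / full_circle).
    return width / full_circle


print(spinner_probability(30, 31))                  # a one-degree range
print(spinner_probability(30.0123456, 30.0123456))  # a single point has zero width
print(spinner_probability(0, 360))                  # the whole circle
```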
n and x in the discussion above are called random variables. Their values are
distributed according to the generating random process.
Depending on the underlying random process, probability distribution or density
functions can take on many forms. One of the more well-known is this one:
[Figure: bell-shaped curve of f(x) versus x.]
This is often called a bell-shaped distribution. More formally, it is known as a
Gaussian distribution or normal distribution. It can be skinny or wide, but the area
under the curve is always 1.0. This function is extremely important in engineering.
We’ll discuss it in detail a little later.
One of the difficulties in studying random processes is that we almost never know
what the probability density function (pdf) is. Sometimes we have a general idea about
its shape, but not much more. In fact, to learn more, we usually infer its characteristics
by taking sample outcomes. From these data, we try to estimate its form. Remember,
there is some underlying process whose outcomes are probabilistically distributed. It is
the characteristics of that underlying process that we are trying to determine. (In the case
of Albert Hibbs, everyone’s initial expectation was that the roulette wheel had a flat or
uniform distribution function, i.e., each number on the wheel was equally likely to occur.
But Hibbs collected data and concluded that the distribution function was not perfectly
uniform—some numbers were more likely to occur than others. Based on his estimate of
the wheel’s distribution function, he was able to improve his odds of winning.)
Deducing pdfs, however, is not easy, because they don’t necessarily have
analytical forms, i.e., they may not be explicitly expressible as mathematical functions.
Nevertheless, we can learn something about a pdf’s characteristics by exploring its
moments. It turns out that every pdf can be expressed in terms of an infinite number of
parameters called moments.
In general, moments denote the effect of something which is applied at a distance.
For example, in physics there is the concept of torque—the twisting effect of applying a
force at the end of a lever arm. If the force is applied at right angles to the lever arm,
then the torque is T = r * F, the product of the length of the lever arm r and the magnitude of the force F.
Torque is a moment.
In probability and statistics the idea is the same, except that the “something” is an
outcome, and the “distance” is how far that outcome is from zero or an average value.
Statistical moments characterize the shape of the distribution. For discrete and
continuous random processes, respectively, the Pth moment is defined as:
$$m_P = \sum_{i=1}^{N} f(x_i)\,(x_i - \mu)^P, \qquad m_P = \int_{-\infty}^{\infty} f(x)\,(x - \mu)^P\,dx,$$
where $\mu = \sum_{i=1}^{N} f(x_i)\,x_i$ and $\mu = \int_{-\infty}^{\infty} f(x)\,x\,dx$, respectively.
For the discrete cases, recall that N is the total number of possible outcomes.
$m_P$ is called the Pth moment about the mean. $\mu$ is the mean or the average value of
x. In fact, $\mu$ is the 1st moment about zero. Each $m_P$ emphasizes a feature in the
distribution of f(x). So, if we knew the values of all the mPs, we could deduce the actual
probability density function f(x). Knowing all the moments tells us everything there is to
know about f(x). But, there are two problems: first, there are an infinite number of
moments; second, all we will have at our disposal is a set of sample outcomes produced
by the underlying random process.
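When the distribution is known, the discrete definitions can be applied directly. The sketch below does this for the fair die introduced earlier, using exact fractions; the helper names `mean` and `central_moment` are illustrative choices.

```python
from fractions import Fraction


def mean(pmf):
    """First moment about zero: the sum of f(x_i) * x_i."""
    return sum(p * x for x, p in pmf.items())


def central_moment(pmf, order):
    """Moment of the given order about the mean: the sum of f(x_i) * (x_i - mu)**order."""
    mu = mean(pmf)
    return sum(p * (x - mu) ** order for x, p in pmf.items())


# Fair die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print("mean:", mean(die))
for order in (1, 2, 3, 4):
    print("moment", order, "about the mean:", central_moment(die, order))
```

The first moment about the mean is zero by construction, and the third is zero here because the die's distribution is symmetric.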
It turns out that the first problem is not so serious, because in most applications,
almost everything we would like to know about a pdf is contained in the first several
moments. In fact, only rarely are we interested in more than the first four moments. The
second problem is a little more serious, because the best we can do is estimate what those
moments are. We will never be able to really know what they are, but with enough sampling,
we can estimate them arbitrarily closely.
So, how do we proceed? First, suppose we have obtained N sample outcomes
from some random process. (Note, that we will now use N to denote the number of
samples, not the number of possible outcomes, as before.) So, we have a list of $x_i$’s
for i = 1, ..., N. The first moment that we’re interested in is the 1st moment about zero, i.e.,
the mean. To find the mean, we use the formula: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. What’s going on here?
First, why is the mean denoted by $\bar{x}$ and not $\mu$? Second, why is the formula so different
from what was given before? (Actually, this is the equation that probably looks most
familiar.) The answer to the first question is that we can’t calculate $\mu$. It’s a property of
the underlying pdf. We can only estimate $\mu$ from our sample data. We denote that
estimate as $\bar{x}$. Our expectation is that $\bar{x}$ is close to $\mu$. In fact, we’ll even be able to
calculate how close it’s likely to be. Now the second question. In our original formula
for calculating $\mu$, we considered all possible values of x, and we “weighted” them by
their probability of occurring. In adding up (or integrating) all weighted values, we
arrived at $\mu$. In calculating $\bar{x}$, however, the probability of getting a certain value of x has
already been taken into account by the sampling procedure. x’s with low probabilities of
occurring, are not found very often in the sample. So, with the relative distribution of x’s
already accounted for, adding the unweighted samples is equivalent. And, since we’re
interested in the “per sample” average of x, we divide that sum by N.
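A short sketch may make the equivalence concrete: averaging the raw samples gives the same number as weighting each distinct value by how often it appeared in the sample. The list `rolls` is made-up illustrative data, not a measurement from the text.

```python
from collections import Counter


def sample_mean(samples):
    """Unweighted average: add the samples and divide by N."""
    return sum(samples) / len(samples)


def frequency_weighted_mean(samples):
    """Weight each distinct value by its relative frequency in the sample."""
    n = len(samples)
    return sum((count / n) * value for value, count in Counter(samples).items())


rolls = [2, 6, 3, 3, 1, 5, 3, 2, 4, 6]  # hypothetical die rolls

print(sample_mean(rolls))
print(frequency_weighted_mean(rolls))
```

The relative frequencies play the role that f(x) plays in the formula for the true mean, which is why the sampling procedure "already accounts" for the weighting.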
So, what can we do with our new-found knowledge, an estimate of the mean?
Not a lot. It simply gives us a zero reference point around which outcomes from the
random process will occur. To illustrate, here are some pdfs, all with the same mean $\mu$ = 10.
[Figure: several plots of f(x) versus x with different shapes, each with mean 10.]
We begin to get some really useful information with $m_2$, the second moment
about the mean. This moment also has a special name: the variance. Here the
calculation is $s^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$, where $s^2$ is a sample estimate of the true variance $\sigma^2$.
The variance is a measure of the spread of the distribution. Often we’re most interested
in $\sqrt{s^2}$, the square-root of the variance. This is called the standard deviation (s.d.) or
standard error, depending on the application. Notice that $\sqrt{s^2}$ has the same units as x,
i.e., if x is in meters, then the standard deviation is in meters; if x is a score on an exam,
then, the standard deviation is a score.
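The variance formula with the 1/N factor, as written above, can be sketched in a few lines. The measurement values are invented for illustration. Python's standard `statistics.pvariance` and `statistics.pstdev` also divide by N, so they are used here as a cross-check.

```python
import math
import statistics


def sample_variance(xs):
    """s^2 = (1/N) * sum of (x_i - xbar)^2, using the 1/N form from the text."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / n


# Hypothetical field-length measurements in metres.
lengths = [111.8, 112.1, 112.0, 111.9, 112.2]

s2 = sample_variance(lengths)
print("variance (m^2):", s2)
print("standard deviation (m):", math.sqrt(s2))
print("cross-check:", statistics.pvariance(lengths), statistics.pstdev(lengths))
```

Note the units: the variance is in square metres, while the standard deviation is back in metres, the same units as the measurements.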
In some cases we may not care to know much more about the pdf. For example,
we might want to know how well we did on an exam. Here we are usually satisfied with
how well we did with respect to the class average. Scores on an exam can be considered
outcomes of a random process whose distribution is usually “bell-shaped”. So, knowing
the class average and the standard deviation is information enough to make that
determination. If your score is, say, 2 s.d.’s above the class average, then you know you
did very well, because scores more than 2 s.d.’s above the average occur only about 2% of the time. But that last
tidbit of information arises from your assuming that the distribution is “bell shaped”. So,
you could be wrong.
There are two more moments that are very helpful in characterizing a distribution
function: m3 and m4. These two moments by themselves however can be difficult to
interpret, because they have units of $x^3$ and $x^4$, respectively. So their magnitudes will
depend on the units of the random variable x. To make these moments a bit more useful,
it is customary to non-dimensionalize them, i.e., normalize them with respect to another
parameter—the variance. Normalizing these two moments gives us the non-dimensional
statistics skewness and kurtosis:
$$\text{skewness} = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^3}{(s^2)^{3/2}}, \quad \text{and} \quad \text{kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^4}{(s^2)^2}.$$
The skewness is a measure of symmetry. A distribution with zero skewness will
be perfectly symmetric. If the skewness is non-zero, the magnitude of the skewness
indicates how lopsided the distribution is. Notice, that we wouldn’t be able to make that
interpretation if only m3 were used, because different units for x would produce different
values for m3 even if they came from the same underlying distribution. Nondimensionalizing parameters is a very useful practice in statistics in particular, and in
engineering in general.
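The following sketch computes skewness and kurtosis as defined above and illustrates the point about units: rescaling the data (say, metres to centimetres) changes the raw third and fourth moments but leaves the normalized statistics unchanged. The data values and helper names are illustrative assumptions.

```python
def central_sample_moment(xs, order):
    """(1/N) * sum of (x_i - xbar)**order."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** order for x in xs) / n


def skewness(xs):
    """Third moment about the mean divided by (s^2)^(3/2)."""
    return central_sample_moment(xs, 3) / central_sample_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Fourth moment about the mean divided by (s^2)^2."""
    return central_sample_moment(xs, 4) / central_sample_moment(xs, 2) ** 2


data_m = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical values in metres
data_cm = [100.0 * x for x in data_m]   # the same values in centimetres

print("m3 in m^3 vs cm^3:", central_sample_moment(data_m, 3), central_sample_moment(data_cm, 3))
print("m4 in m^4 vs cm^4:", central_sample_moment(data_m, 4), central_sample_moment(data_cm, 4))
print("skewness:", skewness(data_m), skewness(data_cm))
print("kurtosis:", kurtosis(data_m), kurtosis(data_cm))
```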
Finally, we have kurtosis. This is considered to be a measure of “peakedness”,
i.e., how “pointy” the distribution is. For reference, the kurtosis of a Gaussian
distribution is 3.0, a wonderful item of statistical trivia. And, if you want to add to your
statistics vocabulary, distributions are called leptokurtic (kurtosis greater than 3),
platykurtic (kurtosis less than 3), or mesokurtic (kurtosis equal to 3, like the Gaussian).
There are, of course, an infinite number of additional moments to consider. But
knowing the mean and variance is often enough to make engineering predictions and
decisions. Remember these ideas and formulas.
Now, let’s apply some of these ideas. A little while ago I mentioned the problem
of measuring the length of a soccer field with only a meter stick. Let’s explore this
problem in some detail. Let’s presume that every time you lay down the meter stick you
record your length. But, in fact, because of slight misplacements, the actual distance for
each measurement i is $d_i = d_i^{(\text{meas})} + e_i$, where the measurements are in cm, and $e_i$ is a
random error. And, let’s say that the random error is –1cm 30% of the time, 0cm 40% of
the time, and +1cm 30% of the time. One meter at a time, you measure from one end of
the soccer field to the other, and you discover it’s 110.53 m long. At least, your
measurements indicate that it’s 110.53 meters long. But how long is it actually? One
way to tell is by carrying out the measurement again, and again, and again.
We can simulate this problem with our statistics applet at www.jhu.edu/virtlab.
First, let the random variable x in the applet be used to represent the error. Then define
the distribution of x in terms of the individual values as
x        -1     0     1
Pr(x)    0.3   0.4   0.3
This is our definition of the error every time we take a measurement with the meter stick,
i.e., our measurement will be in error by –1cm, 0 cm, or 1 cm. Now, we need to see what
effect this has on our total measurement of the soccer field. That requires, say, 111
measurements. So we can define a total error w, a random variable based on x, as w =
sum(x,111). This expression will add together 111 realizations of the random variable
x. w will be the total error of our measurement. Because this is a
computer simulation, it is easy to carry out our “measurement” 1000 times. So, set the
number of realizations to 1000, and click on draw. What do you get? A fairly broad
distribution of errors. Recall, that this is the distribution of the sum of 111 individual
errors added together. What is the average value of this distribution? It is probably close
to zero. And what about the spread or standard deviation? That’s probably about 8 cm.
Now, click on the button “Normal curve”. This will produce a Gaussian curve with the
same mean and standard deviation as the displayed distribution. It looks pretty close. So,
the sum of 111 errors which are distributed as –1, 0, +1 is approximately a Normal
distribution. Interesting.
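For readers without access to the applet, the same experiment can be approximated with Python's standard library. This is a sketch, not a reproduction of the applet: the function names and the fixed seed are assumptions made for repeatability. Since each step error has variance 0.3 + 0.3 = 0.6 cm², the sum of 111 independent errors should have a standard deviation near the square root of 66.6, about 8.2 cm.

```python
import random
import statistics


def total_error(rng, n_steps=111):
    """Sum of n_steps step errors, each -1, 0 or +1 cm with probability 0.3, 0.4, 0.3."""
    return sum(rng.choices([-1, 0, 1], weights=[0.3, 0.4, 0.3], k=n_steps))


def simulate(n_trials=1000, seed=0):
    """Repeat the whole field measurement n_trials times; return the total errors in cm."""
    rng = random.Random(seed)
    return [total_error(rng) for _ in range(n_trials)]


errors = simulate()
print("mean total error (cm):", statistics.mean(errors))
print("s.d. of total error (cm):", statistics.pstdev(errors))
```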
Usually, we are required to include some indication of accuracy when we report a
measurement. That report is usually the measurement ± one standard deviation. ± one
standard deviation is often called the standard error. Thus we would report the
length of the soccer field as 110.53 m ± 8 cm.
That’s not bad accuracy for measuring the length of a soccer field: a standard
error of ± 8 cm. But, could we do better? In the case we just calculated, a single
measurement has a standard error of ± 8 cm. Would our measurement improve if we were
to take an average value of several measurements? What we mean by improvement is
that the standard error would be less. Again, we can use our random variable simulator to
get an answer. First, let’s assume that the distribution of errors for our total measurement
is Gaussian distributed with a mean of zero and a standard deviation of 8. That’s roughly
what we discovered in the first simulation. Let’s begin all over again, this time defining a
random variable x as being Gaussian distributed with mean zero and standard deviation 8.
Now we want to use w as the average value of a number of measurements, say 10. So,
we define w = sum(x,10)/10. sum(x,10) produces the sum of ten realizations of the total
error. Since we want the average error over those 10 realizations, we must divide by 10.
Carry out this calculation, say, 1000 times, and plot, as before.
What do we get? Good news. The standard deviation of the error in taking an
average of 10 measurements is 2.5cm. What happens if we take an average over 100
measurements? Try it. The standard deviation of the error in taking an average of 100
measurements is 0.8 cm. So, the more measurements we average over, the smaller the
standard error. If you plot standard error as a function of M, the number of
measurements in the average, you will discover that the standard error is proportional to
$1/\sqrt{M}$.
In fact, it’s not too difficult to prove that the s.d. of an average of M values of x is $\sigma/\sqrt{M}$, where $\sigma^2$ is the
variance of the random variable x. So, let’s do it. To make the algebra easier, let’s define
a new random variable $y = x - \bar{x}$. That means that the variance of y will be the same as
that of x, but we can estimate it with the more compact formula $s^2 = \frac{1}{N}\sum_{i=1}^{N} y_i^2$. Suppose,
now, that M $y_i$’s are averaged together. We’ll label them $y_1, y_2, \ldots, y_M$. Then, we could
write
$$\operatorname{var}\bar{y} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{y_{1,i} + y_{2,i} + \cdots + y_{M,i}}{M}\right]^2.$$
Note that this equation says that we will take N realizations, each of M samples of $y_i$ which are then averaged together. We
expand this to obtain:
$$\operatorname{var}\bar{y} = \frac{1}{M^2}\left[\frac{1}{N}\sum_{i=1}^{N} y_{1,i}^2 + \frac{1}{N}\sum_{i=1}^{N} y_{2,i}^2 + \cdots + \frac{1}{N}\sum_{i=1}^{N} y_{M,i}^2\right] + \frac{2}{M^2}\left[\frac{1}{N}\sum_{i=1}^{N} y_{1,i}\,y_{2,i} + \frac{1}{N}\sum_{i=1}^{N} y_{1,i}\,y_{3,i} + \cdots + \frac{1}{N}\sum_{i=1}^{N} y_{M-1,i}\,y_{M,i}\right].$$
The first set of terms contains all the squared values of $y_i$; the second set of terms contains
all the cross products of $y_i$. Each of the terms in the first set of brackets is nothing other
than the variance of $y_i$. There are M of them, so the first set of terms reduces to $M s^2 / M^2 = s^2/M$. But, what is the second
set? What, for example, is $\frac{1}{N}\sum_{i=1}^{N} y_{1,i}\,y_{2,i}$? Recall that the $y_i$’s are random variables
having zero mean. Most importantly, $y_{1,i}$ is picked or sampled totally independently of
$y_{2,i}$, i.e., all the $y_i$’s are independent samples. This means that the value of one $y_i$ is
uncorrelated with any other $y_i$. So, every crossproduct of $y_i$’s averages to zero. And all
the terms in the second set of brackets are zero. The final result is then: $\operatorname{var}\bar{y} = s^2/M$.
This is the result we obtained by simulation. Actually what we’ve just said is slightly
wrong. We really should say that the estimated $\operatorname{var}\bar{y} = s^2/M$. Or we could say
the $\operatorname{var}\bar{y} \approx \sigma^2/M$. These appear to be subtle distinctions, but in the study of statistics, these
distinctions are very important. In almost all engineering applications, you will never
know $\mu$ or $\sigma$; they will have to be estimated from the data.
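The averaging experiment can also be sketched without the applet. The code below assumes, as in the text, that each total measurement error is Gaussian with mean zero and standard deviation 8 cm, then compares the simulated spread of an M-measurement average with 8 divided by the square root of M. The function name and the fixed seed are illustrative assumptions.

```python
import math
import random
import statistics


def sd_of_average(m, sigma=8.0, n_trials=1000, seed=0):
    """Simulated s.d. of the average of m independent Gaussian errors (mean 0, s.d. sigma)."""
    rng = random.Random(seed)
    averages = [
        sum(rng.gauss(0.0, sigma) for _ in range(m)) / m
        for _ in range(n_trials)
    ]
    return statistics.pstdev(averages)


for m in (1, 10, 100):
    print(m, "simulated:", sd_of_average(m), "predicted:", 8.0 / math.sqrt(m))
```

With only 1000 realizations the simulated values scatter by a few percent around the prediction, which is itself an example of estimating a quantity from samples.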
Mathematical notation and properties of averaging
Using $\frac{1}{N}\sum$ and subscript notation is a fairly cumbersome way to indicate
average value. A more convenient notation is to use the overbar, i.e., $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$. Then,
by knowing a few simple mathematical rules, we can simplify some of our random
variable expressions considerably. In what follows, suppose that constants are
represented by upper-case letters and random variables are represented by lower-case
letters.
Suppose we want the average value $p = \frac{1}{N}\sum_{i=1}^{N} (M + e_i)$. Using our overbar
notation we would write this as $p = \overline{(M + e)} = \overline{M} + \overline{e} = M + \overline{e}$. When elements are added
under an overbar, we can separate the terms into two separate averages, because the
average of a sum (in this case the average of M + e) equals the sum of the averages,
a consequence of the associativity of addition. And, since M is a constant, $\overline{M}$ is just M.
There are also two simple rules for multiplication within an overbar. One of them
is $\overline{Me} = M\overline{e}$. That is, the average of a constant times a random variable is a constant
times the average of that random variable. The second rule of multiplication is that $\overline{ef}$,
the average of the crossproducts, cannot be mathematically separated. But, there are
some things we can say about its value. Suppose e and f are random variables, each with
zero mean. If e and f are statistically independent of one another, then $\overline{ef} = 0$, i.e., the
average cross-product of independent random variables with zero means is zero. On the
other hand, if e and f have zero means, but are not independent of one another, then $\overline{ef}$ is
the covariance between them. In the special case where e and f are the same variable, we
would get the expression $\overline{e^2}$ which is the variance of e.
This overbar notation is quite standard in those areas of engineering where the
problems themselves contain random quantities, for example, in the study of turbulence.
We'll use the notation a little further on.
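The averaging rules above can be checked numerically. In this sketch e and f are independent Gaussian samples with zero mean and unit variance, and M is a constant; these choices, the sample size, and the seed are assumptions made for illustration. With a finite sample the cross-product average is only approximately zero.

```python
import random


def average(values):
    """The overbar: add the values and divide by how many there are."""
    return sum(values) / len(values)


rng = random.Random(0)
N = 100_000
M = 5.0  # a constant
e = [rng.gauss(0.0, 1.0) for _ in range(N)]  # zero-mean random variable
f = [rng.gauss(0.0, 1.0) for _ in range(N)]  # independent of e

avg_sum = average([M + ei for ei in e])                  # average of (M + e)
avg_scaled = average([M * ei for ei in e])               # average of (M * e)
avg_cross = average([ei * fi for ei, fi in zip(e, f)])   # average of (e * f)
avg_square = average([ei * ei for ei in e])              # average of (e * e)

print("avg(M + e) vs M + avg(e):", avg_sum, M + average(e))
print("avg(M * e) vs M * avg(e):", avg_scaled, M * average(e))
print("avg(e * f), expected near 0:", avg_cross)
print("avg(e * e), expected near the variance 1:", avg_square)
```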
The Gaussian distribution
Why is it that any time we add together a bunch of random variables, the resulting
distribution looks bell-shaped? For example, the score on each question of an exam could
be considered a random variable; the total score for the exam is almost always bell-shaped. If we add together 111 measurements, each of which could contain a random
error, the resulting total error has a distribution which is bell-shaped. If we count the
total number of dots on a throw of 10 dice, the distribution of dots is bell-shaped. What
is even more remarkable is that if we add together the outcomes of 100 random variables,
the sum is almost always bell-shaped no matter what the probability distribution is for each of
the random variables.
Run some experiments using the random function simulator. Define a random
variable x with flat distribution between 0 and 1. Construct a new random variable w as
the sum of ten realizations of x, i.e., w= sum(x,10). Obtain 1000 realizations. And, what
do you get? A distribution that is very close to Gaussian. Define a random variable y as
a Gaussian with a mean of 0 and a standard deviation of 2, and define w =sum(y,10).
Obtain 1000 realizations. And, what do you get? A distribution that is very close to
Gaussian. In fact, construct any random variable x with as wild a probability distribution
as you can think of. Define w as w=sum(x,10). And w will tend to have a Gaussian
distribution.
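The first of these experiments can be sketched in plain Python as a stand-in for the simulator. A sum of ten uniform(0, 1) values has mean 5 and variance 10/12, and if it is close to Gaussian its kurtosis should be close to 3. The function names and the seed are illustrative assumptions.

```python
import random
import statistics


def sums_of_uniforms(n_terms=10, n_realizations=1000, seed=0):
    """Each realization is the sum of n_terms draws from a flat distribution on [0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n_terms)) for _ in range(n_realizations)]


def kurtosis(xs):
    """Fourth moment about the mean divided by the squared variance (1/N forms)."""
    n = len(xs)
    xbar = sum(xs) / n
    m2 = sum((x - xbar) ** 2 for x in xs) / n
    m4 = sum((x - xbar) ** 4 for x in xs) / n
    return m4 / m2 ** 2


w = sums_of_uniforms()
print("mean:", statistics.mean(w), "(expected about 5)")
print("s.d.:", statistics.pstdev(w), "(expected about 0.91)")
print("kurtosis:", kurtosis(w), "(a Gaussian has 3.0)")
```

For comparison, a single flat distribution has a kurtosis of 1.8, so most of the distance to 3.0 is already covered after summing ten terms.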
This remarkable result has been observed for more than 100 years, but only little by little
has the observation been converted into a mathematical theorem. This theorem is the
Central Limit Theorem. In short, the central limit theorem says that if one takes the
sum of outcomes of a set of random variables (with suitable restrictions), the resulting
sum will have a Gaussian distribution. Actually, there are a number of central limit
theorems, each with its own list of restrictions. And, the topic is still open for research.
A Gaussian or normal probability distribution function has the form:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Of course, if you integrate over all x, it integrates to 1.0. The most important element of
this function is that it is fully defined by the mean $\mu$ and the standard deviation $\sigma$. None
of the other moments need be known to estimate this function. Some other properties: it
is symmetric, so we know that all odd moments about the mean are zero. And, since earlier we
mentioned the fourth moment about the mean (kurtosis), we will state here that the
kurtosis of a Gaussian is 3.0. (That’s a genuine factoid.) Sometimes this function will be
referred to as $N(\mu, \sigma)$: a normal distribution with mean $\mu$ and standard deviation $\sigma$. For
example if you read that a variable x is distributed as N(0,1), you should know what that means.
Although the Gaussian is a fairly complicated function to work with, we do have
its exact functional form. This means we can learn anything we need to know about it,
either through mathematical analysis or by tabulation. Earlier I mentioned that if your
score on an exam was two standard deviations above the average, only about 2% of the
class would have a higher score than you. The reason I could make that statement is
because the total score on an exam, being a sum, is Gaussian distributed. And it has been
tabulated that 95% of the area of a Gaussian distribution lies between $+2\sigma$ and $-2\sigma$ from
the mean. Since the Gaussian is symmetric, half of the remaining 5% must lie below $-2\sigma$
and the other half must lie above $+2\sigma$.
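The tabulated areas quoted here can be checked directly. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper names `area_within` and `area_above` are made up for this illustration.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area of N(mu, sigma) lying within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

def area_above(k, mu=0.0, sigma=1.0):
    """Area of N(mu, sigma) lying more than k standard deviations above the mean."""
    return 1.0 - NormalDist(mu, sigma).cdf(mu + k * sigma)

print("within 1 sigma:", round(area_within(1), 4))
print("within 2 sigma:", round(area_within(2), 4))
print("above  2 sigma:", round(area_above(2), 4))
```

The area within two standard deviations comes out a little above 95%, and the area above $+2\sigma$ a little above 2%, which is where the "only about 2% of the class" statement comes from.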
Now, think about this question: Suppose you measure the length of a soccer field
10 times and take the average value as your best estimate. What’s the distribution of the
average value? Of course, since we’re talking about Gaussian distributions, you’ll say
“Gaussian”. (And, you’d be right.) But, think about what an average is. It’s the sum of a
sequence of values divided by a constant. Since it’s a sum, its variation will tend to be
Gaussian. And, we noted earlier that the variance of an average is $\sigma^2/N$, where $\sigma$ is the
standard deviation of each individual element and N is the number of elements in the
sum.
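A quick numerical check of the $\sigma^2/N$ rule is sketched below. The true length of 100 m, the single-measurement standard deviation of 0.10 m, and the Gaussian error model are assumptions chosen only for illustration.

```python
import random
import statistics

def variance_of_average(sigma, n, trials=5000, true_length=100.0, seed=0):
    """Variance, over many repeated experiments, of the average of n noisy measurements."""
    rng = random.Random(seed)
    averages = [
        statistics.fmean(rng.gauss(true_length, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(averages)

sigma = 0.10  # assumed standard deviation of one measurement, in meters
print("single measurement:", variance_of_average(sigma, 1), "expected", sigma ** 2)
print("average of 10:     ", variance_of_average(sigma, 10), "expected", sigma ** 2 / 10)
```

The second variance should come out close to one tenth of the first.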
The present discussion should also make it a little clearer why we might denote
the quality of a measurement D by expressing it as $D \pm \sigma$, an expected value plus or
minus the standard error. Since any measurement is likely to be contaminated by any
number of contributing errors, the total error in D is likely to be Gaussian distributed.
That means that a measurement of, say, 100 m ± 10 cm will incorporate the true value 68%
of the time, which is the area under a Gaussian curve between $-1\sigma$ and $+1\sigma$.
The Gaussian curve is certainly the most important in science and engineering.
Unfortunately, it is not universal. There are a number of random processes that generate
other probability distributions, e.g., a Poisson process.
And life is a little more complicated than we’ve led you to believe. Recall, we
don’t know $\mu$ and $\sigma$. We can only estimate them with $\bar{x}$ and $s^2$. So, with uncertainties in $\mu$
and $\sigma$, we can’t really justify the precise probabilities that we’ve mentioned. A more
thorough study of probability and statistics would show us how to deal with the problem.
ESTIMATION AND VARIANCE
Let’s talk a bit more about estimation. Estimation is the process of trying to
determine properties of a population by sampling. Almost always, this consists of taking
samples—whether it’s determining the distribution of spots on a pair of rolled dice or
determining the errors in measuring the length of a soccer field. The idea is to collect
data whose characteristics most closely match those of the population. Then, we can
calculate sample statistics which we hope will represent the population as a whole.
As we have seen, some ways are better than others. What do we mean when
we say “better”? We mean that the variance of our statistic is smaller. The goal is
always to arrive at a sample statistic whose variance is smallest. Every estimate we make
has a variance. We can even talk about the variance of the sample variance. That is,
when we estimate the variance of a random variable by the sample variance $s^2$, that value of $s^2$ is
just one value from a distribution of possible sample variances. So, sampling strategies can
be important.
In finding the length of a soccer field, for example, it wasn’t so much the
sampling strategy, but rather how to use the data. Recall that the variance of a single
measurement was $s^2$, but the variance of an average of M measurements was $s^2/M$.
So, we can reduce variance by taking an average.
In some situations, we have little choice for a sampling strategy. For example, if
we want to find the distribution of outcomes from a roulette wheel, there is not much
choice but to spin and record, spin and record, spin and record.
But, there are some instances where one does have a choice. Consider the
following schematic:
[Figure: schematic of the plot of land, showing a region labeled A and several regions labeled B.]
This represents a plot of land with trees. Area A is sparsely populated with trees; areas B
are densely populated with trees. The total area is known and is very large, say,
thousands of square kilometers. The task is to estimate how many trees are in the plot.
Since there are, perhaps, tens of millions of trees, counting them is out of the question.
Sampling is the only reasonable approach. First, what will you measure? Since you
know the total area, you can sample tree density $\rho$, e.g., trees per hectare (100 m x 100 m),
and calculate the total number of trees from that. Recall that one of the goals of
sampling is to accurately represent the total population in the sample. In this case, the
population is the distribution of trees. And its distribution would look like this:
[Figure: sketch of the distribution f(ρ) against tree density ρ, with portions labeled A and B.]
One strategy is to randomly pick locations within the area. Then, use those as
center points about which you measure your 100 m x 100 m sections. Then count the trees
in each of these sections. Since the locations are chosen randomly, you can expect, on average, to
sample the right proportion of A and B areas. So, you expect to be able to estimate a
representative statistic: average trees per hectare. The difficulty is that the statistic you
would deduce would have a fairly high variance. Is there another sampling strategy that
would have a lower variance? The answer is yes. The land area consists of two
subpopulations: high tree density and low tree density. The sampling strategy we
outlined above is based on random sampling over the entire area. It turns out that we can
significantly lower the variance on our estimates if we separately sample each of the
subareas and combine their results. This is the technique of stratified sampling.
Here’s why it works. First, we’ll simplify the problem so we don’t get bogged
down in algebra. Make the assumption that half the area is of type A, and half of type B.
And, let’s denote the sample average tree density of the areas as $\bar{\rho}_A$ and $\bar{\rho}_B$. Since the
areas are of equal size, we can write the average density of the whole plot as $\bar{\rho} = (\bar{\rho}_A + \bar{\rho}_B)/2$.
If we define $D = \bar{\rho} - \bar{\rho}_A$, then $-D = \bar{\rho} - \bar{\rho}_B$. That is, D is the difference
between the global mean $\bar{\rho}$ and the individual means $\bar{\rho}_A$ and $\bar{\rho}_B$. Now we can write
the sample variance of $\rho$ as:
$$s_\rho^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\rho_i - \bar{\rho}\right)^2$$
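But, if we reorganize this equation to indicate from which area the samples were taken,
we would obtain:
$$s_\rho^2 = \frac{1}{N}\left[\sum_{i=1}^{N_A}\left(\rho_i - \bar{\rho}\right)^2 + \sum_{i=1}^{N_B}\left(\rho_i - \bar{\rho}\right)^2\right],$$
where $N_A + N_B = N$. But $\bar{\rho}$ can be expressed in
terms of D and the individual means: $\bar{\rho} = \bar{\rho}_A + D = \bar{\rho}_B - D$. So we get:
$$s_\rho^2 = \frac{1}{N}\left[\frac{N_A}{N_A}\sum_{i=1}^{N_A}\left(\rho_i - \bar{\rho}_A - D\right)^2 + \frac{N_B}{N_B}\sum_{i=1}^{N_B}\left(\rho_i - \bar{\rho}_B + D\right)^2\right].$$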
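Expanding the squared terms,
$$s_\rho^2 = \frac{1}{N}\left[\frac{N_A}{N_A}\sum_{i=1}^{N_A}\left(\left(\rho_i - \bar{\rho}_A\right)^2 - 2D\left(\rho_i - \bar{\rho}_A\right) + D^2\right) + \frac{N_B}{N_B}\sum_{i=1}^{N_B}\left(\left(\rho_i - \bar{\rho}_B\right)^2 + 2D\left(\rho_i - \bar{\rho}_B\right) + D^2\right)\right]$$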
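The first sum can be rewritten as:
$$\frac{N_A}{N}\left[\frac{1}{N_A}\sum_{i=1}^{N_A}\left(\rho_i - \bar{\rho}_A\right)^2 - \frac{2D}{N_A}\sum_{i=1}^{N_A}\left(\rho_i - \bar{\rho}_A\right) + D^2\right] = \frac{N_A}{N}\left(s_{\rho_A}^2 + D^2\right)$$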
What happened to the middle term? It’s zero because it’s simply the sum of the deviations of the
observations about their own mean. The second sum in the previous equation gives a similar
result. So, now we can write:
$$s_\rho^2 = \frac{N_A}{N}s_{\rho_A}^2 + \frac{N_B}{N}s_{\rho_B}^2 + D^2.$$
What does this say? It says that the variance of a set of
random samples taken from the entire plot consists of three elements: a weighted
variance of samples taken from region A, a weighted variance of samples taken from
regions B, and the square of D, the difference between the global mean tree density and the mean
of either area. So, even if the distribution of trees is very narrow in each of the two types of
areas, the sample variance could be large, simply because of the difference in the average
tree density between the two areas. There must be a better way! And, there is.
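The three-term decomposition can be verified numerically. The sketch below assumes equal numbers of samples from A and B (matching the half-and-half simplification made earlier) and uses made-up Gaussian tree densities; the function name and all numbers are illustrative only.

```python
import random
import statistics

def decomposed_variance(samples_a, samples_b):
    """Weighted within-area variances plus D squared, for equal sample counts from A and B."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal numbers of samples from A and B")
    n_a, n_b = len(samples_a), len(samples_b)
    n = n_a + n_b
    mean_a = statistics.fmean(samples_a)
    mean_b = statistics.fmean(samples_b)
    global_mean = (mean_a + mean_b) / 2.0
    d = global_mean - mean_a                     # D = global mean minus mean of area A
    s2_a = statistics.pvariance(samples_a)       # variance with a 1/N_A divisor
    s2_b = statistics.pvariance(samples_b)       # variance with a 1/N_B divisor
    return (n_a / n) * s2_a + (n_b / n) * s2_b + d ** 2

rng = random.Random(1)
area_a = [rng.gauss(20.0, 5.0) for _ in range(50)]     # sparse area, trees per hectare (assumed)
area_b = [rng.gauss(200.0, 20.0) for _ in range(50)]   # dense area, trees per hectare (assumed)

direct = statistics.pvariance(area_a + area_b)         # variance of the pooled sample
print("direct:    ", direct)
print("decomposed:", decomposed_variance(area_a, area_b))
```

The two numbers agree up to rounding, and almost all of the pooled variance comes from the $D^2$ term.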
Rather than carrying out global random sampling, carry out stratified random
sampling. That is, sample the areas A and B separately and obtain $s_{\rho_A}^2$ and $s_{\rho_B}^2$. We
have eliminated the last term in the equation for $s_\rho^2$, at least almost. D is, of course,
constant; but we don’t know what that constant is. We can only estimate it from the
sample data. And that estimate will have some variation. It is that variation that we must
add to the total variance $s_\rho^2$:
$$s_\rho^2 = \frac{N_A}{N}s_{\rho_A}^2 + \frac{N_B}{N}s_{\rho_B}^2 + s_D^2$$
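To see how much stratification helps, one can repeat both sampling strategies many times and compare how much the resulting estimates of average tree density scatter. This is a sketch under assumed conditions: half the land is type A and half type B, and the densities in each type are Gaussian with made-up means and standard deviations.

```python
import random
import statistics

MEAN_A, SD_A = 20.0, 5.0      # sparse area, trees per hectare (assumed values)
MEAN_B, SD_B = 200.0, 20.0    # dense area, trees per hectare (assumed values)

def density_at_random_location(rng):
    """Tree density at a random location; it falls in A or B with equal probability."""
    if rng.random() < 0.5:
        return rng.gauss(MEAN_A, SD_A)
    return rng.gauss(MEAN_B, SD_B)

def global_estimate(rng, n):
    """Average density from n locations chosen at random over the whole plot."""
    return statistics.fmean(density_at_random_location(rng) for _ in range(n))

def stratified_estimate(rng, n):
    """Average density from n/2 locations in A and n/2 in B, combined with equal weights."""
    half = n // 2
    mean_a = statistics.fmean(rng.gauss(MEAN_A, SD_A) for _ in range(half))
    mean_b = statistics.fmean(rng.gauss(MEAN_B, SD_B) for _ in range(half))
    return (mean_a + mean_b) / 2.0

def variance_of_estimates(estimator, n=20, trials=2000, seed=0):
    """Variance of the estimate over many repeated sampling campaigns."""
    rng = random.Random(seed)
    return statistics.pvariance([estimator(rng, n) for _ in range(trials)])

print("global random sampling:", variance_of_estimates(global_estimate))
print("stratified sampling:   ", variance_of_estimates(stratified_estimate))
```

With these numbers the stratified estimate scatters far less, because the large difference between the two area means no longer enters the variance.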
PROPAGATION OF ERROR
To conclude this section on random processes we’ll discuss their impact on
measurement error. Usually, we will take some measurement $\hat{m}_i$ and presume that it
consists of the real value m and some error $e_i$, i.e., $\hat{m}_i = m + e_i$. And we presume that the
error $e_i$ has zero average. This means that the average value of a measurement $\hat{m}_i$ would
be expected to equal the true value m. If the $e_i$’s do not have zero average then our measurement is biased. Or we can
say that it contains a systematic error. Of course, we always try to carry out a
measurement that does not contain a systematic error.
But, just because we can carry out a measurement with an average error of zero,
does not mean that we will be free of systematic error. Depending on how we use that
measurement, it is possible to introduce one. We can illustrate this with a very simple
example. Suppose we want to estimate the area of a square. So, we measure the length
of a side, and we square that value. Just to make sure, we do this a number of times and
take an average:
$$\frac{1}{N}\sum_{i=1}^{N}\hat{m}_i^2 = \frac{1}{N}\sum_{i=1}^{N}(m + e_i)^2 = \frac{1}{N}\sum_{i=1}^{N}m^2 + \frac{2m}{N}\sum_{i=1}^{N}e_i + \frac{1}{N}\sum_{i=1}^{N}e_i^2$$
The right hand side of this equation has three terms. The first is the true area $m^2$. The
second is the average of a random variable with zero mean. So, on average, it’s zero. But the third is
the average of a random variable which is squared. So every item in the sum is positive, and its
average is the variance of the error.
This term has introduced a bias into our estimate of the area A, even though our measurement
error was not biased. So, on average, we will overestimate A.
The reason that we have created this bias is because we have used the
measurement (and the error) in a non-linear way. This means that the error does not
appear in the calculations just as a first power. Here, in one term, the error is squared. Is
there a way to carry out the measurement so that we don’t introduce such a problem? In
general, no. But, in this case, yes. All we need to do is to take two measurements of the
square: one to measure the height; one to measure the width--even though they're
supposed to be the same length. If we do that, then we can calculate the average area as
$$\frac{1}{N}\sum_{i=1}^{N}\hat{h}_i\hat{w}_i = \frac{1}{N}\sum_{i=1}^{N}(h + e_i)(w + \varepsilon_i) = \frac{1}{N}\sum_{i=1}^{N}hw + \frac{w}{N}\sum_{i=1}^{N}e_i + \frac{h}{N}\sum_{i=1}^{N}\varepsilon_i + \frac{1}{N}\sum_{i=1}^{N}e_i\varepsilon_i,$$
where h and w are the true height and width of the square, and $e_i$ and $\varepsilon_i$ are the errors in taking
those measurements. Now, if you look at the terms, all but the first average to zero.
Terms two and three average to zero because the random errors $e_i$ and $\varepsilon_i$ have zero
averages. Term four is the sum of $e_i\varepsilon_i$, products of errors which are independent of one another. So,
the expected value of this product is zero: some terms will be positive; some will be
negative.
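Both ways of estimating the area of the square are easy to compare by simulation. The sketch below assumes a true side of 10 and independent Gaussian measurement errors with standard deviation 1; these numbers and the function name are illustrative choices only.

```python
import random
import statistics

def area_estimates(m, sigma, n, seed=0):
    """Return (average of squared single measurements, average of height times width)."""
    rng = random.Random(seed)
    squared = []
    product = []
    for _ in range(n):
        side = m + rng.gauss(0.0, sigma)      # one measurement, then squared
        squared.append(side ** 2)
        height = m + rng.gauss(0.0, sigma)    # two separate measurements
        width = m + rng.gauss(0.0, sigma)     # with independent errors
        product.append(height * width)
    return statistics.fmean(squared), statistics.fmean(product)

m, sigma = 10.0, 1.0
avg_squared, avg_product = area_estimates(m, sigma, 100_000)
print("true area:               ", m ** 2)
print("average of side squared: ", round(avg_squared, 2))
print("average of height x width:", round(avg_product, 2))
```

The first average should sit above the true area by about the variance of the error (here 1), while the second should sit close to the true area.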
An example of a measurement which will always have a biased error is estimating the
height of a tree by measuring out some distance X from the tree, then measuring the angle $\theta$ to
the top of the tree. Then, the height of the tree is $h = X\tan(\theta)$.
[Figure: a tree of height h viewed from horizontal distance X at elevation angle θ.]
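In N measurements, we would obtain an average height of the tree as:
$$\frac{1}{N}\sum_{i=1}^{N}\hat{X}_i\tan(\hat{\theta}_i) = \frac{1}{N}\sum_{i=1}^{N}(X + e_i)\tan(\theta + \varepsilon_i) = \frac{1}{N}\sum_{i=1}^{N}X\tan(\theta + \varepsilon_i) + \frac{1}{N}\sum_{i=1}^{N}e_i\tan(\theta + \varepsilon_i)$$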
Here, the mathematics begins to get sticky. On average the second term is zero because it
is the product of one random error times a function of another, independent, random
error. But the average of the first term is not $X\tan(\theta)$. Its value will depend on $\varepsilon$. And,
since there is no closed form expression which separates $\varepsilon$ from the other variables, we
can't even calculate its effect. However, if you're curious, you can experiment with the
random function simulator and see for yourself. Especially when $\theta$ is large, say greater
than 60°, the nonlinearity in the problem yields quite biased results.
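For readers without the random function simulator, the experiment might look like the following sketch. The distance of 20 m, the error sizes (0.1 m in distance, 2 degrees in angle), and the Gaussian error model are all assumptions made for this illustration.

```python
import math
import random
import statistics

def mean_height_estimate(x, theta_deg, sigma_x, sigma_theta_deg, n, seed=0):
    """Average of n tree-height estimates, each computed from noisy distance and angle."""
    rng = random.Random(seed)
    theta = math.radians(theta_deg)
    sigma_theta = math.radians(sigma_theta_deg)
    return statistics.fmean(
        (x + rng.gauss(0.0, sigma_x)) * math.tan(theta + rng.gauss(0.0, sigma_theta))
        for _ in range(n)
    )

def relative_bias(x, theta_deg, sigma_x=0.1, sigma_theta_deg=2.0, n=100_000):
    """Fractional difference between the averaged estimate and the true height."""
    true_height = x * math.tan(math.radians(theta_deg))
    estimate = mean_height_estimate(x, theta_deg, sigma_x, sigma_theta_deg, n)
    return estimate / true_height - 1.0

for angle in (30.0, 70.0):
    print(f"theta = {angle:.0f} degrees, relative bias = {relative_bias(20.0, angle):+.4f}")
```

With these assumed error sizes, the averaged height at 70 degrees comes out noticeably too large, while at 30 degrees the bias is much smaller.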
The Calculus of errors
There's another way to estimate the effect of measurement error that has nothing
to do with probability or random processes. It involves the way a function F(x,y)
changes as x and y change--the essential ideas of calculus. We'll begin with a problem.
Suppose we want to calculate the volume of a structure that consists of a cone resting
upon a rectangular parallelepiped. The total volume of this structure is:
$$V = \frac{1}{3}\pi R^2 H_c + LWH.$$
We will not measure V directly, but rather we'll calculate V by taking measurements of R,
Hc, L, W, and H. But suppose these measurements are not perfectly accurate. So the
question is how much error will we introduce into our calculation of V by using
inaccurate values for the measured variables?
If each measurement is in error, then the calculated volume would consist of the
true volume V plus an error v. The relation between the error-borne measurements and
the resulting calculated volume would be:
$$V + v = \frac{1}{3}\pi (R + r)^2 (H_c + h_c) + (L + l)(W + w)(H + h).$$
Expanding this equation, then subtracting out the equation for V, we get
$$v = \frac{1}{3}\pi\left[2RH_c r + R^2 h_c + H_c r^2 + 2R r h_c + h_c r^2\right] + HWl + HLw + LWh + Hlw + Lhw + Whl + lwh.$$
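First, let's look at v statistically, i.e., what would be the average value of the error v if
measurements were taken many times and averaged. Using overbar notation, we obtain
$$\bar{v} = \frac{1}{3}\pi\left[2RH_c\,\bar{r} + R^2\,\overline{h_c} + H_c\,\overline{r^2} + 2R\,\overline{r h_c} + \overline{h_c r^2}\right] + HW\,\bar{l} + HL\,\bar{w} + LW\,\bar{h} + H\,\overline{lw} + L\,\overline{hw} + W\,\overline{hl} + \overline{lwh}.$$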
If $h_c$, r, l, w, and h are all independent random variables with zero mean, then we get
$$\bar{v} = \frac{1}{3}\pi H_c\,\overline{r^2}.$$
All the other terms average to zero.
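This result can be checked exactly for one simple error model. In the sketch below, each error is assumed to take the value +δ or -δ with equal probability, independently of the others, so that every error has zero mean and $\overline{r^2} = \delta_r^2$. Averaging v over all 32 sign combinations then gives the exact expected error. The error model, the dimensions, and the function names are assumptions made for illustration.

```python
import itertools
import math

def volume(R, Hc, L, W, H):
    """Volume of a cone (radius R, height Hc) resting on an L x W x H box."""
    return math.pi * R ** 2 * Hc / 3.0 + L * W * H

def mean_volume_error(true_values, deltas):
    """Exact average of v when each error is +delta or -delta with equal probability."""
    v_true = volume(*true_values)
    errors = []
    for signs in itertools.product((-1.0, 1.0), repeat=len(true_values)):
        measured = [x + s * d for x, s, d in zip(true_values, signs, deltas)]
        errors.append(volume(*measured) - v_true)
    return sum(errors) / len(errors)

true_values = (2.0, 3.0, 4.0, 5.0, 6.0)   # R, Hc, L, W, H (assumed dimensions)
deltas = (0.1, 0.1, 0.1, 0.1, 0.1)        # error magnitudes for r, hc, l, w, h

predicted = math.pi * true_values[1] * deltas[0] ** 2 / 3.0   # (1/3) pi Hc r^2
print("average error from enumeration:", mean_volume_error(true_values, deltas))
print("predicted (1/3) pi Hc r^2:     ", predicted)
```

The two values agree up to rounding, confirming that only the $H_c\,\overline{r^2}$ term survives the averaging.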
Thus, we have deduced the average (expected) value of the error v. But, suppose we can't take
a lot of sample measurements, and we would like to know what is the "worst case" error.
That's fairly simple. Suppose we can estimate the maximum possible error on each
measurement. Let these be labeled $r_{\max}$, $l_{\max}$, $h_{c\,\max}$, $h_{\max}$, and $w_{\max}$. Then
$$v_{\max} = \frac{1}{3}\pi\left[2RH_c r_{\max} + R^2 h_{c\,\max} + H_c r_{\max}^2 + 2R\,r_{\max} h_{c\,\max} + h_{c\,\max} r_{\max}^2\right] + HW l_{\max} + HL w_{\max} + LW h_{\max} + H l_{\max} w_{\max} + L h_{\max} w_{\max} + W h_{\max} l_{\max} + l_{\max} h_{\max} w_{\max}.$$
A simplification of this is to assume that $r_{\max}$, $l_{\max}$, $h_{c\,\max}$, $h_{\max}$, and $w_{\max}$ are all very small
compared to R, L, $H_c$, H, and W. Then terms containing two or more of these maximum
errors will be much smaller than terms containing only one. Consequently, if we ignore
these smaller terms, $v_{\max}$ can be approximated as
$$v_{\max} \approx \frac{1}{3}\pi\left(2RH_c r_{\max} + R^2 h_{c\,\max}\right) + LW h_{\max} + LH w_{\max} + WH l_{\max}.$$
There's a reason for making this simplification, even though it's only an approximation to
the maximum error in v. The reason is that this simplified expression is easily deduced
for any combination of measurements. This is the result that one would obtain by taking
the total differential of the function $V(R, H_c, L, W, H) = \frac{1}{3}\pi R^2 H_c + LWH$. The
total differential is defined like this: If F is a differentiable function depending on n
variables $x_1, x_2, \ldots, x_n$, then infinitesimal variations in F are determined by infinitesimal
variations in the $x_i$'s as:
$$dF(x_1, x_2, \ldots, x_n) = \frac{\partial F}{\partial x_1}dx_1 + \frac{\partial F}{\partial x_2}dx_2 + \cdots + \frac{\partial F}{\partial x_n}dx_n$$
dF is called the “total differential”. Our variations $r_{\max}$, $l_{\max}$, $h_{c\,\max}$, $h_{\max}$, and $w_{\max}$ are not
infinitesimal. But, if they're quite small, then the total differential is a pretty good
approximation to the total error $v_{\max}$ (that is, dF in the notation immediately above).
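How good the approximation is can be checked numerically. The sketch below estimates each partial derivative by a central finite difference, sums the magnitudes times the maximum errors, and compares the result with the exact worst-case change in V. The dimensions, the 0.01 error on every measurement, and the function names are assumptions for illustration; the finite-difference step is a convenience here, not part of the text's method.

```python
import math

def volume(R, Hc, L, W, H):
    """Volume of a cone (radius R, height Hc) resting on an L x W x H box."""
    return math.pi * R ** 2 * Hc / 3.0 + L * W * H

def total_differential_bound(func, values, max_errors, step=1e-6):
    """Sum of |dF/dx_i| * (maximum error in x_i), with partials from central differences."""
    bound = 0.0
    for i, err in enumerate(max_errors):
        up = list(values)
        down = list(values)
        up[i] += step
        down[i] -= step
        partial = (func(*up) - func(*down)) / (2.0 * step)
        bound += abs(partial) * err
    return bound

def exact_worst_case(func, values, max_errors):
    """Change in F when every error is at its maximum with a positive sign.
    This is the worst case here because V increases with every one of its variables."""
    shifted = [x + e for x, e in zip(values, max_errors)]
    return func(*shifted) - func(*values)

values = (2.0, 3.0, 4.0, 5.0, 6.0)            # R, Hc, L, W, H (assumed)
max_errors = (0.01, 0.01, 0.01, 0.01, 0.01)   # r, hc, l, w, h maxima (assumed)

approx = total_differential_bound(volume, values, max_errors)
exact = exact_worst_case(volume, values, max_errors)
print("total differential:", round(approx, 6))
print("exact worst case:  ", round(exact, 6))
```

For errors this small the two values differ by well under one percent, and the exact value is the larger of the two because the neglected terms are all positive.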
Notice that the value for vmax really is an error that you would never expect to
have. Not only are all of the measurement errors assumed to be at their maximum, but all
of them are contributing with the same sign. That is, there are no canceling errors. If the
measurement errors are truly random variables with zero mean, the expected error in a
calculated value of V would be much less. Nevertheless, estimating error using this
calculus can be extremely valuable, especially if you must absolutely determine some
parameter within a specified error.
Another way of representing this maximum error is by percentages. In our
example, if the total volume V is separated into its constituent pieces $V = V_c + V_p$, where
the subscripts c and p refer to the cone and parallelepiped, respectively, then the above
equation can be rewritten as
$$v_{\max} = V_c\left(2\frac{r_{\max}}{R} + \frac{h_{c\,\max}}{H_c}\right) + V_p\left(\frac{h_{\max}}{H} + \frac{w_{\max}}{W} + \frac{l_{\max}}{L}\right).$$
This equation shows that percentage errors in the parallelepiped measurements, e.g.,
$h_{\max}/H$, are linearly additive with weight $V_p$, whereas a percentage error in the
measurement of R is doubly additive with weight $V_c$. Dividing both sides by V gives the
maximum error as a fraction of the total volume, with weights $V_c/V$ and $V_p/V$.
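As a small worked case of the percentage form, suppose every measurement could be off by at most 1%. The sketch below (function name and dimensions are illustrative assumptions) evaluates the weighted sum of fractional errors.

```python
import math

def worst_case_from_fractions(R, Hc, L, W, H, r_max, hc_max, l_max, w_max, h_max):
    """First-order worst-case volume error written in terms of fractional errors."""
    v_cone = math.pi * R ** 2 * Hc / 3.0
    v_box = L * W * H
    return (v_cone * (2.0 * r_max / R + hc_max / Hc)
            + v_box * (h_max / H + w_max / W + l_max / L))

R, Hc, L, W, H = 2.0, 3.0, 4.0, 5.0, 6.0       # assumed dimensions
fraction = 0.01                                # 1% maximum error on every measurement
v_max = worst_case_from_fractions(R, Hc, L, W, H,
                                  fraction * R, fraction * Hc,
                                  fraction * L, fraction * W, fraction * H)
total_volume = math.pi * R ** 2 * Hc / 3.0 + L * W * H
print("v_max / V =", round(v_max / total_volume, 4))
```

The cone contributes 2% + 1% of its volume and the box 1% + 1% + 1% of its volume, so in this case the worst-case error is 3% of V whatever the dimensions are.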
For more information on this technique, look under “total differential” or
“calculus of errors” in elementary calculus books.