Johns Hopkins University
What is Engineering?
Michael Karweit
Uncertainty
Many physical processes, like the falling of a ball under the influence of gravity,
or the motion of the sun across the sky, or the acceleration of air across the surface of an
airplane wing, can be modeled explicitly as a set of equations. Then, if we want to
predict, for example, the distance a ball falls in one second under gravity when it starts
from rest, we simply plug in the value for gravity g = 9.8 m/s² and t = 1 s, and use the
predictive equation s = ½ g t² to obtain s = 4.9 m. At this level of sophistication, this is a
deterministic process, i.e., we can specify the parameters and initial conditions and,
then, calculate the results.
Let’s contrast this with another physical process: that of rolling a die. Now, what
is the predictive equation for the outcome? In principle, we could come up with a set of
equations, but they would be horrendously complicated and not at all practical. So, in effect,
we have a process whose outcome is unpredictable. When a process produces an
outcome that is almost completely unpredictable, we say it is a random process or
stochastic process.
The distinction between a deterministic process and a stochastic one is not
absolute. In our rolling die example, we could produce a set of equations to model its
motion. But the equations would depend critically on a number of parameters, whose
specifications might be almost impossible to prescribe. For example, the initial
orientation of the die, its initial release velocity and angular momentum, its height above
the table, and the characteristics of the table's surface would all play critical roles in how the
die would come to rest. Even the geometry would be important—whether the edges were
sharp or rounded. And not only whether rounded, but how much. In practice you could
never prescribe these parameters with sufficient accuracy to ensure that the equations
would predict the outcome. So, in practice, the process is random.
(But sometimes, an apparently-perfect random process is not so perfect. The
roulette wheel in gambling casinos is supposed to be perfectly random. In this apparatus
a small ball is supposed to fall with equal probability in one of the 38 numbered slots in
the wheel. But some years ago, a graduate student, Albert Hibbs, at the University of
Chicago and visiting Las Vegas, began recording the slots into which the ball fell on one
roulette wheel. After several days, he realized that the wheel favored some slots over
others, i.e., it was imperfectly made. So with some strategic wagering he was able to
“beat the odds”, and, additionally, made a name for himself.)
How completely we model a physical process will determine how well we can
predict behavior. Usually we try to include those parameters that have large and
systematic influences. That is, we try to include parameters that affect the process in a
significant and expected way. In our first example of a falling ball, our model was the
simple relationship s = ½ g t2 . One of the parameters neglected in this model is air
resistance. If we were to use this simple equation as a predictor of an actual experiment,
we would always overestimate distance, because the neglected ingredient always acts to
slow the ball. That is, the element that we neglected had a systematic effect.
4/23/01
At some point we stop trying to include additional parameters in our model
because the model becomes too cumbersome. And we hope that the remaining
parameters will have a relatively small influence on prediction. Further, we hope that
these unmodeled parameters will partially cancel one another out, i.e., they won’t all act
to affect the process in one direction. (Eventually, there is a limit beyond which
deterministic modeling is not possible. You’ve probably heard of the Heisenberg
uncertainty principle.) Often a process is considered mostly deterministic or mostly
random based on practicality. Is it worth it to expand the model to include additional
details?
Let’s take the example of water pressure in a municipal water system. If, at some
point within the system, you were to measure pressure as a function of time you would
discover it fluctuates, sometimes smoothly, sometimes wildly. In principle, one could
model the system and make predictions from it, but the system is so complex that it
would not be practical to do so in detail. Pump pressure, pipe diameter, and pipe
roughness could be relatively easily included in a model. Those parameters are relatively
constant. But water pressure also depends on flow rate. So, every time someone turns on
a faucet or flushes a toilet, it has an effect. And it would be impossible to monitor all
faucets in the system to be able to include their individual impacts on the system. So,
what to do?
One answer is to predict “in the average”. That is, make some assumptions about
the distribution of faucets turned on and toilets flushed, and use a measure or statistic
to characterize a “typical” situation, perhaps by time of day or day of week. Then use
this statistic to approximate the actual conditions in the model.
Predictions resulting from an averaged input will be imperfect, but they could be
reasonably useful. (Actually, one of the more interesting problems for water-works
engineers, and one that defies getting usable predictions from “averaged” inputs, is the
Super Bowl problem. In a period of about twenty minutes during Super Bowl half-time, 50
million toilets are flushed throughout the U.S.—a once-a-year occurrence. This wreaks
havoc with municipal water systems.)
Uncertainty and randomness enter the engineering world in yet another way:
measurement. If fifty people were given meter sticks and you asked them to measure the
length of a soccer field, you would get fifty different answers. They might be closely
clustered, but they would be different. Why? The soccer field isn’t changing in size.
The answer is that errors are introduced in taking the measurement. Maybe the meter
sticks are not all exactly one meter long. Maybe the tick marks on the meter sticks were
read incorrectly. Maybe the number of meter stick lengths along the field was
miscounted. Maybe the meter sticks were not laid in a straight line. There are a lot of
“maybes”.
So, with fifty different answers, how long is the soccer field? Really, we can’t
tell. But we can estimate its most likely length by taking the average value of all the
measurements. So, let’s say that average value is 112 m. How confident are we that the
actual length is 112m? If all fifty measurements lie between 111.5m and 112.5m, we’re
pretty confident. But, if the fifty measurements lie between 100m and 120m, we would
be far less sure. So, the spread or distribution of values makes a difference in the
confidence of our estimate. Can we actually quantify measurement confidence? How
can we deal with non-deterministic quantities? How can we characterize random
processes or distributions of outcomes? These are all questions vital to engineering. And
they are addressed in terms of probability and statistics.
DISTRIBUTIONS
With random processes we can never predict a specific outcome—that’s what
makes it random. But we might be able to deduce the likelihood that a particular
outcome will occur. That can be very helpful. But determining this likelihood requires
knowledge of the distribution of possible outcomes. Sometimes we can infer what the
distribution is; other times we cannot. In the case of a “perfect” die, we presume that
each side of the die is equally likely to land face up. So, 1/6th of the time we would
expect to find a “2” face up, for example. And the same would be true for each of the
other numbers. Another way of saying it is that the probability of getting a “2” on any
one roll of the die would be 1/6 or 0.16667.
Probability is the likelihood that an event will occur, or a particular outcome will
occur. Probabilities always lie between 0.0 and 1.0. If the probability of an event is 0.0,
that means it will never occur. If the probability of an outcome is 1.0, that means it is
certain to occur.
What is the probability that a “1” or a “2” or a “3” or a “4” or a “5” or a “6” will
occur in one roll of the die? Since this event encompasses every possible outcome, its
probability must be 1.0 (presuming that the die cannot end up on an edge or corner).
That is, the sum of probabilities of all possible outcomes is 1.0. This fact allows us to
define a probability distribution function f(n), where n is a particular outcome. For a
rolled die f(1) = f(2) = f(3) = f(4) = f(5) = f(6) = 0.16667. A plot of this function looks
like this:
[Figure: a uniform probability distribution function, f(n) = 0.16667 for each outcome n = 1 through 6.]
This is a uniform distribution or flat distribution, i.e., each outcome is equally
likely to occur. And, since we have scaled the values so that the sum of the heights of the
rectangles is 6 × 0.16667 = 1.0, this plot can be thought of as a probability distribution
function. Mathematically it can be written as Σ_{n=1}^{N} f(n) = 1.0. This is a property of
probability distribution functions.
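A quick way to see this property empirically is to roll a simulated die many times and tally the outcomes. The following is a minimal Python sketch; the seed and the number of rolls are arbitrary choices, not part of the original discussion:

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Roll a fair die many times and tally the outcomes.
N_ROLLS = 60_000
counts = Counter(random.randint(1, 6) for _ in range(N_ROLLS))

# Empirical estimate of f(n) for each face.
f = {n: counts[n] / N_ROLLS for n in range(1, 7)}

for n in range(1, 7):
    print(f"f({n}) = {f[n]:.4f}")   # each should be near 1/6 = 0.16667

print("sum =", sum(f.values()))     # sums to 1.0 (up to rounding)
```

The individual frequencies wander a little around 1/6, but their sum is always 1.0: every roll lands in exactly one bin.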
An event can also be more complicated—for example the rolling of two dice.
Then we might define the outcome as the sum of the spots on the two dice. In this case
there are 6 x 6 = 36 possible ways the dice can land, each equally likely. But in those 36
ways, there are only 11 possible outcomes: the values 2 through 12. But each of these
values is not equally likely to occur. There is only one way to obtain a 2—when both
dice show a “1”. But there are four ways of obtaining a 5: (1,4), (4,1), (2,3), (3,2). So, if
every combination of faces is equally likely to occur, one would expect a 5 to occur
4/36ths of the time and a 2 to occur 1/36th of the time. Again, it’s useful to plot the
probability distribution function:
[Figure: the probability distribution function f(n) for the sum of two dice, n = 2 through 12, rising to a peak of 6/36 = 0.16667 at n = 7.]
And, again, because this plot includes every possible outcome, the sum of the heights of
the rectangles is one.
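Since there are only 36 equally likely combinations, a short Python sketch can reproduce this distribution exactly by enumeration rather than by simulation:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (die1, die2) combinations
# and tally the 11 possible sums, 2 through 12.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Exact probability distribution function: f(n) = counts[n] / 36.
f = {n: Fraction(counts[n], 36) for n in sorted(counts)}

print(f[2])             # 1/36: only (1,1) gives a 2
print(f[5])             # 4/36 = 1/9: (1,4), (4,1), (2,3), (3,2)
print(sum(f.values()))  # the probabilities sum to 1
```

Using exact fractions makes the "sum of heights equals one" property hold identically, with no floating-point rounding.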
The rolling of dice is an example of a random process which produces discrete
outcomes, i.e., outcomes which one can enumerate. There are also random processes
which produce non-denumerable or continuous outcomes, e.g., the angle at which a
spinner stops turning. Here, the probability that the spinner will stop at, say, exactly
30.0123456° is essentially zero. The reason is that 30.0123456° is only one of an infinite
number of possible values. So, how can this be useful? The answer is that, if we specify
a range of values over which the spinner may stop, then the probability becomes finite.
For example, the probability that the spinner will stop between 30° and 31° is 1/360.
With continuous outcomes, we no longer speak of probability distribution
functions, but rather of probability density functions. And, we can no longer scale the
sum of all possible outcomes f(n) as Σ_{n=1}^{N} f(n) = 1.0, because we cannot enumerate
individual outcomes. But we can write the equivalent expression using calculus. Let x
be a continuum of outcomes and f(x) be the probability density function of the occurrence
of x; then, over all possible values of x, the integral ∫_{−∞}^{∞} f(x) dx = 1.0. The probability
density function for the spinner is uniform and would be plotted as follows:
In the case of discrete
outcomes, the sum of heights of the
rectangles must add to 1.0. In the
case of continuous outcomes, the
area under the curve must equal 1.0.
Then, the probability of obtaining a
value between x = a and x = b is
the area under the curve between a
and b.
[Figure: the uniform probability density function for the spinner, f(x) = 1/360 for x between 0 and 360.]
n and x in the discussion above are called random variables. Their values are
distributed according to the generating random process.
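For the spinner, the area-under-the-curve rule reduces to simple arithmetic: the probability of an interval is its width times the constant density. A small Python sketch (the function name is ours, purely for illustration):

```python
# Probability that a uniform spinner stops between angles a and b (degrees):
# the area under the flat density f(x) = 1/360 between a and b.
def prob_between(a: float, b: float) -> float:
    """Area under the uniform density on [0, 360) between a and b."""
    a = max(a, 0.0)          # clip the interval to the spinner's range
    b = min(b, 360.0)
    return max(b - a, 0.0) / 360.0

print(prob_between(30, 31))   # 1/360, about 0.00278
print(prob_between(0, 360))   # 1.0: the whole circle
print(prob_between(45, 45))   # 0.0: a single exact value has zero probability
```

The last line makes the earlier point concrete: any single exact angle has probability zero, while any interval of nonzero width has a finite probability.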
Depending on the underlying random process, probability distributions or density
functions can take on many forms. One of the more well-known is this one:
[Figure: a bell-shaped probability density function f(x).]
This is often called a bell-shaped distribution. More formally, it is known as a
Gaussian distribution or normal distribution. It can be skinny or wide, but the area
under the curve is always 1.0. This function is extremely important in engineering.
We’ll discuss it in detail a little later.
One of the difficulties in studying random processes is that we almost never know
what the probability density function (pdf) is. Sometimes we have a general idea about
its shape, but not much more. In fact, to learn more, we usually infer its characteristics
by taking sample outcomes. From these data, we try to estimate its form. Remember,
there is some underlying process whose outcomes are probabilistically distributed. It is
the characteristics of that underlying process that we are trying to determine. (In the case
of Albert Hibbs, everyone’s initial expectation was that the roulette wheel had a flat or
uniform distribution function, i.e., each number on the wheel was equally likely to occur.
But Hibbs collected data and concluded that the distribution function was not perfectly
uniform—some numbers were more likely to occur than others. Based on his estimate of
the wheel’s distribution function, he was able to improve his odds of winning.)
Deducing pdfs, however, is not easy, because they don’t necessarily have
analytical forms, i.e., they may not be explicitly expressible as mathematical functions.
Nevertheless, we can learn something about a pdf’s characteristics by exploring its
moments. Every pdf can be expressed in terms of an infinite number of parameters called
moments. Moments characterize probability distribution functions just like Taylor series
polynomials characterize mathematical functions. But they characterize them in a
different way. Moments characterize pdf’s in terms of shape—spread, symmetry,
peakedness, etc. Taylor series polynomials characterize math functions in terms of
curvature: linear, quadratic, cubic. They are different techniques for exploring different
properties of functions.
In general, moments denote the effect of something which is applied at a distance.
For example, in physics there is the concept of torque—the twisting effect of applying a
force at the end of a lever arm. If the force is applied at right angles to the lever arm,
then the torque T = r * F, the product of the length of the lever arm r and the magnitude of the force F.
Torque is a moment.
In probability and statistics the idea is the same, except that the “something” is an
outcome, and the “distance” is how far that outcome is from zero or an average value.
Statistical moments characterize the shape of the distribution. For discrete and
continuous random processes, respectively, the Pth moment is defined as:
mP = Σ_{i=1}^{N} f(xi)(xi − μ)^P  (discrete),   mP = ∫_{−∞}^{∞} f(x)(x − μ)^P dx  (continuous),

where μ = Σ_{i=1}^{N} f(xi) xi and μ = ∫_{−∞}^{∞} f(x) x dx, respectively. For the discrete case, recall that
N is the total number of possible outcomes.
mP is called the Pth moment about the mean. μ is the mean or the average value of
x. In fact, μ is the 1st moment about zero. Each mP emphasizes a feature in the
distribution of f(x). So, if we knew the values of all the mP's, we could deduce the actual
probability density function f(x). Knowing all the moments tells us everything there is to
know about f(x). But, there are two problems: first, there are an infinite number of
moments; second, all we will have at our disposal is a set of sample outcomes produced
by the underlying random process.
It turns out that the first problem is not so serious, because in most applications,
almost everything we would like to know about a pdf is contained in the first several
moments. In fact, only rarely are we interested in more than the first four moments. The
second problem is a little more serious, because the best we can do is estimate what those
moments are. We will never be able to really know what they are, but with enough sampling,
we can estimate them arbitrarily closely.
Let’s start with a concrete example. Suppose we want to characterize the
distribution of defective screws coming off an assembly line. They’re packaged 1000
screws per box. So, how do we proceed? Maybe we decide to count the number of
defective screws in N = 100 boxes—the boxes selected randomly. (Note that we will now
use N to denote the number of samples, not the number of possible outcomes, as before.)
So, we have a list of xi's, or defective screws per box, for i = 1, ..., N. First, we’re interested
in the 1st moment about zero, i.e., the mean. On average, how many screws per box are
defective? To find the mean, we use the formula: x̄ = (1/N) Σ_{i=1}^{N} xi. What’s going on here?
Why is the mean denoted by x̄ and not μ? Second, why is the formula so different from
what was given before? (Actually, this is the equation that probably looks most familiar.)
The answer to the first question is that we can’t calculate μ. It’s a property of the
underlying pdf. We can only estimate μ from our sample data. (There’s a lot more than
100 boxes of screws coming off the assembly line.) We denote that estimate as x̄. Our
expectation is that x̄ is close to μ. In fact, we’ll even be able to calculate how close it’s
likely to be. Now the second question. In our original formula for calculating μ, we
considered all possible values of x, and we “weighted” them by their probability of
occurring. In adding up (or integrating) all weighted values, we arrived at μ. In
estimating x̄, however, the probability of getting a certain value of x has already been
taken into account by the sampling procedure. x’s with low probabilities of occurring
are not found very often in the sample. So, with the relative distribution of x’s already
accounted for, adding the unweighted samples is equivalent. And, since we’re interested
in the “per sample” average of x, we divide that sum by N.
So, now we have an estimate of the mean number of defective screws per box.
Maybe we’d like to know how that mean comes about. Do all boxes have 10.0 defective
screws? Or maybe most boxes have no defective screws, while a few boxes have many.
These are questions about the distribution of defects. To illustrate, here are some pdfs, all
with the same mean μ = 10.0.
[Figure: several pdfs of different shapes, all with the same mean of 10.]
From the point of view of our boxes of screws, these graphs represent the following
situations: a) almost all boxes have exactly 10 defective screws; b) boxes tend to have
roughly 5 or roughly 15 defective screws, but hardly anything else; c) many boxes have
about 9 defective screws, but the number can vary from one to very many. These are
quite different quality characteristics, yet they all have an average value of 10.
We get additional useful information with m2, the second moment about the
mean. This moment also has a special name: the variance. It’s a measure of the
“spread” of the distribution. In our example of defective screws, this statistic would tell
us how much variation there could be from box to box. Here the calculation is
s² = (1/N) Σ_{i=1}^{N} (xi − x̄)², where s² is a sample estimate of the true variance σ². Again, we can
only estimate the true variance. Often we’re most interested in √s², the square root of
the variance. This is called the standard deviation (s.d.) or standard error, depending
on the application. Notice that √s² has the same units as x, i.e., if x is in meters, then
the standard deviation is in meters; if x is the number of defective screws, then the
standard deviation is in number of defective screws.
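The two estimates are easy to compute directly. Here is a hedged Python sketch using a made-up list of per-box defect counts (the numbers are illustrative, not data from the text):

```python
import math

# Hypothetical counts of defective screws in ten sampled boxes
# (illustrative numbers, not data from the text).
x = [8, 12, 10, 9, 11, 10, 13, 7, 10, 10]
N = len(x)

# Sample mean: the estimate x-bar of the true mean mu.
x_bar = sum(x) / N

# Sample variance s^2, using the 1/N form from the text.
s2 = sum((xi - x_bar) ** 2 for xi in x) / N

# Standard deviation: square root of s^2, same units as x.
s = math.sqrt(s2)

print(x_bar)          # 10.0 defective screws per box
print(s2)             # 2.8
print(round(s, 3))    # 1.673
```

Note that s is reported in the same units as the data (defective screws per box), which is what makes the "10.0 ± 1.7" style of reporting meaningful.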
One can define lots of different variances and they all have the same meaning—
amount of spread. To specify which variance, one can use the notation var(·). For
example, to specify the variance of a random variable x as we did immediately above,
one can write var(x). But one can also calculate the variance of an estimate, say x̄, and
denote it as var(x̄). We might also want to know how much variation there is in our
estimate of the variance of x̄. That would be denoted as var(var(x̄)).
You probably already know the use of standard deviation: Suppose you’ve
received a score of 55 on an exam. Is that good or bad? What’s the first question you
ask: “What was the class average?” Suppose the answer is 50. Now, at least you know
that you did better than average. But how much better? Here’s where you ask your next
question: “What was the standard deviation?” Why the standard deviation? Because
the standard deviation indicates the spread of scores. And the standard deviation is
especially useful for exam scores because the distribution of scores on an exam is often
“bell-shaped” or “Gaussian”. And a Gaussian distribution is completely characterized by
its mean and standard deviation. (We’ll elaborate on this later.) So, if you know the
standard deviation you can estimate quite precisely how well you did with respect to the
rest of the class. For example, if the standard deviation is 10.0, that means that you are
only 0.5 standard deviations above the mean. Assuming a Gaussian distribution, that
means that approximately 31% of the class did better than you. Your score is OK, but
not great. On the other hand, if the standard deviation is 2.5, your score is 2.0 s.d.’s
above the class average. That means only about 2% of the class did better than you.
That’s terrific.
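These percentages can be checked with the Gaussian upper-tail probability, which the Python standard library exposes through math.erfc. A sketch (the function name is our own, for illustration):

```python
import math

def fraction_above(score: float, mean: float, sd: float) -> float:
    """Fraction of a Gaussian-distributed class scoring above `score`."""
    z = (score - mean) / sd                    # distance from the mean in s.d.'s
    return 0.5 * math.erfc(z / math.sqrt(2))   # Gaussian upper-tail probability

# A score of 55 in a class with mean 50:
print(fraction_above(55, 50, 10.0))   # z = 0.5 -> about 0.31 of the class did better
print(fraction_above(55, 50, 2.5))    # z = 2.0 -> about 0.023 did better
```

The two printed values reproduce the approximately 31% and 2% figures quoted above.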
In the example of defective screws, you would use the standard deviation in a
similar way. You might report your defective rate as 10.0 ± 2.3 per box, where the “2.3”
is the standard deviation. This characterization, then, relates not only the average rate,
but also suggests the variability of the rate of defects.
But not all distributions are Gaussian, which means that more moments are useful
in characterizing the distribution function. Two of these moments are m3 and m4. By
themselves they are not so useful, because their magnitudes depend on the choice of units of
the random variable x—in fact they scale as x³ and x⁴, respectively. To make these moments more
representative of the shape of the distribution function, it is customary to non-dimensionalize
them, i.e., normalize them with respect to another parameter—the
variance. Normalizing these two moments gives us the non-dimensional statistics
skewness and kurtosis:

skewness = [(1/N) Σ_{i=1}^{N} (xi − x̄)³] / (s²)^{3/2} ,  and  kurtosis = [(1/N) Σ_{i=1}^{N} (xi − x̄)⁴] / (s²)² .
Notice that the values have no units whatsoever. That is, they are unit independent. So if
your data were converted, say, from millimeters to kilometers, the result would be the
same.
Skewness is a measure of symmetry. A distribution with zero skewness will tend
to be symmetric about the mean. If the skewness is non-zero, the magnitude of the
skewness indicates how lopsided the distribution is. Notice, that we wouldn’t be able to
make that interpretation if only m3 were used, because different units for x would
produce different values for m3 even if they came from the same underlying distribution.
Non-dimensionalizing parameters is a very useful practice in statistics in particular, and
in engineering in general.
Finally, we have kurtosis. This is considered to be a measure of “peakedness”,
i.e., how “pointy” the distribution is. For reference, the kurtosis of a Gaussian
distribution is 3.0—a wonderful item of statistical trivia. And, if you want to add to your
statistics vocabulary, distributions are called leptokurtic, platykurtic, or mesokurtic,
depending on whether their kurtosis is greater than, less than, or equal to that of the Gaussian.
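Both normalized moments are straightforward to compute from a sample. A sketch in Python (helper names are ours), including a check that rescaling the data leaves the values unchanged:

```python
# Sample skewness and kurtosis, computed with the 1/N moment formulas
# from the text (function names are ours, for illustration).
def moments(x):
    n = len(x)
    mean = sum(x) / n
    s2 = sum((v - mean) ** 2 for v in x) / n   # variance
    m3 = sum((v - mean) ** 3 for v in x) / n   # 3rd moment about the mean
    m4 = sum((v - mean) ** 4 for v in x) / n   # 4th moment about the mean
    return s2, m3, m4

def skewness(x):
    s2, m3, _ = moments(x)
    return m3 / s2 ** 1.5        # non-dimensional: m3 / (s^2)^(3/2)

def kurtosis(x):
    s2, _, m4 = moments(x)
    return m4 / s2 ** 2          # non-dimensional: m4 / (s^2)^2

sym = [-2, -1, 0, 1, 2]          # a symmetric sample
print(skewness(sym))             # 0.0: symmetric about the mean
print(kurtosis(sym))             # 1.7 for this particular set
km = [v * 1000 for v in sym]     # same data in different units
print(kurtosis(km))              # still 1.7: unit independent
```

Converting the data from one unit to another (here, scaling by 1000) changes m3 and m4 enormously but leaves the normalized statistics untouched, which is exactly the point of non-dimensionalizing.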
There are, of course, an infinite number of additional moments to consider. But
knowing the mean and variance is often enough to make engineering predictions and
decisions. Remember these ideas and formulas.
In our example above, we estimated the mean number μ of defective screws as
x̄ = 10.0. How good an estimate is that? What we mean by “good” is: what is the
variance of the estimate x̄? Actually, if we know the variance of the xi, we can deduce the
variance of x̄. Here’s how.
To make the algebra easier, let’s define a new random variable yi = xi − x̄. That
means that the variance of y will be the same as that of x, but we can estimate it with the
more compact formula s² = (1/N) Σ_{i=1}^{N} yi². Suppose, now, that M yi's are averaged together,
each being sampled N times. We’ll label them y1, y2, ..., yM. Then, we can write

var(ȳ) = (1/N) Σ_{i=1}^{N} [(y1i + y2i + … + yMi)/M]²,

that is, N realizations each of M samples of yi which are then averaged together. We expand this to obtain:

var(ȳ) = (1/M²) [(1/N) Σ_{i=1}^{N} y1i² + (1/N) Σ_{i=1}^{N} y2i² + … + (1/N) Σ_{i=1}^{N} yMi²]
        + (2/M²) [(1/N) Σ_{i=1}^{N} y1i y2i + (1/N) Σ_{i=1}^{N} y1i y3i + … + (1/N) Σ_{i=1}^{N} y(M−1)i yMi]

The first set of terms contains all the squared values of yi; the second set of terms contains
all the cross products of yi. Each of the terms in the first set of brackets is nothing other
than the variance of yi. So the first set of terms reduces to s²/M. But what is the second
set? What, for example, is (1/N) Σ_{i=1}^{N} y1i y2i? Recall that the yi's are random variables
having zero mean. Most importantly, y1i is picked or sampled totally independently of
y2i, i.e., all the yi's are independent samples. This means that the value of one yi is
uncorrelated with any other yi. So, every cross product of yi's averages to zero, and all
the terms in the second set of brackets are zero. The final result is then: var(ȳ) = s²/M.
We really should say that the estimated var(ȳ) = s²/M; or we could say that var(ȳ) = σ²/M.
These appear to be subtle distinctions, but in the study of statistics, these distinctions are
very important. In almost all engineering applications, you will never know μ or σ; they
will have to be estimated from the data.
So, why do we care about this result? Look carefully. It says that an average
measurement has less variance than a single measurement. In fact, the variation is
reduced by the factor 1/M. If you wanted to know the width of a soccer field, intuitively
you might make three measurements and take the average value. Now you know why.
Your intuition told you that an average is a better estimate than a single measurement.
Here, we’ve demonstrated what that improvement is. Since we’re usually interested in
the standard deviation or standard error of a measurement, our improvement by taking
averages is proportional to 1/√M. So, I should now be able to ask you the following
question: If you know that the variability of defective screws per box is s² = 5.0, then
how many boxes must you sample to estimate the average number of defectives to within
0.1? That is, how big must M be so that the s.d. of the average number of defects x̄ is
less than 0.1? Questions like this arise in science and engineering very frequently.
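For this particular question the algebra is short: the s.d. of the average of M boxes is √(s²/M), so we need √(s²/M) at or below 0.1, i.e., M ≥ s²/0.1². A sketch of the calculation:

```python
import math

s2 = 5.0       # variance of defective screws per box (given in the text)
target = 0.1   # desired s.d. of the estimated mean

# The s.d. of an average of M boxes is sqrt(s2 / M), so we need
# sqrt(s2 / M) <= target, i.e., M >= s2 / target**2.
M = math.ceil(s2 / target ** 2)

print(M)                   # 500 boxes
print(math.sqrt(s2 / M))   # right at the 0.1 target; one more box goes strictly below
```

So a sample of about 500 boxes is needed, which shows how quickly the 1/√M improvement becomes expensive: each extra digit of precision costs a factor of 100 in sample size.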
Random processes, distribution functions, moments, uncertainty, etc., are
concepts that can be hard to grasp. And theory doesn’t always provide insight.
Fortunately, we can get a feel for some of these ideas using a random process simulator.
We’ll experiment with the simulator in our virtual lab at www.jhu.edu/virtlab/stats/statistics.htm
A while back I mentioned the task of measuring the length of a soccer field with
only a meter stick. Let’s think about that situation in some detail. Let’s presume that
every time you lay down the meter stick you introduce a random placement error ei. And,
let’s say that the random error is –1cm 30% of the time, 0cm 40% of the time, and +1cm
30% of the time. One meter at a time, you measure from one end of the soccer field to
the other, and you discover it takes 112.60 lengths of the meter stick to span the field, i.e.,
you think the field is 112.60 m long. At least, your measurements indicate that it’s 112.60
meters long. But with each measurement, you’ve introduced a possible error of ±1 cm.
And these errors have accumulated 113 times—one for each time you placed the meter stick.
How has this accumulation of errors Σ_{i=1}^{113} ei affected your result? One way of estimating
this is by carrying out the measurement again, and again, and again.
This is a process in uncertainty that we can simulate in our virtual lab. First, let
the random variable x in the simulation be used to represent the error ei. Then define the
distribution of x in terms of “individual values” as
x       −1     0     +1
Pr(x)   0.3    0.4   0.3
This is our definition of the error every time we take a measurement with the meter stick,
i.e., our measurement will be in error by –1cm, 0 cm, or 1 cm. First, let’s determine the
standard deviation of this error. Do this by setting w = x. That is, our final random
variable is nothing other than measurement error itself. Then, set the number of
realizations to 1000—that is, you want to get 1000 separate values of this random
variable. Then click on “draw”. What you’ll get is a distribution that consists of three
values: -1, 0, +1. Not so surprising. And their frequencies of occurrence should be
about 3 to 4 to 3. You should also obtain a calculated mean of approximately zero, and a
standard deviation of about 0.77. These two numbers partially characterize the nature of
the error in taking a single measurement with the meter stick.
Now, we need to see what effect this error has on our total measurement of the soccer
field. That requires, say, 113 measurements. So we can form a new random variable w,
which is the sum of 113 values of the random error x, as w = sum(x,113). This
expression will take the sum of 113 realizations of the random error x and add them
together. w will be the total error of our measurement. Again, we want to carry out this
“measurement” 1000 times. So, set the number of realizations to 1000, and click on
draw. What do you get? A fairly broad distribution of errors. Recall, that this is the
distribution of the sum of 113 individual errors added together. What is the average
value of this distribution? It is probably close to zero. And what about the spread or
standard deviation? That’s probably about 8 cm. Might we have predicted that 8 cm?
Yes. Here’s how.
The standard deviation of an average is 1/√M times the standard deviation of the
individual random variable. In this case we’re summing the random error 113 times.
That’s the same as taking its average, except we’re not dividing by the total number of
elements in the sum. So, we might predict that the standard deviation of this sum would
be √M times the s.d. of the individual error, or √113 × 0.77 = 8.18. That’s just about
right. So the theory does work. (Or, if you’re a skeptic, you might say the simulation
works.)
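The whole 113-placement measurement can likewise be simulated directly (again a sketch, with an arbitrary seed):

```python
import math
import random

random.seed(7)   # arbitrary fixed seed so the run is repeatable

# One meter-stick placement error: -1 cm 30% of the time, 0 cm 40%, +1 cm 30%.
def one_error() -> int:
    return random.choices((-1, 0, 1), weights=(0.3, 0.4, 0.3))[0]

# One complete measurement of the field accumulates 113 placement errors.
def total_error() -> int:
    return sum(one_error() for _ in range(113))

# Repeat the whole measurement 1000 times, as in the simulator.
w = [total_error() for _ in range(1000)]

mean = sum(w) / len(w)
sd = math.sqrt(sum((v - mean) ** 2 for v in w) / len(w))

print(round(mean, 1))   # close to 0
print(round(sd, 1))     # close to sqrt(113 * 0.6), about 8.2 cm
```

The simulated spread agrees with the √M prediction, and a histogram of w would show the bell shape that the Normal-curve button reveals in the virtual lab.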
Now, click on the button “Normal curve”. This will produce a Gaussian curve with the
same mean and standard deviation as the displayed distribution. It looks pretty close. So,
the sum of 113 errors which are distributed as –1, 0, +1 is approximately a Normal
distribution. Interesting. Would you have expected anything else?
Usually, we are required to include some indication of accuracy when we report a
measurement. That report is usually the measurement ± one standard deviation. ± one
standard deviation is often called the standard error. Thus we would report the
length of the soccer field as 112.60 m ± 8 cm.
But that value of 112.60m is based on a single measurement. What might be a
better estimate? An average, of course. What we mean by improvement is that the
standard error is ± 8 cm (based on the
error simulation). Let’s see what happens if we take a number of measurements. Again,
we can use our random variable simulator to get an answer. First, let’s assume that the
distribution of errors for our total measurement is Gaussian distributed with a mean of 0
cm and a standard deviation of 8 cm. That's roughly what we discovered in our first
simulation. Let’s begin all over again, this time defining a random variable x as being
Gaussian distributed with mean 0 and standard deviation 8.
Now we want to use w as the average value of a number of measurements, say 10. So,
we define w = sum(x,10)/10. sum(x,10) produces the sum of ten realizations of the total
error. Since we want the average error over those 10 realizations, we must divide by 10.
Carry out this calculation, say, 1000 times, and plot, as before.
What do we get? Good news. The standard deviation of the error in taking an
average of 10 measurements is 2.5cm. What happens if we take an average over 100
measurements? Try it. The standard deviation of the error in taking an average of 100
measurements is 0.8 cm. So, the more measurements we average over, the smaller the
standard error. If you plot standard error as a function of M, the number of measurements in the average, you will discover that the standard error is proportional to \(1/\sqrt{M}\), just as we deduced. Plot it. See how close it really is.
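A sketch of this second experiment, assuming Gaussian single-measurement errors with standard deviation 8 cm as in the text:

```python
import random
import statistics

# Average M Gaussian errors (mean 0, s.d. 8 cm) and watch the standard
# error of the average shrink like 1/sqrt(M).
random.seed(2)

def std_error_of_average(M, trials=1000):
    averages = [sum(random.gauss(0, 8) for _ in range(M)) / M for _ in range(trials)]
    return statistics.pstdev(averages)

se = {M: std_error_of_average(M) for M in (1, 10, 100)}
print(se)  # roughly 8, 2.5, and 0.8 cm
```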
Mathematical notation and properties of averaging
Using \( \frac{1}{N}\sum_{i=1}^{N} \) and subscript notation is a fairly cumbersome way to indicate average value. We can often work with just the overbar notation itself, like \( \bar{x} \). By knowing a few simple mathematical rules, we can reduce summation expressions directly. In what follows, constants are represented by upper-case letters and random variables are represented by lower-case letters.
Suppose we want the average value of \( p = \frac{1}{N}\sum_{i=1}^{N}(M + e_i) \). Using our overbar notation we would write this as \( \bar{p} = \overline{(M+e)} = \bar{M} + \bar{e} = M + \bar{e} \). When elements are added under an overbar, we can separate the terms into two separate averages, because the average of a sum--in this case the average of M + e--equals the sum of the averages, because of the associativity of addition. And, since M is a constant, \( \bar{M} \) is just M.
There are also two simple rules for multiplication within an overbar. One of them is \( \overline{Me} = M\bar{e} \). That is, the average of a constant times a random variable is the constant times the average of that random variable. The second rule of multiplication is that \( \overline{ef} \), the average of the cross-products, cannot be mathematically separated. But, there are some things we can say about its value. Suppose e and f are random variables, each with zero mean. If e and f are statistically independent of one another, then \( \overline{ef} = 0 \), i.e., the average cross-product of independent random variables with zero means is zero. On the other hand, if e and f have zero means, but are not independent of one another, then \( \overline{ef} \) is the covariance between them. In the special case where e and f are the same variable, we would get the expression \( \overline{e^2} \), which is the variance of e.
This overbar notation is quite standard in those areas of engineering where the
problems contain statistical or random quantities.
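These rules are easy to verify numerically. A sketch, with hypothetical samples e and f standing in for independent zero-mean random variables:

```python
import random
import statistics

# e and f: samples of independent zero-mean random variables; M: a constant.
random.seed(3)
N = 100_000
e = [random.gauss(0, 1) for _ in range(N)]
f = [random.gauss(0, 2) for _ in range(N)]
M = 5.0

mean = statistics.fmean
sum_rule = mean(M + ei for ei in e) - (M + mean(e))  # (M+e)-bar minus M + e-bar
mul_rule = mean(M * ei for ei in e) - M * mean(e)    # (Me)-bar minus M * e-bar
cross = mean(ei * fi for ei, fi in zip(e, f))        # (ef)-bar, should be near 0
square = mean(ei * ei for ei in e)                   # (e^2)-bar, near var(e) = 1
print(sum_rule, mul_rule, cross, square)
```

The first two differences vanish to rounding error; the cross-product average hovers near zero and the squared average near the variance, exactly as the rules predict.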
The Gaussian distribution
Why is it that any time we add together a bunch of random variables, the resulting distribution looks bell-shaped? For example, the score on each question of an exam could be considered a random variable; the total score for the exam is almost always bell-shaped. If we add together 113 measurements, each of which could contain a random error, the resulting total error has a distribution which is bell-shaped. If we count the total number of dots on a throw of 10 dice, the distribution of dots is bell-shaped. What is even more remarkable is that if we add together the outcomes of 100 random variables, the sum is nearly always bell-shaped no matter what the probability distribution is for each of the random variables.
Run some experiments using the random function simulator. Define a random
variable x with flat distribution between 0 and 1. Construct a new random variable w as
the sum of ten realizations of x, i.e., w= sum(x,10). Obtain 1000 realizations. And, what
do you get? A distribution that is very close to Gaussian. Define a random variable y as
a Gaussian with a mean of 0 and a standard deviation of 2, and define w =sum(y,10).
Obtain 1000 realizations. And, what do you get? A distribution that is very close to
Gaussian. In fact, construct any random variable x with as wild a probability distribution
as you can think of. Define w as w=sum(x,10). And w will tend to have a Gaussian
distribution.
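A sketch of the flat-distribution experiment, using Python's standard library in place of the simulator:

```python
import random
import statistics

# x flat on (0,1); w = sum(x, 10); 1000 realizations. The sum should look
# Gaussian with mean 10 * 0.5 = 5 and s.d. sqrt(10/12) ~ 0.91.
random.seed(4)
w = [sum(random.random() for _ in range(10)) for _ in range(1000)]
mw = statistics.mean(w)
sw = statistics.pstdev(w)

# Crude normality check: about 68% of realizations within one s.d. of the mean.
frac = sum(abs(v - mw) < sw for v in w) / len(w)
print(mw, sw, frac)
```

Swapping `random.random()` for any other zero-parameter distribution you like leaves the conclusion unchanged: the sum drifts toward a bell shape.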
This remarkable result has been observed for a hundred years, but only little by little
has the observation been elevated into a mathematical theorem. This theorem is the
Central Limit Theorem. In short, the central limit theorem says that if one takes the
sum of outcomes of a set of random variables (with suitable restrictions), the resulting
sum will have a Gaussian distribution. Actually, there are a number of central limit
theorems, each with its own list of restrictions. And, the topic is still open for research.
A Gaussian or normal probability distribution function has the form:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
Of course, if you integrate over all x, it integrates to 1.0. The most important element of this function is that it is fully defined by the mean μ and the standard deviation σ. None of the other moments need be known to estimate this function. Some other properties: it is symmetric, so we know that all odd moments about the mean are zero. And, since earlier we mentioned the fourth moment about the mean--kurtosis--we will state here that the kurtosis of a Gaussian is 3.0. (That's a genuine factoid.) Sometimes this function will be referred to as N(μ,σ)--a normal distribution with mean μ and standard deviation σ. For example, if you read that a variable x is distributed as N(0,1), you should know what that means.
Although the Gaussian is a fairly complicated function to work with, we do have
its exact functional form. This means we can learn anything we need to know about it,
either through mathematical analysis or by tabulation. Earlier I mentioned that if your
score on an exam was two standard deviations above the average, only about 2% of the
class would have a higher score than you. The reason I could make that statement is
because the total score on an exam, being a sum, is Gaussian distributed. And it has been
tabulated that 95% of the area of a Gaussian distribution lies between +2σ and -2σ from the mean. Since the Gaussian is symmetric, half of the remaining 5% must lie below -2σ and the other half must lie above +2σ.
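These tail areas can be computed directly from the error function; the "95%" figure is really about 95.45%. A minimal sketch:

```python
import math

# Area of a Gaussian within k standard deviations of the mean:
# P(|x - mu| < k*sigma) = erf(k / sqrt(2)).
def central_area(k):
    return math.erf(k / math.sqrt(2))

print(central_area(1))  # ~0.6827: the "68%" band
print(central_area(2))  # ~0.9545: about 2.3% of scores lie above +2 sigma
```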
Now, think about this question: Suppose you measure the length of a soccer field
10 times and take the average value as your best estimate. What’s the distribution of the
average value? Of course, since we’re talking about Gaussian distributions, you’ll say
“Gaussian”. (And, you’d be right.) But, think about what an average is. It’s the sum of a sequence of values divided by a constant. Since it’s a sum, its variation will tend to be Gaussian. And, we noted earlier that the variance of an average is σ²/N, where σ is the standard deviation of each individual element and N is the number of elements in the sum.
The present discussion should also make it a little clearer why we might denote the quality of a measurement D by expressing it as D ± σ, an expected value plus or minus the standard error. Since any measurement is likely to be contaminated by any number of contributing errors, the total error in D is likely to be Gaussian distributed.
That means that a measurement of, say, 100 m ± 10 cm will incorporate the true value 68% of the time—the area under a Gaussian curve between -1σ and +1σ.
The Gaussian curve is certainly the most important one in science and
engineering. Unfortunately, it is not universal. There are a number of random processes
that generate other probability distributions, e.g., a Poisson process.
And life is a little more complicated than we've led you to believe. Recall, we don't know μ and σ. We can only estimate them with \( \bar{x} \) and s². So, with uncertainties in μ and σ, we can't really justify the precise probabilities that we've mentioned. A more thorough study of probability and statistics would show us how to deal with the problem.
ESTIMATION AND VARIANCE
Let’s talk a bit more about estimation. Estimation is the process of trying to
determine properties of a population or event by sampling—whether it’s determining the
distribution of defective screws in a box or determining the length of a soccer field. The
idea is to collect data in a way which will allow us to most closely estimate properties of
the population or event. “Most closely” means with the least variance.
In finding the length of a soccer field, for example, it wasn't so much the sampling strategy that mattered, but rather how to use the data. Recall that the variance of a single measurement was s², but the variance of an average of M measurements was s²/M.
So, we can reduce variance by taking averages. But, there are other ways as well.
In some situations, we have little choice for a sampling strategy. For example, if
we want to find the distribution of outcomes from a roulette wheel, there is not much
choice but to spin and record, spin and record, spin and record.
But, there are some instances where one does have a choice. And the choice one
makes can have a significant effect. Consider the following schematic:
[Schematic: a plot of land divided into a sparsely treed area A and several densely treed areas B.]
This represents a plot of land with trees. Area A is sparsely populated with trees; areas B
are densely populated with trees. The total area is known and is very large, say,
thousands of square kilometers. The task is to estimate how many trees are in the plot.
Since there are, perhaps, tens of millions of trees, counting them is out of the question.
Sampling is the only reasonable approach. First, what will you measure? Since you
know the total area, you can sample tree density ρ, e.g., trees per hectare (100 m x 100 m), and calculate the total number of trees from that. Recall that one of the goals of sampling is to accurately represent the total population in the sample. In this case, the population is the collection of trees. And its distribution would look like this:
f()
A
B

One strategy is to randomly pick locations within the entire area. Then, use those
locations as center points about which you measure 100m x 100m sections. Then count
the trees in each of these sections. Since the locations are chosen randomly, you expect
to sample the correct proportion of A and B areas. So, you can expect to estimate a
representative statistic: average trees per hectare. The difficulty is that the statistic you would deduce would have a fairly high variance. Is there another sampling strategy that would have a lower variance? The answer is yes. The land area consists of two sub-populations: high tree density and low tree density. The sampling strategy we just outlined is based on random sampling over the entire area. It turns out that we can significantly lower the variance of our estimates if we separately sample each of the sub-areas and combine their results. This is the technique of stratified sampling.
Here’s the theory. First, we’ll simplify the problem so we don’t get bogged down
in algebra. Make the assumption that half the area is of type A, and half of type B. And,
let's denote the sample average tree densities of the two areas as ρ̄_A and ρ̄_B. Since the areas are of equal size, we can write the average density of the whole plot as ρ̄ = (ρ̄_A + ρ̄_B)/2. If we define D = ρ̄ - ρ̄_A, then -D = ρ̄ - ρ̄_B. That is, D is the difference between the global mean ρ̄ and each of the individual means ρ̄_A and ρ̄_B. First, we'll do the problem as if we're doing a simple random sample over the entire area. But we'll develop it to acknowledge that there are these two different sub-areas. We write the sample variance of ρ as:

\[ \mathrm{var}(\rho) = \frac{1}{N}\sum_{i=1}^{N} (\rho_i - \bar{\rho})^2 \]

If we reorganize this equation to indicate from which area the samples were taken, we obtain:

\[ \mathrm{var}(\rho) = \frac{1}{N}\left[\sum_{i=1}^{N_A} (\rho_i - \bar{\rho})^2 + \sum_{i=1}^{N_B} (\rho_i - \bar{\rho})^2\right], \quad\text{where } N_A + N_B = N. \]

Now, ρ̄ can be expressed in terms of D and the individual means. So:

\[ \mathrm{var}(\rho) = \frac{1}{N}\left[\sum_{i=1}^{N_A} (\rho_i - \bar{\rho}_A - D)^2 + \sum_{i=1}^{N_B} (\rho_i - \bar{\rho}_B + D)^2\right] \]

Expanding the squared terms, we get:

\[ \mathrm{var}(\rho) = \frac{1}{N}\left[\sum_{i=1}^{N_A} \left((\rho_i - \bar{\rho}_A)^2 - 2D(\rho_i - \bar{\rho}_A) + D^2\right) + \sum_{i=1}^{N_B} \left((\rho_i - \bar{\rho}_B)^2 + 2D(\rho_i - \bar{\rho}_B) + D^2\right)\right] \]

The first sum can be rewritten as:

\[ \frac{N_A}{N}\left[\frac{1}{N_A}\sum_{i=1}^{N_A} (\rho_i - \bar{\rho}_A)^2 - \frac{2D}{N_A}\sum_{i=1}^{N_A} (\rho_i - \bar{\rho}_A) + D^2\right] = \frac{N_A}{N}\left[\mathrm{var}(\rho_A) + D^2\right] \]

What happened to the middle term? It's zero because it's simply the sum of the observations about their mean. The second sum in the previous equation gives a similar result. So, now we can write:

\[ \mathrm{var}(\rho) = \frac{N_A}{N}\mathrm{var}(\rho_A) + \frac{N_B}{N}\mathrm{var}(\rho_B) + D^2 \]

What does this say? It says that the variance of a set of random samples taken from the entire plot consists of three elements: a weighted variance of samples taken from region A, a weighted variance of samples taken from region B, and D², the squared difference between the global mean and each area's mean tree density. So, even if the distribution of trees is very narrow in each of the two types of areas, the sample variance could be large, simply because of the difference in the average tree density between the two areas.
We're not quite done. We're interested in how well we can estimate the total number of trees, and that is characterized by the variance of ρ̄. We know that the variance of an average value is related to the variance of the random variable itself as var(ρ̄) = var(ρ)/N, so we deduce that

\[ \mathrm{var}(\bar{\rho}) = \frac{1}{N}\left[\frac{N_A}{N}\,\mathrm{var}(\rho_A) + \frac{N_B}{N}\,\mathrm{var}(\rho_B) + D^2\right] \]
Rather than carrying out global random sampling, one can carry out stratified random sampling to obtain another estimate of ρ̄, say, ρ̄′. That is, sample the areas A and B separately and obtain values for var(ρ_A) and var(ρ_B). The statistic we want is ρ̄′ = (ρ̄_A + ρ̄_B)/2, and its variance. Since ρ̄_A is unrelated to ρ̄_B, the variances add, and we get

\[ \mathrm{var}(\bar{\rho}') = \frac{1}{4}\left[\mathrm{var}(\bar{\rho}_A) + \mathrm{var}(\bar{\rho}_B)\right] = \frac{1}{4}\left[\frac{\mathrm{var}(\rho_A)}{N_A} + \frac{\mathrm{var}(\rho_B)}{N_B}\right] \]

By evaluating the area as two sub-areas, we've eliminated the D² term. That's our improvement using stratified sampling. If you want to see the value of this technique, try the tree-counting simulation at www.jhu.edu/virtlab/trees/howmany.htm .
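The improvement is easy to see in a simulation. A sketch with hypothetical densities for the sparse area A and the dense areas B, each type covering half the plot:

```python
import random
import statistics

# Hypothetical densities: sparse area A ~ N(5, 2) trees/ha, dense areas
# B ~ N(50, 4); each type covers half the plot. 20 samples per estimate.
random.seed(6)

def sample_A():
    return random.gauss(5, 2)

def sample_B():
    return random.gauss(50, 4)

def global_estimate(n=20):
    # Random locations over the whole plot: half land in A, half in B on average.
    return statistics.fmean(sample_A() if random.random() < 0.5 else sample_B()
                            for _ in range(n))

def stratified_estimate(n=20):
    # Sample each sub-area separately, then average the two sub-area means.
    a = statistics.fmean(sample_A() for _ in range(n // 2))
    b = statistics.fmean(sample_B() for _ in range(n // 2))
    return (a + b) / 2

sd_global = statistics.pstdev(global_estimate() for _ in range(2000))
sd_strat = statistics.pstdev(stratified_estimate() for _ in range(2000))
print(sd_global, sd_strat)  # stratified spread is several times smaller
```

The global estimate's spread is dominated by the D² term (the gap between the two mean densities); stratifying removes it.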
Note: The expressions for var(ρ̄) and var(ρ̄′) are not quite parallel, because in our derivation of var(ρ̄′) we explicitly took into account that sub-areas A and B are of equal size. We did not make that assumption in deriving var(ρ̄). The two expressions would be similar (except for the D² term) if N_A and N_B were taken as N/2, to reflect equal areas and equal sampling of each area.
PROPAGATION OF ERROR
To conclude this section on random processes we’ll discuss their impact on
measurement error. Usually, we will take some measurement m̂i and presume that it
consists of the real value m and some error ei, i.e., m̂i = m + ei. And we presume that the
error ei has zero average. This means that the average value of a measurement m̂i would
be expected to equal the true value m. In other words, the average value of the ei’s is
zero. If the ei’s do not have zero average then our measurement is biased. Or we can
say that it contains a systematic error. Of course, we always try to carry out a
measurement that does not contain a systematic error.
But, just because we can carry out a measurement with an average error of zero,
does not mean that we will be free of systematic error. Depending on how we use that
measurement, it is possible to introduce one. We can illustrate this with a very simple
example. Suppose we want to estimate the area of a square. So, we measure the length
of a side, and we square that value. Just to make sure, we do this a number of times and
take an average:
\[ \hat{A} = \frac{1}{N}\sum_{i=1}^{N}\hat{m}_i^2 = \frac{1}{N}\sum_{i=1}^{N}(m+e_i)^2 = \frac{1}{N}\sum_{i=1}^{N} m^2 + \frac{2m}{N}\sum_{i=1}^{N} e_i + \frac{1}{N}\sum_{i=1}^{N} e_i^2 \]

The right hand side of this equation has three terms. The first is the true area m². The second is the average of a random variable with zero mean. So, it's zero. But the third is the average of a random variable which is squared. So every item in the sum is positive. This term has introduced a bias into our estimate of A, even though our measurement error was not biased. So, we will always overestimate A.
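A quick numerical check of this bias, with a hypothetical true side m = 10 and error standard deviation 0.5:

```python
import random
import statistics

# True side m = 10; measurement error s.d. sigma = 0.5 (hypothetical numbers).
# The average of squared side measurements overshoots m^2 by about sigma^2.
random.seed(7)
m, sigma, N = 10.0, 0.5, 100_000
est = statistics.fmean((m + random.gauss(0, sigma)) ** 2 for _ in range(N))
print(est, m ** 2 + sigma ** 2)  # estimate near 100.25, not 100
```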
The reason that we have created this bias is because we have used the
measurement (and the error) in a non-linear way. This means that the error does not
appear in the calculations just as a first power. Here, in one term, the error is squared. Is
there a way to carry out the measurement so that we don’t introduce such a problem? In
general, no. But, in this case, yes. All we need to do is to take two measurements of the
square: one to measure the height; one to measure the width--even though they're
supposed to be the same length. If we do that, then we can calculate the average area as
\[ \hat{A} = \frac{1}{N}\sum_{i=1}^{N}\hat{h}_i\hat{w}_i = \frac{1}{N}\sum_{i=1}^{N}(h+e_i)(w+\delta_i) = hw + \frac{w}{N}\sum_{i=1}^{N} e_i + \frac{h}{N}\sum_{i=1}^{N}\delta_i + \frac{1}{N}\sum_{i=1}^{N} e_i\delta_i \]

where h and w are the true height and width of the square, and e_i and δ_i are the errors in taking those measurements. Now, if you look at the terms, all but the first average to zero. Terms two and three average to zero because the random errors e_i and δ_i have zero averages. Term four is the average of the cross-products e_iδ_i--errors which are independent of one another. So, the expected value of this product is zero: some terms will be positive; some will be negative.
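The same numerical check as before, now with two independent measurements (same hypothetical numbers), shows the bias disappear:

```python
import random
import statistics

# Same square (true h = w = 10, error s.d. 0.5), but height and width are
# measured independently; the product has no systematic overshoot.
random.seed(8)
h = w = 10.0
sigma, N = 0.5, 100_000
est = statistics.fmean((h + random.gauss(0, sigma)) * (w + random.gauss(0, sigma))
                       for _ in range(N))
print(est)  # near 100, unbiased
```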
An example of a measurement which will always have a biased error is estimating the height of a tree by measuring out some distance X from the tree, then measuring the angle θ to the top of the tree. Then, the height of the tree is h = X tan(θ).

[Figure: a right triangle; the observer stands a distance X from the tree and sights its top at angle θ; the tree's height is h.]
In N measurements, we would obtain an average height of the tree as:

\[ \bar{h} = \frac{1}{N}\sum_{i=1}^{N}\hat{X}_i\tan(\hat{\theta}_i) = \frac{1}{N}\sum_{i=1}^{N}(X+e_i)\tan(\theta+\delta_i) = \frac{1}{N}\sum_{i=1}^{N} X\tan(\theta+\delta_i) + \frac{1}{N}\sum_{i=1}^{N} e_i\tan(\theta+\delta_i) \]

Here, the mathematics begins to get sticky. On average, the second term is zero because it is the product of one random error times a function of another, independent, random error. But the average of the first term is not X tan(θ). Its value will depend on the spread of the angle error δ. And, since there is no closed-form expression which separates δ from the other variables, we can't even calculate its effect. However, if you're curious, you can experiment with the random function simulator and see for yourself. Especially when θ is large--like greater than 60°--the nonlinearity in the problem yields quite biased results.
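A sketch of that experiment, with hypothetical values X = 50 m and angle errors with a 2-degree standard deviation:

```python
import math
import random
import statistics

# Hypothetical setup: X = 50 m, angle error s.d. 2 degrees; no distance error
# is included here, since that term averages out anyway.
# Bias = mean estimate minus the true height X * tan(theta).
random.seed(9)
X = 50.0
sigma = math.radians(2)
N = 100_000

def mean_height(theta_deg):
    t = math.radians(theta_deg)
    return statistics.fmean(X * math.tan(t + random.gauss(0, sigma))
                            for _ in range(N))

biases = {theta: mean_height(theta) - X * math.tan(math.radians(theta))
          for theta in (30, 60, 75)}
print(biases)  # the overestimate grows rapidly as theta increases
```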
The Calculus of errors
There's another way to estimate the effect of measurement error that has nothing
to do with probability or random processes. It involves the way a function F(x,y)
changes as x and y change--the essential ideas of calculus. We'll begin with a problem.
Suppose we want to calculate the volume of a structure that consists of a cone resting
upon a rectangular parallelepiped. The total volume of this structure is:

\[ V = \frac{1}{3}\pi R^2 H_c + LWH \]
We will not measure V directly, but rather we'll calculate V by taking measurements of R,
Hc, L, W, and H. But suppose these measurements are not perfectly accurate. So the
question is how much error will we introduce into our calculation of V by using
inaccurate values for the measured variables?
If each measurement is in error, then the calculated volume would consist of the
true volume V plus an error v. The relation between the error-borne measurements and
the resulting calculated volume would be:
\[ V + v = \frac{1}{3}\pi (R+r)^2 (H_c+h_c) + (L+l)(W+w)(H+h) \]
Expanding this equation, then subtracting out the equation for V, we get

\[ v = \frac{1}{3}\pi\left[2RH_c r + R^2 h_c + H_c r^2 + 2Rrh_c + h_c r^2\right] + HWl + HLw + LWh + Hlw + Lhw + Whl + lwh \]

First, let's look at v statistically, i.e., what would be the average value of the error v if measurements were taken many times and averaged. Using overbar notation, we obtain

\[ \bar{v} = \frac{1}{3}\pi\left[2RH_c \bar{r} + R^2 \bar{h}_c + H_c \overline{r^2} + 2R\,\overline{rh_c} + \overline{h_c r^2}\right] + HW\bar{l} + HL\bar{w} + LW\bar{h} + H\,\overline{lw} + L\,\overline{hw} + W\,\overline{hl} + \overline{lwh} \]

If h_c, r, l, w, and h are all independent random variables with zero mean, then we get

\[ \bar{v} = \frac{1}{3}\pi H_c \overline{r^2}. \]

All the other terms average to zero.
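This average error can be checked by Monte Carlo. The dimensions below are hypothetical, and the errors are made deliberately large (s.d. 0.5 on every measurement) so the small bias stands out from the sampling noise; only the r-squared term survives the averaging:

```python
import math
import random
import statistics

# Hypothetical dimensions; every measurement error is Gaussian with s.d. 0.5.
random.seed(10)
R, Hc, L, W, H = 2.0, 3.0, 4.0, 5.0, 6.0
s = 0.5

def volume(R_, Hc_, L_, W_, H_):
    return math.pi * R_ ** 2 * Hc_ / 3 + L_ * W_ * H_

def g():
    return random.gauss(0, s)

V = volume(R, Hc, L, W, H)
N = 100_000
mean_err = statistics.fmean(
    volume(R + g(), Hc + g(), L + g(), W + g(), H + g()) - V for _ in range(N)
)
predicted = math.pi * Hc * s ** 2 / 3  # (1/3)*pi*Hc*E[r^2], with E[r^2] = s^2
print(mean_err, predicted)  # both near 0.785
```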
Thus, we have deduced an average expected error in v. But, suppose we can't take
a lot of sample measurements, and we would like to know what is the "worst case" error.
That's fairly simple. Suppose we can estimate the maximum possible error on each
measurement. Let these be labeled rmax, lmax, hc max, hmax, and wmax. Then
\[ v_{max} = \frac{1}{3}\pi\left[2RH_c r_{max} + R^2 h_{c\,max} + H_c r_{max}^2 + 2Rr_{max}h_{c\,max} + h_{c\,max} r_{max}^2\right] + HWl_{max} + HLw_{max} + LWh_{max} + Hl_{max}w_{max} + Lh_{max}w_{max} + Wh_{max}l_{max} + l_{max}h_{max}w_{max} \]
A simplification of this is to assume that rmax, lmax, hc max, hmax, and wmax are all very small
compared to R, L, Hc, H, and W. Then terms containing two or more of these maximum
errors will be much smaller than terms containing only one. Consequently, if we ignore
these smaller terms, vmax can be approximated as
\[ v_{max} \approx \frac{1}{3}\pi\left(2RH_c r_{max} + R^2 h_{c\,max}\right) + LWh_{max} + LHw_{max} + WHl_{max} \]
There's a reason for making this simplification, even though it's only an approximation to
the maximum error in v. The reason is that this simplified expression is easily deduced
for any combination of measurements. This is the result that one would obtain by taking
the total differential of the function \( V(R, H_c, L, W, H) = \frac{1}{3}\pi R^2 H_c + LWH \). The total differential is defined like this: If F is a differentiable function depending on n variables \( x_1, x_2, \ldots, x_n \), then infinitesimal variations in F are determined by infinitesimal variations in the \( x_i \)s as:

\[ dF(x_1, x_2, \ldots, x_n) = \frac{\partial F}{\partial x_1}\,dx_1 + \frac{\partial F}{\partial x_2}\,dx_2 + \cdots + \frac{\partial F}{\partial x_n}\,dx_n \]

dF is called the “total differential”. Our variations r_max, l_max, h_c max, h_max, and w_max are not infinitesimal. But, if they're quite small, then the total differential is a pretty good approximation to the total error v_max (that is, dF in the notation immediately above).
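A sketch comparing the exact worst-case error with its total-differential approximation, for hypothetical dimensions and maximum errors:

```python
import math

# Hypothetical dimensions and maximum measurement errors.
R, Hc, L, W, H = 2.0, 3.0, 4.0, 5.0, 6.0
r, hc, l, w, h = 0.01, 0.01, 0.01, 0.01, 0.01

def volume(R_, Hc_, L_, W_, H_):
    return math.pi * R_ ** 2 * Hc_ / 3 + L_ * W_ * H_

# Exact worst case: every measurement off by its maximum, all with the same sign.
exact = volume(R + r, Hc + hc, L + l, W + w, H + h) - volume(R, Hc, L, W, H)

# Total differential: partial derivatives of V times the maximum errors.
approx = (2 * math.pi * R * Hc / 3) * r + (math.pi * R ** 2 / 3) * hc \
         + W * H * l + L * H * w + L * W * h
print(exact, approx)  # nearly equal; exact is slightly larger
```

The exact value exceeds the approximation only by the higher-order terms that were dropped, which here amount to a fraction of a percent.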
Notice that the value for vmax really is an error that you would never expect to
have. Not only are all of the measurement errors assumed to be at their maximum, but all
of them are contributing with the same sign. That is, there are no canceling errors. If the
measurement errors are truly random variables with zero mean, the expected error in a
calculated value of V would be much less. Nevertheless, estimating error using this
calculus can be extremely valuable, especially if you must absolutely determine some
parameter within a specified error.
Another way of representing this maximum error is by percentages. In our example, if the total volume V is separated into its constituent pieces V = V_c + V_p, where the subscripts c and p refer to the cone and parallelepiped, respectively, then the above equation can be rewritten as

\[ \frac{v_{max}}{V} = \frac{V_c}{V}\left(2\frac{r_{max}}{R} + \frac{h_{c\,max}}{H_c}\right) + \frac{V_p}{V}\left(\frac{h_{max}}{H} + \frac{w_{max}}{W} + \frac{l_{max}}{L}\right) \]

This equation shows that percentage errors in the parallelepiped measurements, e.g., h_max/H, are linearly additive with weight V_p, whereas a percentage error in the measurement of R is doubly additive with weight V_c.
This technique is described as the “total differential” or “calculus of errors” in
elementary calculus books.
Why do we care about all this stuff: propagation of errors, lower variances of
estimates, Gaussian distribution functions? Are these ideas just academic curiosities? Or
are they actually important to engineers? The answer is they are actually important—
really important.
Let's take the case of counting trees—something that would appear to be a make-work exercise. Let's put it into an engineering context. Suppose you are an engineer
working for a lumber company which is trying to decide how much money it should offer
for 100,000 acres of forested land—we’re talking about tens of millions of dollars. The
land is valuable to the lumber company for its timber, and that depends on how many
trees are on the plot. If more money is offered for the land than its timber value, the
company loses money. If less money is offered than its timber value, then another
company is likely to outbid yours. So you need to make the right offer—not too big, not
too small. Of course you can actually count the trees. Then you could make a bid with
almost perfect accuracy. But how much time and money would it take for you to actually
count the trees on 100,000 acres? Far too much. So, the engineering solution is to make
an estimate based on sampling. The more samples, the more accurate the estimate, the
more the costs. Remember, you’re in competition with other companies bidding for the
land. What you need to do is optimize your estimation procedure—obtain the most
accurate estimate at the least cost. In this example, that might be carried out through
stratified sampling. What we covered above was only an introduction to the concept of
stratified sampling. In reality, there are whole books on the subject: how to get the most information, most accurately, and at the least cost. There's that universal engineering parameter again: $$$.
Uncertainty is an integral part of engineering and science. Understand it, because
every measurement, every experiment, every real situation has uncertainty. Understand
it, because it’s central to the design of earthquake-resistant buildings, safety factors for
airplane wings, and production strategies in a market economy. Understand it, because it
represents those things that an engineer does not know about a given physical situation—
the details of the process, the variability of the environment, the error in the
measurement. Sometimes engineers don't want to know any more, because good statistics can give them enough information to proceed. But engineers need to know how
statistics characterize uncertainty. Only then can the engineer use it as a tool to solve his
problems.
4/23/01