Chapter 1
Probability, Statistics, and Reality
If your experiment needs statistics, you ought to have done a better experiment.—Ernest
Rutherford (attrib.)
A branch of Science that must depend heavily on statistical inference is therefore in grave
peril of error. Without the most rigorous and critical analysis of its concepts and methods it
may fall to worshipping very strange gods.—John Ziman (1968), Public Knowledge: an Essay Concerning the Social Dimension of Science.
It is commonly only in their 4th or 5th [undergraduate] years that geophysics students first
come into contact with statistical reality, usually in the form of some practical project or
their MSc thesis. Then their fate is suddenly made manifest to them: they need statistics
to analyze their data. Somehow, by hook, crook, or cookery book, they acquire a smattering
of statistical jargon sufficient to appease the sadistic examiner with an eye for the missing
confidence interval or misunderstood significance level. Small wonder if they thereafter
view statisticians with reserve, as at best the curators of an elaborate maze through which
they were compelled to wander, and at worst the cause of their downfall.—David VereJones (1975), Stochastic models for earthquake sequences, Geophys. J. R. Astron. Soc.42,
811-826.
1. The Aim of the Exercise
The quotations above attempt to explain why this course exists. First of all,
geophysics (like astronomy) is an observational science, in which we cannot do
experiments—so Rutherford’s dictum is unhelpful. We have to understand the magnetic field of the Earth as it is, without recourse to manipulation in the laboratory.
While statistical methods are not confined to the observational sciences—nowadays even nuclear physicists use them¹—they are so central to doing geophysics that they
are a key part of a geophysicist’s training.
The challenge of such training is that, as our second and third quotations indicate, statistics is difficult to learn and to do well. This difficulty is not just because
of the mathematics involved, but also because we often reason poorly about random
events, to the benefit of casinos and the frustration of students. We hope this course
will help, though there is really no substitute for lots of experience, careful thought,
and a willingness to subject one’s intuition to careful checks.
In a one-quarter course we can only cover some aspects of the subject, though
we hope to discuss much of what is needed in geophysics—which is not necessarily
the same as the standard set of subjects in most introductory statistics books. We
postpone to the second part of the course the very important case of data arranged in
time or space. We neither seek nor shirk mathematics, though we only rarely
attempt rigor; we shall state a number of theorems, but prove few. Familiarity with
calculus, at least through multivariate calculus, and with vectors, vector spaces, and
matrices, should be adequate mathematical background.
¹ Actually, this has been true for as long as there has been nuclear physics: for example, using the statistics of a Poisson process (which we will describe below) to show that radioactive decay occurs randomly. See E. Rutherford and H. Geiger, The probability variations in the distribution of alpha-particles, Phil. Mag., Ser. 6, 20, 698-707 (1910).
To give some of the flavor of statistical reasoning, we offer some examples of
how it is applied—including one case of it being applied badly. While we use a few
terms from probability and statistics in these examples, they are names that you are
likely to already have some familiarity with; in any case we will define them more
precisely in later chapters.
Figure 1.1
1.1. What Kind of Problem Are We Trying to Solve?
To offer something more concrete, we begin with a couple of examples of data,
which show two different situations in which statistical reasoning can be needed.
The left-hand plot in Figure 1.1 summarizes two years of measurements between
two geodetic markers about 50 meters apart, using the satellite Global Positioning
System (GPS). We plot these data in the familiar form of a histogram, which shows
that the data tend to clump around a value of 50327 mm, but are spread over a range of about ±1 mm. In this case the usual frame for looking at the problem is what we might call the ''physics-lab'' one: there is a single, true value of the distance, and the reason we get numbers scattered around this value is errors in the measurements. Statistics is then just a necessary tool for dealing with something we would rather did not exist in the first place.
This frame is fine in the laboratory, but often not useful outside it. As an example, consider the right-hand plot in Figure 1.1, showing a histogram of the lengths of segments of oceanic spreading centers (each terminated by some other kind of plate boundary). In this case there is no ideal ''true'' length, and also (at this scale) not much measurement error.² Rather, the way in which the numbers are distributed is itself the information, which we treat using statistical methods.
What is common to these examples is that we have numbers that are in some way scattered, or varying. We use probability theory, and the statistical methods that flow from it, because this is the mathematical construct best suited to dealing with such variations. More precisely, our aim, given a set of data, is to develop a mathematical model for it: such models are called probability models or stochastic models.

² Though we may wonder, reasonably, about the effect of sampling: a long segment may be one that doesn't have many ship tracks crossing it.
Table 1

(2, 7)   (8, 3)   (6, 3)   (9, 3)   (8, 11)  (2, 7)   (9, 7)   (7, 8)   (6, 8)   (5, 7)
(2, 10)  (4, 9)   (8, 5)   (5, 8)   (5, 5)   (7, 7)   (5, 5)   (5, 5)   (7, 5)   (11, 4)
(5, 5)   (8, 7)   (10, 8)  (4, 4)   (8, 8)   (9, 5)   (8, 5)   (6, 6)   (2, 6)   (10, 7)
(7, 4)   (5, 6)   (6, 5)   (8, 7)   (9, 7)   (5, 4)   (5, 6)   (8, 8)   (2, 5)   (5, 5)
(7, 9)   (5, 7)   (3, 4)   (6, 3)   (8, 6)   (12, 6)  (8, 5)   (7, 5)   (4, 6)   (4, 4)
(4, 8)   (9, 7)   (10, 3)  (4, 6)   (7, 8)   (7, 6)   (5, 6)   (8, 7)   (3, 8)   (4, 8)
(10, 5)  (6, 7)   (9, 6)   (5, 10)  (10, 5)  (2, 6)   (4, 6)   (10, 5)  (6, 7)   (7, 4)
(6, 8)   (8, 11)  (5, 6)   (5, 8)   (7, 5)   (6, 5)   (4, 3)   (4, 6)   (6, 5)   (11, 5)
(7, 9)   (3, 5)   (10, 7)  (6, 8)   (8, 6)   (6, 6)   (8, 8)   (9, 5)   (7, 6)   (7, 9)
(5, 7)   (7, 6)   (3, 5)   (9, 8)   (7, 6)   (2, 9)   (9, 7)   (4, 5)   (7, 4)   (2, 6)
[Figure 1.2: dithered scatter plot of 1000 simulated data pairs, with the relative numbers of points at each value of x1 and x2 shown in plots at the right.]
1.1.1. A Toy Example of a Stochastic Model
To show such a model, it is best to start with a made-up example. Table 1 shows 100 pairs of data (x1, x2) that might be collected from repeating an experiment; clearly the numbers vary considerably, though (by design, in this case) they are all integers. The first thing to do with any dataset should be to plot it: humans are visual, not (naturally) numerical, and it is as well to make good use of this. Figure 1.2 shows a plot of 1000 data pairs. Since the values are all integers, we have moved each point away from its true value by a small random amount so we can see the relative number at each value, a plotting trick known as dithering. (We will discuss other good and bad plotting methods later on.) The figure also shows the relative number of points with particular values of x1 and x2, in plots on the right.
Clearly there is a pattern to the relative occurrences of the different values.
In fact, these data were generated in the following way: take three numbers, each with an equal chance of being 1, 2, 3, or 4, and add them up to get the value of x1; and take two numbers, each with an equal chance of being 1, 2, 3, 4, 5, or 6, and add them up to get the value of x2. This could be done by rolling tetrahedral and cubical dice, though in this case it was done by a software simulation.
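For readers who like to see this concretely, here is a minimal sketch of such a simulation (in Python, which is our choice and not part of the original notes; the variable names and the dither width of ±0.3 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, so the sketch is reproducible
n = 1000                         # number of simulated pairs, as in the figure

# x1: the sum of three numbers, each equally likely to be 1, 2, 3, or 4
x1 = rng.integers(1, 5, size=(n, 3)).sum(axis=1)
# x2: the sum of two numbers, each equally likely to be 1 through 6
x2 = rng.integers(1, 7, size=(n, 2)).sum(axis=1)

# Dithering, for plotting only: displace each point by a small random amount
# so that coincident integer pairs appear as distinct clusters of dots.
x1_plot = x1 + rng.uniform(-0.3, 0.3, size=n)
x2_plot = x2 + rng.uniform(-0.3, 0.3, size=n)

# The relative number of points at each value of x1 (and similarly for x2)
values, counts = np.unique(x1, return_counts=True)
for v, c in zip(values, counts):
    print(f"x1 = {int(v):2d}: {int(c):4d} of {n}")
```

Passing x1_plot and x2_plot to a scatter plot gives a picture like the dithered plot in Figure 1.2; the counts printed at the end correspond to the side plots.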
This brief description, which is enough to explain all these data, is an example
of a stochastic model: a mathematical construction that we develop to explain some
set of observations. This model includes, as a basic element, some aspects that are
‘‘random’’, meaning that they do not have definite values, but only different chances
of taking on different values. In the model we have set out, we start with elements
that have equal chances of taking on integer values within certain ranges, and combine them to get our final result. These elements are called random variables, and
a good part of the application of stochastic models comes from deciding what parts of
your data are best modeled by variables of this type, and which by variables that
take on only one value. The mathematics of random variables is part of probability
theory, which like any mathematical theory is the logical development of the behavior of certain kinds of mathematical entities. Random variables are such an entity,
designed (as their name suggests) to mimic outcomes that in the real world are the
result of ‘‘chance’’.
[Figure 1.3: left, histogram of the distance measurements; right, histogram of values distributed according to the fitted model (1).]
1.2. Distance Measurement (I): Point Estimation
We now return to the first example data set, shown again in histogram form on the left plot in Figure 1.3. To go beyond simply noting that the data are clumped around a value, we have to assume that we can represent the data through a mathematical model that includes random variables. In adopting such a stochastic model we are asserting that the data we have are at least in part the result of a random process.
This is a large conceptual leap—and like most applications of mathematical models
to the real world, not one that is simple to justify. You should always remember that
assuming a relationship between data and a model that describes them should be
done with care, for it is, in its own way, chancy: a stochastic model might or might
not be appropriate.
The specific mathematical model we adopt is that the length can be modeled as
a random variable X ; this variable is described by the equation
P(X < x) = \frac{1}{\sigma (2\pi)^{1/2}} \int_{-\infty}^{x} \exp\left[ -\frac{1}{2} \left( \frac{u - m}{\sigma} \right)^{2} \right] du \qquad (1)
The right-hand side of this equation should be familiar: it describes a function of x
that depends on two parameters m and σ , which have definite (but at this point
unknown) values. The left-hand side is where the novelty lies, both in notation and
in concept. The expression means, ‘‘The probability that the random variable X is
less than a value x.’’
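As a computational aside (our sketch, not part of the original notes): the right-hand side of (1) is the cumulative distribution function of a normal (Gaussian) random variable, and can be evaluated with the error function rather than by numerical integration. The parameter values below are simply illustrative.

```python
from math import erf, sqrt

def normal_cdf(x, m, sigma):
    """P(X < x) for the model of equation (1): a normal random variable
    with parameters m and sigma."""
    return 0.5 * (1.0 + erf((x - m) / (sigma * sqrt(2.0))))

# Illustrative values, of the order of the estimates found later in the chapter:
# the probability that a measurement falls below 50327.0 mm.
print(normal_cdf(50327.0, m=50326.8, sigma=0.18))   # about 0.87
```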
But what do we mean by probability? We will describe its mathematical properties in the next chapter. What it ‘‘really means’’, in the sense of what it corresponds to in the real world, is a deep, and deeply controversial, question. A common
‘‘working definition’’ is that it is the fraction of times that we get a particular result
if we repeat something: the frequency of occurrence, as in the canonical example
that the probability of getting heads on tossing a fair coin is 0.5. While there are
good objections to this interpretation, there are good objections to all the other ones
as well. Certainly the frequency interpretation seems like the most straightforward
way to connect equation (1) to the data summarized in Figure 1.2.
So, we have data, and an assumed probability model for them. What comes
next is statistics: using the model to draw conclusions from the data. Probability
theory is mathematics, with its own rules—albeit rules inspired by real-world examples. Statistics, while it makes use of probability theory, is the activity of applying it
to actual data, and the drawing of actual conclusions. Statistics can perhaps best be
thought of as a branch of applied mathematics, though for historical reasons nobody
uses this term for it.
Given our model, as expressed by equation (1), and given these data, an immediate statistical question is, what are the ‘‘best’’ values for m and σ ? Assuming that
(1) in fact correctly models these data, the formulas for finding the ‘‘best’’ m and σ
are
\hat{m} = \frac{1}{N} \sum_{i=1}^{N} d_i, \qquad \hat{\sigma}^{2} = \frac{1}{N-1} \sum_{i=1}^{N} (d_i - \hat{m})^{2} \qquad (2)
where there are N data d_1, d_2, ..., d_N. Applying these expressions to the data from which Figure 1.3 was drawn, we get m̂ = 50326.795 mm and σ̂ = 0.18 mm. The parameters m and σ in equation (1) are called the mean and standard deviation; the values m̂ and σ̂ that we get from the data using the formulae in (2) are called estimates of these. (As is standard in statistics, we use a superposed hat symbol to denote an estimate of a variable.) Why these estimates might be termed the ''best'' ones is
something we will get to; this is part of the area of statistics known, unsurprisingly,
as estimation theory. Finding the best values of parameters such as m and σ is
called point estimation. As we describe in the next subsection, this is the answer
to only one kind of statistical question; perhaps the commonest, but not always the
most relevant.
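As a computational footnote (again our sketch, with made-up numbers standing in for the real measurements), the estimates in (2) are simply the sample mean and the sample variance with the N − 1 divisor:

```python
import numpy as np

# Made-up distance measurements in mm, standing in for the GPS data
d = np.array([50326.9, 50326.7, 50326.8, 50327.0, 50326.6, 50326.8])
N = len(d)

m_hat = d.sum() / N                               # first formula in (2)
sigma2_hat = ((d - m_hat) ** 2).sum() / (N - 1)   # second formula in (2)
print(m_hat, np.sqrt(sigma2_hat))

# numpy's built-ins give the same numbers; ddof=1 selects the N - 1 divisor
print(d.mean(), d.std(ddof=1))
```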
The interpretation of these data as a true value contaminated by errors leads to a formally similar but philosophically quite different statement: that the data can be modeled as
d_i = t + e_i \qquad (3)
where t is the true value of the distance, and the e_i are the ''errors''; we model the errors as being a random variable E, such that
P (E < e) =
e
 1  u 2 
exp
− 2    du
1
σ 
σ (2π ) 2 −∞

1
∫
(4)
Mathematically, (3) and (4) are equivalent to (1) if we have t = m, which means that
if we assume (3) and (4) our ‘‘best estimate’’ of t is m̂. This is of course the bit of
statistics scientists do most often: given a collection of repeated measurements, we
form the average, and say the answer is the true result we are trying to find. There
is nothing really wrong in this—until we start to ask questions such as ''How uncertain is t?'' Such a question, while tempting, is nonsense: if t is the true value, it cannot be uncertain; mathematically it is some definite number, not (like the e_i) a random kind
of thing. We will therefore avoid this approach, and instead will try to pose problems in terms of equations such as (1), in which all the non-random variables are, as
it were, on the same footing: they are all parameters in a formula describing a
probability.
We can use our estimates m̂ and σ̂ to find how the data would be distributed if
they actually followed the model (1); the right side of Figure 1.3 shows the result in
the same histogram form as the left side. You might wish to argue, looking at these
plots, that the model is in fact not right, because the data show a sharper peak, just
to the left of zero, than the model; and the model does not show the few points at
large distances that are evident in the data. As it turns out, these are quite valid
objections, which lead, among other things, to something called robust estimation:
how to make point estimates that are not affected by small amounts of outlying
data. But the underlying question, of deciding if the model we are using is a valid
one or not, is quite a different issue from estimating model parameters; so we discuss it in a new section.
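One way to make this comparison concrete (a sketch under our own assumptions: the bin width and the total number of measurements N below are placeholders, not the values used for Figure 1.3) is to compute the number of data the fitted model would put in each histogram bin and set these beside the observed counts:

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(x, m, sigma):
    # P(X < x) for the normal model of equation (1)
    return 0.5 * (1.0 + erf((x - m) / (sigma * sqrt(2.0))))

m_hat, sigma_hat = 50326.795, 0.18          # the estimates quoted above
N = 700                                     # placeholder for the actual number of data
edges = np.arange(50326.2, 50327.41, 0.1)   # bin edges, 0.1 mm wide (our choice)

# Expected count in each bin if model (1), with the fitted parameters, were right
expected = [N * (normal_cdf(b, m_hat, sigma_hat) - normal_cdf(a, m_hat, sigma_hat))
            for a, b in zip(edges[:-1], edges[1:])]

for a, e in zip(edges[:-1], expected):
    print(f"bin starting at {a:.1f} mm: expected {e:6.1f}")
```

Comparing such expected counts with the observed histogram is, informally, the check described in this paragraph.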
1.3. Another Example: Hypothesis Testing
Figure 1.4 plots another dataset: the top two traces show the reversals of the
Earth’s magnetic field, as reconstructed mostly from marine magnetic anomalies,
over the last 159 Myr. We plot this, as is conventional, as a kind of square wave,
with positive values meaning that the dipole has the same direction as now, and negative ones that it was reversed. This is an example of a geophysical time series of a
particular kind, namely a point process, so called because such a time series is
defined by points in time at which something happens: a field reversal, an earthquake, or a disk crash. The simplest probability model for this is called a Poisson
process: in any given unit of time (we suppose for simplicity that time is divided
into units) there is a probability p that the event happens. By this specification, p is
always the same at any date, and it does not matter if it has been a long time or a
short time since the last event. Such a model, which doesn’t know what time it is,
does not describe the actual times of the reversals. What it does describe is the
intervals between successive events, which we plot on the lower left of Figure 1.4, again as a histogram. We use the log of the time because the intervals have to be positive—and also because they are spread out over a large range, from 2 × 10⁴ to 4 × 10⁷ years. The longest interval (the ''Cretaceous Normal Superchron'') ran from
124 Myr ago to 83 Myr ago. Looking at the time series, or the distribution, naturally
raises a question: is this interval somehow ‘‘unusual’’ compared to the others? If it
is, we might wish to argue that the core dynamo (the source of the field) changed its
behavior during this time.
[Figure 1.4: the geomagnetic reversal time series (top), and histograms of the intervals between reversals, as observed (lower left) and as given by the Poisson model (lower right).]
In statistics, questions of this type are called hypothesis testing, because we
can only say that a behavior is unusual in comparison to some particular hypothesis.
In statistics, the meaning of ‘‘hypothesis’’ is more limited than in general usage: it
refers to a particular probability model. Because we are dealing with probability, we
can never say that something absolutely could not have occurred, only that it is very
improbable. The mode of reasoning used for such tests is somewhat difficult to get
used to because it seems backwards from the way we usually reason: rather than
draw a positive conclusion, we look for a negative one, by first creating a model that
is the opposite of what we want to show. This is called the null hypothesis; we then show that, if this hypothesis were true, data like those observed would be very unlikely.
For this example, a null hypothesis might be the probability model that there is
an equal chance of a reversal in any 40,000 year interval—we choose this time
because shorter reversals do not seem to be common, either because they do not happen or because they are not well recorded in marine magnetic anomalies. We call
this chance (or probability) p, and we assume it is the same over all times: these
assumptions combine to form our hypothesis, or (again) what we could call a stochastic model. If we estimate this p using the distribution of intervals, we find that it is
(roughly) 0.1. This gives the histogram of inter-reversal times shown at the lower
right of Figure 1.4. This is actually not a very good fit to the data, but we can take it as a first approximation. Now, 40 Myr contains 1000 intervals of 40 kyr; the probability of there not being a reversal in this many consecutive intervals is (1 − p)¹⁰⁰⁰, or about 2 × 10⁻⁴⁶. So, over the 160 Myr of data we have (four 40-Myr spans), we would expect to get such a long reversal-less span about one time out of 10⁴⁵—which we may regard as so unlikely that we can reject the idea that p does not change over time. Of course, we only have the one example, so it is (again) arguable whether or not saying ''one time out of N'' is the right way to phrase things; but given how small the probability is in this case, we may feel justified in any case in rejecting the hypothesis we started out with. But we should always remember that our decision to reject depends on a judgment about how small a probability we are willing to tolerate—and this judgment is, in the end, arbitrary.³
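The arithmetic behind these numbers is easy to check; a minimal sketch (ours), using the rough value p ≈ 0.1 per 40 kyr interval quoted above:

```python
p = 0.1            # rough chance of a reversal in any one 40 kyr interval
n_intervals = 1000 # 40 Myr expressed as a number of consecutive 40 kyr intervals

# Probability of no reversal in 1000 consecutive intervals, if p never changes
p_quiet = (1.0 - p) ** n_intervals
print(p_quiet)          # about 2e-46

# The 160 Myr record contains four non-overlapping 40 Myr spans, so such a
# reversal-free span should turn up only about one time in 10**45
print(4 * p_quiet)
```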
1.4. Distance Measurement (II): Error Bounds
Our discussion in Section 1.2 of the stochastic model for the distance measurement was about the ''best estimate'' of the parameter m. But this is not the only
question we could ask relating to m; we might reasonably also ask how well we
think we know it—that is, in the conventional phrasing, how large the error of m̂ is.
This question actually is itself a hypothesis test, or rather a whole series of such
tests, for each of which the hypothesis is ‘‘m is really equal to the value ____; given
our model, is this compatible (that is, likely) given the data observed?’’ If the
assumed value were 50320 mm, or 50330 mm, the answer would be ''not very likely''; if the assumed value were 50326.8 mm, the answer would be ''quite possible''. We can in fact
work out what this series of tests would give us for any assumed value of m; then we
choose a probability value corresponding to ‘‘not very likely’’, and say that any value
of m that gives a higher value from the hypothesis test is acceptable. This gives us,
not just a value for m, but, what is usually more valuable, a range for it; the ends of this range are the confidence limits.
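To make this recipe concrete, here is a sketch in the spirit of the description above (our own construction, under simplifying assumptions: a two-sided test based on the normal model (1), a cutoff of 0.05 for ''not very likely'', and made-up values of m̂, σ̂, and N). It scans trial values of m and keeps those that the test does not reject:

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Assumed summaries of the data (illustrative numbers only)
m_hat, sigma_hat, N = 50326.795, 0.18, 700
alpha = 0.05                      # our chosen meaning of "not very likely"

accepted = []
for m_trial in np.arange(50326.0, 50327.6, 0.001):
    # Hypothesis: m is really equal to m_trial.  Under the model, m_hat should
    # then lie within a few standard errors (sigma / sqrt(N)) of m_trial.
    z = (m_hat - m_trial) / (sigma_hat / sqrt(N))
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))   # two-sided probability
    if p_value > alpha:                          # not rejected: keep this value
        accepted.append(m_trial)

print(min(accepted), max(accepted))   # the confidence limits for m
```

The interval printed at the end is the range between the confidence limits described above.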
1.5. Predicting Earthquakes: A Model Misapplied
We close with an example of misapplied statistics leading to a false conclusion—and an expensively false one at that. Our example is, fittingly, one of earthquake prediction, a field that has more examples of inept statistical reasoning (as well as blissful unawareness of the need for such reasoning) than any other branch of geophysics.

³ A more thorough analysis (C. Constable, On rates of occurrence of geomagnetic reversals, Phys. Earth Planet. Inter., 118, 181-193 (2000)) shows that the Poisson model can be used, provided we make the probability time-dependent. A model consistent with the data has this probability diminishing to a very low value during the Cretaceous Superchron, and then increasing to the present.
The particular case was that of the repeating earthquakes at Parkfield, a very
small settlement on the San Andreas fault in Central California. Earthquakes happened there in 1901, 1922, 1934 and 1966; seismometer records showed the last
three shocks to have been very similar. Nineteenth-century reports of felt shaking
suggested earthquakes at Parkfield in 1857 and 1881. This sequence of dates could
be taken to imply a somewhat regular repetition of events.
You might think, from what we discussed in the previous section, that the relevant data would be the times between earthquakes—and you would be right. However, the actual analysis took a different approach, shown in the left panel of Figure
1.5: the event numbers were plotted against the date of the earthquake, and a
straight line fit to these points. The figure shows two fits, one including the 1934
event and the other omitting it as anomalous. If the 1934 event is included, the
straight line reaches event number 7 in 1983 (the left-hand vertical line); this was
known not to have happened when the analysis was done in 1984. Excluding the
1934 event yielded a predicted time for event 7 of 1988.1 ± 4.5 years.
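The numbers in this paragraph are easy to reproduce approximately; the sketch below (ours, using an ordinary least-squares line, which is close to but not identical with the published analysis) fits date against event number and extrapolates to event 7:

```python
import numpy as np

dates = np.array([1857, 1881, 1901, 1922, 1934, 1966], dtype=float)
numbers = np.arange(1, 7, dtype=float)          # event numbers 1 through 6

def predict_event7(n, d):
    # Least-squares straight line date = a*n + b, extrapolated to event 7
    a, b = np.polyfit(n, d, 1)
    return 7 * a + b

print(predict_event7(numbers, dates))           # about 1983, with 1934 included

keep = numbers != 5                             # drop the "anomalous" 1934 event
print(predict_event7(numbers[keep], dates[keep]))   # about 1987-1988
```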
Partly because of this prediction, which seemed to promise a payoff in the near
future, a massive monitoring effort was set up around Parkfield, which was continued long after the ‘‘end’’ of the prediction. The earthquake eventually happened in
September 2004, 19 years ‘‘late’’.
[Figure 1.5: left, Parkfield earthquake dates plotted against event number, with straight-line fits including and excluding the 1934 event; right, the inter-event times ordered by size, labeled by the date at which each ended.]
What went wrong? The biggest mistake, not uncommon, was to adopt a set of
‘‘standard’’ methods without checking to see if the assumptions behind them were
appropriate: an approach often, and justly, derided as the cookbook method. In this case, the line fitting, and the range for the predicted date, assumed that both the x- and y-coordinates of the plotted points were random variables with a probability distribution somewhat similar to equation (1). But the event numbers, 1 through 6, are
as nonrandom as any sequence of numbers can be; the dates are not random either,
for they have to increase with event number. A more careful analysis of the series
shows that while the time predicted for the next earthquake in 1984 would still be
close to 1990, the range of ‘‘not-improbable’’ times would be much larger—the earthquake was nowhere near as imminent as the incorrect analysis suggested. No doubt
some level of monitoring would have been undertaken anyway, but a more thoughtful approach might have been taken if the statistical analysis had been done properly.⁴
A simple way of seeing what a better analysis would give is suggested by the
right-hand plot in Figure 1.5, which shows the inter-event times ordered by size, and
labeled by the date when each ended. The longest interval has an arrow extending
from 1984 (when the prediction was made) to the actual time of the earthquake. It
seems clear that the range of prior inter-event times is such that the most reasonable prediction for the next earthquake in 1984 would have been
‘‘probably soon, but a 10-20 year wait might well be expected’’.
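That better presentation requires almost no calculation (again a sketch of ours):

```python
import numpy as np

dates = np.array([1857, 1881, 1901, 1922, 1934, 1966], dtype=float)

intervals = np.diff(dates)       # times between successive Parkfield earthquakes
print(np.sort(intervals))        # [12. 20. 21. 24. 32.] years

# Seen from 1984: 18 years had already passed since the 1966 event, while the
# earlier intervals ranged from 12 to 32 years, so a further wait of a decade
# or more would not have been at all surprising.
print(1984 - dates[-1])
```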
The lesson is that you should learn not just a set of techniques to use, but also
when not to use them. In this case, one approach, and one kind of plot, were seriously misleading; a different presentation of the data would immediately have led to
different conclusions.
⁴ The original prediction was by W. Bakun and T. V. McEvilly, Recurrence models and Parkfield, California, earthquakes, J. Geophys. Res., 89, 3051-3058 (1984). The correct analysis is Y. Y. Kagan, Statistical aspects of Parkfield earthquake sequence and Parkfield prediction experiment, Tectonophys., 270, 207-219 (1997). This particular error is not unique to the Parkfield analysis; see Stephen M. Stigler, Terrestrial mass extinctions and galactic plane crossings, Nature, 313, 159 (1985).
© 2008 D. C. Agnew/C. Constable