“How Many Tanks Are There”
Teaching Contemporary Mathematics conference 2011
NC School of Science and Math
Philip Rash
Introduction
We begin with a classic problem, often referred to as the “German Tank Problem.” Suppose you are an
Allied Forces commander and have captured a small handful of German tank serial numbers. Assuming
the Germans numbered their tanks from 1 to N (some unknown maximum), how can we estimate the
value of N?
I normally use this activity to introduce the concept of sampling distributions and estimation in an AP
Statistics course (i.e. Chapter 9 in Practice of Statistics, Yates, Moore, and Starnes). Without giving
students much more information than that, I give students a set of 7 tank serial numbers and tell them
that their job is to estimate the total number of tanks. But more than that, they have to invent their own
method (or algorithm or formula) to estimate that maximum – a method that someone could easily
apply to any set of tank serial numbers. We also agree on a few simplifying assumptions: that tanks are
indeed numbered from 1 to N, and each tank is equally likely to have been observed.
Possible solutions
Following is a collection of methods, most of which I have seen students propose:
• Twice the sample mean (since the mean is in the middle, twice the mean should be the upper end)
• […] (similar justification as previous)
• […] (since the IQR contains the middle 50% of the data, adding that to the middle should be the
  upper extreme)
• Mean plus two standard deviations (two standard deviations above the mean usually captures most
  of a data set)
• Mean plus three standard deviations (three standard deviations above the mean usually captures
  most of a data set)
• […], where n is the number of tanks in our sample (this is algebraically equivalent to an idea a
  student had this year: start with […], but since that will likely overestimate the maximum,
  subtract from that value the amount by which […] underestimates the minimum, 1)
• Sample maximum plus sample minimum (the justification for this is that the “gap” between the
  sample maximum and N ought to be about as big as the “gap” between the sample minimum and 0)
• Sample maximum divided by a normal probability (compute the normal distribution probability of
  having a z-score less than the z-score of the sample maximum, then divide the sample maximum by
  this probability)
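To make a few of these proposals concrete, here is a minimal Python sketch of the methods whose formulas are spelled out in words above. The function names and the sample of serial numbers are hypothetical, chosen only for illustration. The sketch also assumes the sample standard deviation (dividing by n - 1) and a z-score computed from the sample's own mean and standard deviation, since the text does not say which versions students used.

```python
import statistics
from statistics import NormalDist


def twice_mean(sample):
    """Twice the sample mean."""
    return 2 * statistics.mean(sample)


def mean_plus_k_sd(sample, k):
    """Sample mean plus k sample standard deviations (n - 1 divisor, an assumption)."""
    return statistics.mean(sample) + k * statistics.stdev(sample)


def max_plus_min(sample):
    """Sample maximum plus sample minimum (the 'equal gaps' idea)."""
    return max(sample) + min(sample)


def max_over_normal_prob(sample):
    """Sample maximum divided by the normal probability below its z-score."""
    center = statistics.mean(sample)
    spread = statistics.stdev(sample)
    z_of_max = (max(sample) - center) / spread
    return max(sample) / NormalDist().cdf(z_of_max)


# Hypothetical set of 7 captured serial numbers (not from the original handout).
sample = [41, 87, 132, 190, 214, 268, 301]

for label, estimate in [
    ("twice the mean", twice_mean(sample)),
    ("mean + 2 SD", mean_plus_k_sd(sample, 2)),
    ("mean + 3 SD", mean_plus_k_sd(sample, 3)),
    ("max + min", max_plus_min(sample)),
    ("max / normal probability", max_over_normal_prob(sample)),
]:
    print(f"{label}: {estimate:.1f}")
```

Each function takes any list of serial numbers, which matches the requirement that a method be something anyone could easily apply to a new sample.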
Assessing the methods
Next we turn our attention to deciding which method is “best.” But of course first we need to articulate
what we mean by “best.” This is the part that is sometimes a leap for students: though we really only
have one sample of tank serial numbers, and we really do not know what the real maximum is, we’re
going to temporarily assume that we do know the maximum, take a random sample, and see how well
each method estimates that known maximum. The thinking is that if we can develop some confidence in
a particular method (under conditions in which we know the value that is to be estimated), then that
confidence extends to when we honestly do not know the value we’re trying to estimate.
For example, let’s assume that there are N = 350 tanks in the German army. We’ll draw a random
sample of 7 tanks from this population and compute our estimate for the maximum using our chosen
method. We’ll repeat this many times and look at the distribution of those estimates.
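The repeated-sampling procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assumed maximum of 350 is inferred from the error column of the results table, the samples are drawn without replacement (the text does not say which scheme was used), and `simulate` and `twice_mean` are hypothetical names.

```python
import random
import statistics

N = 350           # assumed "known" maximum for the experiment
SAMPLE_SIZE = 7   # tanks observed per sample
TRIALS = 10_000   # number of repeated samples


def twice_mean(sample):
    """One candidate method: twice the sample mean."""
    return 2 * statistics.mean(sample)


def simulate(estimator, n_max=N, k=SAMPLE_SIZE, trials=TRIALS, seed=0):
    """Apply `estimator` to many random samples drawn from 1..n_max."""
    rng = random.Random(seed)              # fixed seed so runs are repeatable
    population = range(1, n_max + 1)
    # rng.sample draws k distinct serial numbers (sampling without replacement).
    return [estimator(rng.sample(population, k)) for _ in range(trials)]


estimates = simulate(twice_mean)
print("mean of estimates:", round(statistics.mean(estimates), 1))
print("standard deviation:", round(statistics.stdev(estimates), 1))
```

Swapping a different function in for `twice_mean` gives the distribution of estimates for any other method.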
First we think about the notion of an unbiased estimator – that is, a statistic whose mean is equal to the
parameter the statistic is meant to estimate. Following is a table of our candidate methods and their
means (based on N = 350 and 10000 simulations).
Estimate    Mean     Error
[…]         350.3      0.3
[…]         351.9      1.9
[…]         305.3    -44.7
[…]         372.4     22.4
[…]         471.1    121.1
[…]         350.6      0.6
[…]         350.7      0.7
[…]         349.7     -0.3
[…]         481.6    131.6
[…]         458.7    108.7
[…]         350.5      0.5
[…]         591.9    241.9
[…]         340.7     -9.3
Of our candidate methods, several seem to be unbiased: the six whose simulated means fall within 2 of
the assumed maximum (350.3, 351.9, 350.6, 350.7, 349.7, and 350.5). We also note that two of these
are “nearly” algebraically equivalent (the difference being whether we consider 0 or 1 to be the
minimum tank value), so among those 2 methods from this point on we’ll only consider one.
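As a cross-check on the simulated means, the expected values of two of the proposed estimators can be worked out directly. This sketch assumes the n serial numbers are drawn without replacement from 1 to N, and it introduces its own notation (not in the original): \bar{x} for the sample mean, X_{(1)} for the sample minimum, and X_{(n)} for the sample maximum.

```latex
\begin{align*}
E[\bar{x}] &= \frac{N+1}{2}
  &&\Rightarrow\quad E[2\bar{x}] = N + 1, \\
E[X_{(1)}] &= \frac{N+1}{n+1}, \qquad E[X_{(n)}] = \frac{n(N+1)}{n+1}
  &&\Rightarrow\quad E[X_{(1)} + X_{(n)}] = N + 1.
\end{align*}
```

Under these assumptions both estimators overshoot N by 1 on average, which is small next to a maximum in the hundreds and hard to tell apart from simulation noise in 10000 trials.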
So now we need to decide which of our unbiased estimators is “best.” Among these, the
estimator with the least variance would, in a sense, give us the most information about the parameter
we’re trying to estimate. In other words, less variance means having a greater probability of seeing less
than a given amount of error.
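The link between variance and the chance of a small error can be checked by simulation. The sketch below compares two of the proposed methods (twice the mean, and maximum plus minimum) on the same simulated samples. The tolerance of 50, the function names, and the choice of sampling without replacement are assumptions made only for illustration.

```python
import random
import statistics

N = 350           # assumed true maximum, as in the simulation described earlier
SAMPLE_SIZE = 7
TRIALS = 10_000
TOLERANCE = 50    # hypothetical "acceptable error", chosen only for illustration


def twice_mean(sample):
    return 2 * statistics.mean(sample)


def max_plus_min(sample):
    return max(sample) + min(sample)


def summarize(estimator, seed=0):
    """Return (standard deviation, share of estimates within TOLERANCE of N)."""
    rng = random.Random(seed)
    population = range(1, N + 1)
    estimates = [estimator(rng.sample(population, SAMPLE_SIZE)) for _ in range(TRIALS)]
    spread = statistics.stdev(estimates)
    share_close = sum(abs(e - N) <= TOLERANCE for e in estimates) / TRIALS
    return spread, share_close


results = {f.__name__: summarize(f) for f in (twice_mean, max_plus_min)}
for name, (spread, share_close) in results.items():
    print(f"{name}: sd = {spread:.1f}, within ±{TOLERANCE} of N: {share_close:.1%}")
```

Because both methods use the same seed, they see identical samples, so any difference in the output comes from the method rather than from the luck of the draw.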
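Estimate    Mean     Standard Deviation
[…]         350.3     76.6
[…]         351.9    116.1
[…]         350.7     44.2
[…]         349.7     88.1
[…]         350.5     58.2
We note that the method with mean 350.7 and standard deviation 44.2 (or nearly equivalently, its
close algebraic relative set aside earlier) is both unbiased and has the smallest variance of the
five candidates.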
Attached as an appendix are histograms of each of our candidate methods. It’s interesting to note not
only the center and spread, but also the shape of each distribution. For a given method, for example,
how likely is observing a value more than a certain distance from our assumed maximum? These types
of questions lead well into discussions of hypothesis testing.
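For readers without the appendix at hand, a rough text histogram and a tail proportion can be produced with a short Python sketch. It uses the maximum-plus-minimum method as an example. The bin width of 50, the distance of 100, and sampling without replacement are all assumptions chosen for illustration, not values from the handout.

```python
import random
from collections import Counter

N = 350           # assumed true maximum
SAMPLE_SIZE = 7
TRIALS = 10_000
BIN_WIDTH = 50    # hypothetical histogram bin width
DISTANCE = 100    # hypothetical "certain distance" from the assumed maximum


def max_plus_min(sample):
    return max(sample) + min(sample)


rng = random.Random(0)
population = range(1, N + 1)
estimates = [max_plus_min(rng.sample(population, SAMPLE_SIZE)) for _ in range(TRIALS)]

# Count estimates per bin; each key is the lower edge of its bin.
bins = Counter((e // BIN_WIDTH) * BIN_WIDTH for e in estimates)

for left in sorted(bins):
    bar = "#" * (bins[left] // 100)   # one '#' per 100 estimates
    print(f"{left:>4}-{left + BIN_WIDTH - 1:<4} {bar}")

# Share of estimates that land more than DISTANCE away from N.
tail_share = sum(abs(e - N) > DISTANCE for e in estimates) / TRIALS
print(f"share more than {DISTANCE} from N: {tail_share:.1%}")
```

The tail share is the kind of quantity that a later hypothesis-testing discussion would treat as a probability of seeing a result at least this extreme.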
Also attached is a copy of a student handout I often use with this activity. It’s written for students to use
JMP (a statistical software program), but could be adapted for other software. Also, the second page of
this handout references a website where students can simulate drawing samples from many types of
populations. They can see sampling distributions for many different sample statistics, such as mean,
median, variance, range, etc.
Credits
Recognition to two of my NCSSM colleagues: Floyd Bullard authored most of the student handout, and
Dan Teague provided valuable resources as well.
http://www.guardian.co.uk/world/2006/jul/20/secondworldwar.tvandradio describes the historical
problem, noting that Allied statisticians estimated the number of tanks at 246 (produced per month
from 6/1940 to 9/1942). Post-war records revealed the actual value to be 245.