A Short Introduction to Probability for Statistics

Math 146
Part I. Modeling Observations
1 Statistical observations

Statistics is concerned with the analysis of quantitative observations, usually repeated several times. For example, we might consider

• Asking individuals about their opinions or preferences
• Measuring some physical or chemical quantity, e.g. the concentration of a chemical in a water pool
• Analyzing a flow of bits to determine if a signal is present, or whether it is just noise
• Observing some significant marker to determine whether a medical treatment was effective

We could go on for a long time. In such cases, we are not so much concerned about the single result of our observation, but rather what, if any, implication our observation(s) have for a more general conclusion; that is, was our observation just a random outcome, with no general implication, or was it evidence of a more general fact?

Such a question cannot be answered simply by looking at the data, since we are wondering about what is there behind the data. A reasonable answer will, necessarily, require that we add our own insight to the data, which, to be as effective as possible, means having a (preferably mathematical) model for the whole observation procedure. This approach, developed at the turn of the 20th century, has proved to be really effective, as opposed to the 18th-19th century focus on searching for representative observations, observing "typical" situations that could be extrapolated to more general cases. The problem with the latter approach (which survives, in a way, at least at the margins) is that it is practically impossible to define what is "typical" in an objective way.

The fundamental mathematical toolbox that has turned out to work is Probability Theory. To get deep into this theory we would need a lot more mathematical machinery than what is available to us here, but we can get an intuitive understanding if we take a few facts for granted.
2 A model for statistical observations
We will consider observations that result in numerical data. The simplest experiment that we can consider is the flip of a coin. Let's say we are asking whether it will turn up heads, and we record 1 if it does, 0 if it turns up tails. While there will be only one outcome, as far as we can tell beforehand, it could be a 0 or a 1. We call such an uncertain outcome a Random Variable: a "variable" because it can take more than one value, and "random" because we cannot tell with certainty what this value will be. As usual in math, we assign a symbol to this random variable; let's say, for example, we call it X.

A probabilistic model consists in assigning to each possible value of X (there are two in this case, but, in general, there will be many more) a number between 0 and 1 which, loosely speaking, quantifies the likelihood of each value occurring: 0 means that it will not occur, 1 that it will certainly occur, and an intermediate value corresponds to uncertainty about its occurrence (the closer to 1, the higher the likelihood of it occurring). We call these numbers probabilities, and write, in our example,

P[X = 1] = p

where p is the probability that we assign to the coin turning up heads.
2.1 Consistency of probability assignments

We assigned a probability to the occurrence (the technical name in probability is event) X = 1, so what about X = 0? This can be determined by listing a consistency requirement, technically an axiom, that probability assignments have to satisfy.

Additivity: Outcomes that cannot occur simultaneously have a combined probability that is the sum of the individual probabilities. In our case, the coin will turn out either heads or tails, not both, so the probability that either one or the other occurs will be calculated as

P[X = 0 or X = 1] = P[X = 0] + P[X = 1]    (1)

In this simple case that determines P[X = 0], because we already stipulated (that's another axiom, even if we did not flag it as such) that something that will certainly occur is assigned probability 1, and it is certain that either heads or tails will occur, hence P[X = 0 or X = 1] = 1. From (1) it then follows that

P[X = 0] = P[X = 0 or X = 1] − P[X = 1] = 1 − P[X = 1] = 1 − p
2.2 How do we assign probabilities?

This question is surprisingly delicate, and has led to very heated arguments. Simplifying things, we can list three basic approaches; note that they are not mutually exclusive, meaning that, depending on the specific situation, a researcher might resort to one of these approaches, only to use another one in a different case.

1. The Classical Model. This relies on the ability to set up a model where a random variable X can take a finite number of values, and, since we see no reason to believe that one value is more or less likely than any other[1], each is assigned the same probability. In the coin tossing case, if we have no reason to think that the coin is biased (that is, that it will more likely turn up on one side rather than the other), we would assume p = 1 − p, so p = 1/2. A slightly richer case would be a roulette wheel where, in most American roulettes, we have 38 slots (labeled 1 to 36 and, additionally, a 0 and a 00 slot). Here X can take values 0, 00, 1, 2, ..., 36, and we would assign probability 1/38 to each[2]. A yet more elaborate example is the throw of two dice. It turns out that the best model considers as equally likely all pairs (i, j), where both i and j take on all possible values between 1 and 6. There are 36 such pairs, so any outcome (i, j) is assigned a probability of 1/36[3].
2. The Frequentist Model. This is the usual model applied when addressing situations that are (or can be) repeated many times over. It is loosely based on a basic theorem in probability, but that's not its foundation, as that would lead to a circular argument. Basically, it states that if an outcome has probability p, then, if we repeat the observation over and over again, making sure that each observation is not influenced by any of the previous ones (for example, when flipping a coin we are not cheating so that it will turn up as in the previous toss, or vice-versa, both of which cheats are pretty easy to do), the frequency of occurrence of this outcome will approach its probability[4] (the sketch after this list illustrates this with a simulation). Thus if we are tossing coins and P[X = 1] = 1/2, over many tosses roughly half the outcomes should turn out heads[5]. As you will realize, this approach is the foundation of what goes under the name of Classical Statistics, where the probabilities in a model are assigned based on repeated observations of the phenomenon in question.

[1] This assumption goes under the moniker of Principle of Sufficient Reason.
[2] Real life roulettes, just like real life coins, are unlikely to conform so neatly to the ideal. You have possibly enjoyed one of the several movies where clever gamblers were able to spot the bias in a roulette, and/or how the croupier manages it, and gain a lot of money; in real casinos, anybody suspected of keeping track of long-term roulette outcomes is very likely to be thrown out right away (this has to do with the second meaning of probability, discussed below).
[3] Of course, an argument similar to the one in the previous footnote applies here too.
[4] This goes by the name of Empirical Law of Large Numbers. We will mention the mathematical Law of Large Numbers, which might look similar, but is logically unrelated. That this empirical law should hold could be taken as a definition of a probabilistic experiment, in the sense that if repeating an experiment over and over seems to indicate that the frequency of occurrences does settle to a specific value, we can assume that probability is a good model for this experiment.
3. The Subjective Model. This is particularly popular when addressing one-time events (as opposed to repeated experiments), where the frequentist approach is simply impossible. In this model, probabilities are subjective assessments of how likely a given event is to occur. The classical way of doing this is to ask the subject to provide odds that s/he would be willing to accept on a bet over the outcome. This approach has evolved into a more systematic method, known as Bayesian Statistics, that has gained considerable ground lately. Relying on a deceptively simple theorem, known as Bayes' Theorem, it goes like this: given the problem of assessing a probability (for example, is this coin fair, that is, is p = 1/2?), we start by considering p as a random variable (after all, we do not know its value for sure), and assign an a priori probability distribution to it (a popular choice in this case would be a uniform distribution over [0, 1], as defined below). We now perform repeated tosses, and use the outcome to change, if warranted, our initial guess, using Bayes' Theorem to do so in a precise way. This method requires, in practical situations, very large computing power, and the current availability of such power has made it very popular. In fact, rigorous approaches to Big Data are mostly based on Bayesian methods. We will not address Bayesian techniques in this class, as they require significant mathematical developments, not to mention the short time we have on our hands.
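To make the first two models concrete, here is a minimal sketch (in Python, with made-up seed and sample sizes) that computes a classical probability by enumerating equally likely outcomes, and then estimates the same probability in the frequentist spirit, watching the frequency of repeated independent throws settle near the classical value.

```python
from fractions import Fraction
from itertools import product
import random

# Classical model: all 36 pairs (i, j) for two dice are equally likely,
# so P[sum = 7] = (favorable pairs) / 36.
pairs = list(product(range(1, 7), repeat=2))
p_classical = Fraction(sum(1 for i, j in pairs if i + j == 7), len(pairs))
print(p_classical)  # 1/6

# Frequentist model: repeat the throw many times; the frequency of
# "sum = 7" should approach the probability (Empirical Law of Large Numbers).
random.seed(0)
for n in (100, 10_000, 1_000_000):
    hits = sum(1 for _ in range(n)
               if random.randint(1, 6) + random.randint(1, 6) == 7)
    print(n, hits / n)  # drifts toward 1/6, about 0.1667
```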
Classical statistics, which is what we will look at, is based on model number 2. Given our time limitation, we will consider only the simplest applications, which, however, should help you get a feeling for its methods. This is still the most common approach in the medical and social sciences, even though it has been questioned for a number of reasons[6]. It may easily happen that you end up using its tools in your future job, and, with an understanding of the philosophy that underpins it, you should have no trouble adopting methods that we will have had no time to look at, but which are based on the same principles.
[5] This does not mean at all that, if we toss a fair coin enough times, say 1,000,000, the number of heads will approach half of the number of tosses, say 500,000. If we toss a coin N times, and n outcomes are heads, we would expect n/N to approach 1/2, but that's very different from n approaching N/2: for example, if n ≈ N/2 + √N/2, we will still have n/N ≈ 1/2, while n is drifting away from N/2.
[6] The criticisms levied at experiments based on classical statistics are not trivial. It is true that a number of faults refer to researchers not following proper statistical methods, thus drawing conclusions that are unwarranted even within the classical framework. What is more troubling is that, in a distressingly large number of cases, properly obtained results could not be replicated by independent researchers or even by the original researchers. Of course, the gold standard of scientific research is reproducibility of experimental outcomes, so this is a very serious issue. The arguments about this are ongoing and definitely beyond our scope, and the best we can do, for now, is be aware that statistical statements need to be taken with a large grain of salt.
3 Probability Models
3.1 Generalities
3.2 Distributions
Let's consider a random variable X that may take n possible values x_1, x_2, ..., x_n, with probabilities p_k = P[X = x_k]. By consistency, we need p_k > 0, and p_1 + p_2 + ··· + p_n = 1. For brevity, the sum on the left-hand side is usually written using sigma notation: Σ_{k=1}^n p_k. The collection of numbers p_1, p_2, ..., p_n is called the probability distribution of the random variable X.
It is rare that we will look at a single random variable; it is common to have to look at more than one quantity. At the very least, even if we are looking at a single quantity (like, say, the concentration of a chemical in a body of water), we are well advised to take more than one measurement, resulting in observing several random variables (presumably all with the same distribution). In these cases we will be looking at (for simplicity, let's look at two random variables)

P[X_1 = x_i, X_2 = x_j] = p_ij

(the comma stands, as is customary in probability, for "and": we are looking at the probability that X_1 take the value x_i, and X_2 take the value x_j). Considering all possible values of p_ij, this is often called the joint distribution of X_1 and X_2. Determining the joint distribution of several random variables is, in general, cumbersome, and requires some specific information on how each may affect the others, but there is one case when things are simple: when the variables are independent.

Definition. Random variables X_1, X_2, ..., X_r are independent[7] if all joint probabilities factor: for all possible choices of x_1, x_2, ..., x_r,

P[X_1 = x_1, X_2 = x_2, ..., X_r = x_r] = P[X_1 = x_1] · P[X_2 = x_2] · ... · P[X_r = x_r]

In this case, determining the joint distribution is equivalent to determining the distribution of each random variable separately. In particular, if all the random variables have the same distribution, the collection is said to be a set of independent, identically distributed random variables (i.i.d.), and is called a simple random sample of the common distribution.

[7] A motivation for this definition relies on the notion of conditional probabilities. This is an important concept, but we can skip it for the strict purpose of collecting facts we will need for our statistical tools. Do look at the corresponding section in the book, though.
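The factorization condition is mechanical to check when a joint distribution is given as a table. Below is a minimal sketch, using the two-dice joint distribution as an assumed example: it computes both marginals and verifies that every joint probability is their product.

```python
import itertools

# Joint distribution of two fair dice as {(x1, x2): probability}; this
# particular table factors, so the variables are independent.
joint = {(i, j): 1/36 for i, j in itertools.product(range(1, 7), repeat=2)}

def is_independent(joint, tol=1e-12):
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    # Marginal distributions of X1 and X2.
    p1 = {x: sum(joint[(x, y)] for y in ys) for x in xs}
    p2 = {y: sum(joint[(x, y)] for x in xs) for y in ys}
    # Independence: every joint probability equals the product of marginals.
    return all(abs(joint[(x, y)] - p1[x] * p2[y]) <= tol
               for x in xs for y in ys)

print(is_independent(joint))  # True
```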
3.3 Specific models

3.3.1 Discrete[8] distributions

In most of our applications, we will assume that the distribution we are working with belongs to a specific class. Here are three of the simplest:

• Bernoulli Distribution: a random variable with a Bernoulli distribution can take only two values, for example 0 and 1. Thus, we will have P[X = 1] = p, P[X = 0] = 1 − p. The value of p depends on the specific case we are dealing with. A statistical question could be to determine p from observations of several random variables all with the same Bernoulli distribution (of a simple random sample of a Bernoulli distribution).
• Binomial Distribution: It turns out that if X_1, X_2, ..., X_n is a simple random sample of a Bernoulli distribution with parameter p, the sum Y = Σ_{k=1}^n X_k of these independent variables, which will take values 0, 1, 2, ..., n, has a distribution given by

P[Y = r] = (n! / (r!(n − r)!)) p^r (1 − p)^(n−r)

(n!, read "n factorial", is defined as the product of all integers between 1 and n: n! = 1 · 2 · 3 · ... · n). A short sketch after this list computes this distribution, together with the geometric one below.
• Geometric Distribution: Consider a (potentially unbounded) sequence of independent Bernoulli variables X_1, X_2, X_3, .... We look at the first variable that takes the value 1. It could be the first, the second, and so on; in principle, we could keep going and never get a 1, of course. Call G the index of this variable. Its possible values are 1, 2, 3, ..., without bound. By independence, it is easy to see that

P[G = k] = P[X_1 = 0] · P[X_2 = 0] · ... · P[X_{k−1} = 0] · P[X_k = 1] = (1 − p)^(k−1) p

(Note that, as k becomes large, since 1 − p < 1, the probability that G takes this large value becomes smaller and smaller, approaching 0 as k becomes really large)[9]
• Uniform Distribution over n values a_1, a_2, ..., a_n: that's a distribution that assigns probability 1/n to each of the values. That's the distribution that the classical method assigns to the possible values.

[8] That means that the values our variables can take are discrete, i.e., can be listed; in the examples here, the values are positive integers. In principle, since we can only observe quantities up to some finite precision, and within some finite range, all random variables are discrete. However, as we discuss in the next subsection, it is unwieldy to work with a really huge number of values that are very close to one another (like, say, all fractions of the form n/10^10, for all integers n).

[9] This is the standard model for playing the lottery or any other game of pure chance, where each round is independent of the others, and your probability of winning is p. Thus, even if p is very small, the probability of never winning becomes eventually very small. Unfortunately, it may take several lifetimes to make this probability really small. Also, as can be seen by independence (and more precisely, using conditional probabilities), things do not improve if you play, say, N times, and never win: the next rounds are independent of the ones you played, so your chances of winning, given that you lost N times, are the same as your chance at the start.
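As promised above, here is a minimal sketch computing the binomial and geometric formulas (with an arbitrary p chosen for illustration); a quick check confirms each set of probabilities is consistent, i.e., sums to 1.

```python
from math import comb

p, n = 0.3, 10   # illustrative parameter values

# Binomial: P[Y = r] = n!/(r!(n-r)!) * p^r * (1-p)^(n-r).
# math.comb(n, r) is exactly the factorial coefficient.
binomial = [comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]
print(sum(binomial))  # 1.0 (up to rounding): the probabilities are consistent

# Geometric: P[G = k] = (1-p)^(k-1) * p, for k = 1, 2, 3, ...
geometric = [(1 - p)**(k - 1) * p for k in range(1, 200)]
print(sum(geometric))  # close to 1; the tail beyond k = 199 is tiny
```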
Of course, there are many other discrete distribution models (an important one is the Poisson model, where X can take any integer value k = 0, 1, 2, ... and P[X = k] = (λ^k / k!) e^(−λ); here λ > 0 is a parameter), but we will actually not work with any of these discrete distributions directly.
3.3.2 Continuous Distributions

Suppose you are timing the arrival of the first customer at a service station. If that's a bank teller, you might record the time with a precision of a minute. If it is a computer network service, like a printer or a web page request, the timing could be precise up to the hundredth of a second. In any case, the possible values of your observations form a very large set, they are very close to each other (the difference between 2 and 3 minutes is just one minute, and it's much worse if you are at the hundredths of seconds), and the probability of an event occurring exactly at a given value is extremely small (eventually, an event will occur and that will lead to a value for your observation, but, a priori, it is extremely unlikely that the first web page request will occur at any fixed time point).

When faced with these situations, which are extremely common, it makes little sense to consider a list of possible outcomes, and it is much more practical to consider the outcome set to consist of all real numbers, with probabilities assigned (in a consistent way) not to individual outcomes, but to intervals. So we will not ask, for example, what is the probability that our first web page request will occur after exactly 230.302 seconds (a highly unlikely event), but rather, say, what is the probability that our first web page request will occur in the interval between 200 and 300 seconds. Thus, rather than looking at statements of the form P[X = x], we will look at statements of the form P[a ≤ X ≤ b] (with < instead of ≤, depending on preferences). In this case, we talk of continuous random variables.

With some advanced math tools, one can build a complete theory, starting with approximations from discrete random variables, and this construction, completed in the 1930s by the great Russian mathematician A.N. Kolmogorov, provided the rigorous foundation for probability as a completely legitimate part of mathematics.
We will consider only a less general class of continuous variables, the so-called absolutely continuous random variables. They are characterized by the existence of a continuous function (called a probability density) f(x) ≥ 0 such that, for any a and b, P[a ≤ X ≤ b] is given by the area of the portion of the coordinate plane between the graph of f and the horizontal axis, between a and b. From calculus, this area is denoted as ∫_a^b f(x) dx. Note that the area below a single point is 0, corresponding to the fact that we cannot really consider the probability of a specific single real number occurring as outcome; thus

P[a < X < b] = P[a ≤ X ≤ b] = P[a < X ≤ b] = P[a ≤ X < b]

for absolutely continuous random variables. This extends to the whole real line (as in a = −∞, b = ∞), and since an outcome has to occur, this means that

P[−∞ < X < ∞] = ∫_{−∞}^{∞} f(x) dx = 1

(note the <, since ∞ is not a number, but a symbol indicating "without bounds", and thus no variable can be equal to infinity).
It is common usage to consider, instead of the density, the so-called cumulative distribution function (cdf), defined as

F(x) = P[X ≤ x] = ∫_{−∞}^{x} f(t) dt

By consistency,

P[a < X ≤ b] = F(b) − F(a)

so that knowledge of the cdf allows us to compute the probability corresponding to any interval.
A common model for the first arrival problem we started with is the exponential distribution, where the density is given by

f(x) = 0 for x < 0,    f(x) = λ e^(−λx) for x ≥ 0

where λ > 0 is a parameter. It turns out that, for exponential distributions, the cdf is given by

F(x) = 1 − e^(−λx) for x ≥ 0 (and F(x) = 0 for x < 0)
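Here is a minimal sketch of the interval computation P[a < X ≤ b] = F(b) − F(a) for this model; the rate λ below is a made-up value, corresponding to a mean waiting time of 250 seconds, in the spirit of the web-request example.

```python
import math

lam = 1 / 250.0   # assumed rate: mean waiting time 1/lam = 250 seconds

def exp_cdf(x, lam):
    # cdf of the exponential distribution: F(x) = 1 - e^(-lam*x) for x >= 0.
    return 0.0 if x < 0 else 1.0 - math.exp(-lam * x)

# Probability that the first request arrives between 200 and 300 seconds.
print(exp_cdf(300, lam) - exp_cdf(200, lam))  # about 0.148
```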
A simple and useful continuous distribution is the uniform distribution over an interval [a, b]. This assigns to any subinterval of [a, b] a probability proportional to the length of the subinterval. Thus, if a ≤ c ≤ d ≤ b, P[c ≤ X ≤ d] = (d − c)/(b − a). In the useful special case of a uniform distribution over [0, 1], if 0 ≤ c ≤ d ≤ 1, then P[c ≤ X ≤ d] = d − c.
A very common model, motivated by a basic theorem that we will discuss momentarily, is the Normal or Gaussian distribution, defined by two parameters, commonly denoted by µ and σ, whose density is given by

f(x) = (1 / (√(2π) σ)) e^(−(x−µ)² / (2σ²))

You will often see the notation X ∼ N(µ, σ) for such a random variable. A special case is µ = 0, σ = 1, the so-called standard normal distribution. It turns out that if X ∼ N(µ, σ), then the new random variable Z = (X − µ)/σ is N(0, 1).

The cdf for a normal random variable cannot be written in terms of familiar functions, but the cdf for N(0, 1) variables, often denoted by Φ, has been calculated to any desired precision, and is available as a table in all statistics and probability books[10].

[10] Spreadsheets and statistical software will be happy to give you directly the cdf of any normal variable, not only the standard one. When these are not available, we can use a table for N(0, 1), using the fact that if Z is standard then X = µ + σZ is N(µ, σ).
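In code, Φ can be obtained from the error function (via the identity Φ(t) = (1 + erf(t/√2))/2), and the standardization Z = (X − µ)/σ reduces any normal probability to it. A minimal sketch, with the second set of parameters made up for illustration:

```python
import math

def phi(t):
    # cdf of the standard normal, via the error function:
    # Phi(t) = (1 + erf(t / sqrt(2))) / 2.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def normal_prob(a, b, mu, sigma):
    # P[a <= X <= b] for X ~ N(mu, sigma), standardizing Z = (X - mu)/sigma.
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

print(normal_prob(-1, 1, 0, 1))       # about 0.6827: within one sigma
print(normal_prob(90, 110, 100, 15))  # hypothetical N(100, 15) example
```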
3.4 Indexes

It turns out that some quantities about a distribution can be convenient to evaluate and use. In fact, when considering a specific class of models (like binomial, Normal, and so on), these quantities are usually sufficient to specify the model completely. The main ones are:

• Expectation (also expected value, and sometimes mean): in the discrete case it's defined as EX = Σ_{k=1}^n x_k P[X = x_k]. With a little more advanced math this definition extends to the case when there are infinitely many values and when the variable is continuous. If X has a Bernoulli distribution it is easy to see that EX = p. If it has a binomial distribution, then EX = np. For the geometric distribution, EX = 1/p. For the exponential distribution, EX = 1/λ, and for the normal distribution EX = µ. As you can see, in all but the last example EX fully specifies the model. We note the remarkable fact (also easy to prove) that the expectation of the sum of n random variables is equal to the sum of the expectations.
• Variance: that's the expected value of the square of the difference between the random variable and its expectation:

Var[X] = E[(X − EX)²]

Using the square makes the math easier (it may not be obvious, but that's the case), and it makes sure we count positive and negative differences equally. Exponential distributions have variance 1/λ², and normal distributions have variance σ². The variance of the sum of n random variables is not given by the sum of the individual variances at all, except in the case of independent[11] variables.

• Moments: more generally, we can define

M_r = E[(X − EX)^r]

as the r-th moment. Conventionally, the expectation is considered the first moment. The variance is, obviously, the 2nd moment. We will not use these indexes, and variations on them, in our examples. Note that when infinitely many values are possible, it may happen that higher moments, or even the variance or even the expectation, may not be defined. It is not hard to show that if the r-th moment exists, then all lower moments are also defined.
[11] Actually, the condition for the variance of a sum to be the sum of the variances is far less restrictive, but in our applications, the deciding factor will always be independence.
• Percentiles or Quantiles: These are numbers q_p such that P[X ≤ q_p] = p. In particular, q_{1/2} is called the median, q_{1/4} the first quartile, and q_{3/4} the third quartile (the median is the second quartile). You may also read about quintiles (quantiles such that p = k/5 for k = 1, 2, 3, 4), or percentiles, referring to p = k/100, for k = 1, 2, ..., 99. These are all exactly defined for continuous random variables, but not for discrete ones, where there are obvious ambiguities. There are conventions set up to define a unique quantile for discrete random variables, so that your spreadsheet will come up with a specific number, but they are just that: conventions. For example, if your random variable assigns probability 1/4 to each of the values 1, 2, 3, 4, any number between 2 and 3 qualifies as a median, even if the most common convention is to call 2.5 the median. If a distribution has a density with a graph symmetric around a value, that value will be equal to the expectation and to the median; that's the case of a normal distribution, where both are equal to µ. Lacking this symmetry, the two are different: the expectation of an exponential variable is 1/λ, while the median is the solution to the equation 1 − e^(−λx) = 1/2, that is

e^(−λx) = 1/2
λx = ln 2
x = (ln 2)/λ ≈ 0.693/λ < 1/λ

Thus, an exponential random variable is more likely to take values smaller than the expectation, rather than larger.
• Mode: This is an index we will have no use for, and it is only meaningful for (some) discrete variables. This is the value of X that has the highest probability[12].

• Mean Absolute Deviation: We will not have any use for this either. It is the measure of dispersion best associated with the median: if X has median m, the MAD is E[|X − m|]. There are statistical applications for this measure and its associated index, the median, but they are technically less easy than the more common team of expectation and variance. You will see that some books define the MAD using the expectation instead of the median in the formula above, but this is a poor choice, with no rigorous reason.

[12] For absolutely continuous random variables, the mode is often defined as the value(s) for which the density has a maximum.
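A minimal simulation sketch tying several of these indexes together for the exponential distribution: with an arbitrary λ, the sample mean, variance, and median of a large simulated sample should land near the theoretical values 1/λ, 1/λ², and (ln 2)/λ quoted above.

```python
import math, random

lam = 2.0                      # arbitrary rate for illustration
random.seed(1)
sample = [random.expovariate(lam) for _ in range(100_000)]

mean = sum(sample) / len(sample)
var = sum((x - mean)**2 for x in sample) / len(sample)
median = sorted(sample)[len(sample) // 2]

print(mean, 1 / lam)               # both near 0.5
print(var, 1 / lam**2)             # both near 0.25
print(median, math.log(2) / lam)   # both near 0.3466 (median < expectation)
```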
We will concern ourselves only with parametric statistics. This means that we will observe random events that we assume can be described with a specific distribution form (e.g., normal, exponential, and so on), so that our problem consists in identifying the parameter(s) that characterize the distribution and which, in all practical cases, are expectation and/or variance and/or other moments, or are immediately connected to them (e.g., the expectation of an exponential random variable is the reciprocal of the parameter we denoted by λ). Also, we will only look at (absolutely) continuous distributions. One can apply this methodology to discrete distributions, but this requires more work, and is not as common in practical applications.
Part II. Estimating Distributions
4 Simple Random Sampling
Suppose we want to estimate the concentration of a given chemical in a body of water. We don't know what the concentration is, so we consider the result of the measurement random: a random variable, say C. If we take more than one measurement, we can be confident that we will get different numbers, because of a variety of factors (the concentration will not be exactly uniform over the body, our measuring instrument will have its own irregularities, and other factors that we may not even be aware of). We model this as a probability distribution for our random variable C. We will have to assume that this distribution belongs to a specific family, since we are only doing parametric statistics, and we will want to identify, as an example, the mean of this distribution.

The main tool for this project is to perform several observations, all under the same conditions, and in a way that outcomes do not affect each other, which is modeled as observing a number of independent identically distributed random variables, what is known as a simple random sample: C_1, C_2, C_3, ..., C_n. The outcome of this experiment is a set of numbers c_1, c_2, c_3, ..., c_n. Of course, if we came back the next day and took n more samples, the outputs would most likely be different from these.

Two basic theorems motivate the calculation of a first summary for these observations:
4.1 The Law of Large Numbers

This theorem (abbreviated LLN), possibly the first "hard" theorem in probability, says that:

Given a simple random sample X_1, X_2, ..., X_n from a distribution with expectation µ, as n becomes larger and larger, for any positive number δ, the probability

P[ |(Σ_{k=1}^n X_k)/n − µ| < δ ] > 1 − ε

(where ε is any given positive number, presumably very small). In words, the probability of the sample mean being very close to the expectation becomes closer and closer to 1. If δ and ε are very small, it is almost certain that the first expression will be practically equal to the expectation; however, this may require a very large sample.

Thus, the quantity X̄ = (Σ_{k=1}^n X_k)/n, called the sample mean, will be very close to the common expectation with extremely high probability. In many rough and tumble applications, we may thus make a number of observations, and take the resulting sample mean as a reasonable estimate of the expectation. This is often expressed by saying that the sample mean is an estimator of the expectation.

The sample mean could have a very different distribution than that of each observation. However, we may note the following properties: if the (independent) observations all have expectation µ and variance σ²,

E[X̄] = µ,    Var[X̄] = σ²/n    (2)
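A minimal simulation of the LLN under an assumed exponential model: as the sample size n grows, the sample mean settles near the expectation µ = 1/λ, here 0.5.

```python
import random

lam = 2.0            # assumed model: exponential with expectation 1/lam = 0.5
random.seed(2)

for n in (10, 100, 10_000, 1_000_000):
    x_bar = sum(random.expovariate(lam) for _ in range(n)) / n
    print(n, x_bar)  # the sample mean drifts toward 0.5 as n grows
```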
4.2 The Central Limit Theorem

The LLN tells us that the sample mean will very likely be close to the expectation, but we would like to know how likely this is. In general, this is a difficult estimate, but, for reasonable distributions, there is a shortcut that we will use extensively. This theorem (abbreviated CLT) says, very roughly, that for large enough simple random samples, if the original distribution is nice enough (technically, this means that at least four moments are defined), the sample mean will be approximately distributed as a normal random variable. Technically, this is expressed by saying that

P[ √n (X̄ − µ)/σ ≤ t ]

becomes closer and closer to Φ(t), the cdf of the standard normal distribution, as n grows.

We may want to remark that if the individual observations in a random sample are normally distributed, N(µ, σ), the sample mean X̄ is also normally distributed, as N(µ, σ/√n). Once again, for this approximation to be effective, the sample size n has to be large enough. How large? That depends on the specific distribution we are working with. As long as the distribution we use as a model is reasonably symmetrical, and reasonably concentrated, the approximation kicks in fairly soon[13].

[13] A classic case is a uniform distribution over [0, 1], where a sample of size 12 can be assumed to be large enough, so much so that for many years IBM computers used this fact to simulate a normal distribution, by adding 12 uniform random variables.
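A minimal sketch of the trick mentioned in footnote [13]: the sum of 12 independent Uniform[0, 1] variables has mean 6 and variance 12 · (1/12) = 1, so subtracting 6 gives an approximately standard normal variable.

```python
import random, statistics

random.seed(3)
# Each z is (sum of 12 uniforms) - 6: mean 0, variance 1, roughly N(0, 1).
z = [sum(random.random() for _ in range(12)) - 6 for _ in range(100_000)]

print(statistics.mean(z))   # close to 0
print(statistics.stdev(z))  # close to 1
# A tail check against the normal value Phi(-1.96), about 0.025:
print(sum(1 for v in z if v <= -1.96) / len(z))
```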
This theorem can be read in a few different ways. One that has many applications in modeling reads the statement as saying that

Σ_{k=1}^n (X_k − µ)/√n

is approximately normal with mean 0 and variance σ². If n is large, each term is a small random variable with mean 0. Thus the sum of many small independent variables with mean 0 is approximately normal. This justifies, for example, Maxwell's law for gases, stating that the velocity of a gas particle is normally distributed: this velocity is the result of the very many very small collisions each particle undergoes with the other particles, and the cumulative effect results in a normal distribution. A similar argument justifies the standard theory of measurement errors: measuring a physical quantity, say the mass of an atom, is a high precision operation, but outcomes are affected by very small uncontrollable external factors (cosmic rays, minimal earth temblors, ...), so we may assume that we are observing a normal random variable, centered on the "true" value of the mass. Albert Einstein's Nobel prize was not awarded in recognition of his Theory of Relativity, but rather for his previous work, including his theory of Brownian Motion, the erratic motion of particles suspended in a liquid, observed by the botanist Brown a few decades earlier. Einstein successfully modeled the motion of the particle by assigning a normal distribution to its position at any instant of time, as a result of many small impacts of the molecules in the liquid with the particle[14].
We will rely on this theorem extensively, but we need to remember that it applies to collections of independent identically distributed observations. Also, the speed at which Φ is approached depends on the original distribution: the less symmetric it is, the slower the speed. Forgetting any of these conditions can lead to erroneous conclusions[15].

[14] This result was expanded and made more rigorous, and also richer, by Norbert Wiener, the proper father of the mathematical Brownian Motion, another decade later, leading to a vast new field, the theory of Random Processes.

[15] By now, a standard example of erroneous conclusions is provided by the housing crash of 2007. The complicated financial constructions built around housing mortgages were predicated on the assumption that default risks were essentially normally distributed. This may be reasonable in a "normal" environment, where one household's default need not have any effect on others. In the sub-prime lending frenzy, however, defaults snowballed, creating an unexpectedly large risk, as banks and eventually governments realized the hard way. Careful observers, including ones in the lending community, were aware of this threat, but calls to caution were largely ignored until it was too late. An otherwise very careful, somewhat less elementary, textbook argues that, without any real information on the distribution underlying a statistical experiment, we should automatically assume it is Gaussian. While this corresponds pretty much to common practice, it is not necessarily a wise choice under any circumstance, as the market crash of 2007 illustrated starkly.
4.3 Empirical distribution and estimators for expectation and variance

If we observe a simple random sample, resulting in n numbers x_1, x_2, ..., x_n, from a distribution with (theoretical) expectation µ and (theoretical) variance σ², we may want to use our sample to get a grip on these values. The LLN tells us that

x̄ = (1/n) Σ_{k=1}^n x_k

is an estimator for the expectation µ. In fact, it is useful to consider our sample as a random distribution (called the empirical distribution), where the possible values are the n observations, and each is given probability 1/n. Of course, if we repeated this experiment, the numbers obtained would be different, and hence the distribution would be different; hence the "random" qualification. As you will notice, an empirical distribution is a (discrete) uniform distribution over the observed values.

Thus, the sample mean is the expectation of the empirical distribution. Applying our definition of variance, we have that the analog for the empirical distribution is given by

S² = (1/n) Σ_{k=1}^n (x_k − x̄)²    (3)

(the notation S² is not particularly common; the book uses σ², but that makes it confusing with the variance of a non-random distribution). Note that x̄ and S² are values taken by random variables, since another round of observations will produce different values.

It is useful to note (it is not hard to see this, with only a little algebra) that

E[X̄] = µ,    E[S²] = ((n − 1)/n) σ²

Technically, the terminology is that X̄ is an unbiased estimator for the expectation, while S² is a biased estimator for the variance. This is of minimal significance in practice (in statistics, we don't know the value of the expectation or the variance anyway). More significant is the fact that, thanks to the Law of Large Numbers, both are consistent, that is, they are more and more likely to be closer and closer to the true values as n grows (note that (n − 1)/n ≈ 1 as soon as n gets large).

Actually, looking at the cdf of the empirical distribution, it can be shown that, as n grows, it will approach the cdf of the original distribution. This is the basis for many non-parametric statistical procedures, which, however, we will have no time to consider.

Finally, note that an empirical distribution is just another discrete distribution (albeit a random one), so other indexes, such as quantiles, median, mode, and so on, all apply (being discrete, quantiles are usually ambiguous, with different conventions adopted by different authors to resolve the, actually harmless, ambiguity).
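A minimal simulation checking the bias formula E[S²] = ((n − 1)/n) σ²: averaging the empirical variance S² over many simulated samples of size n = 5 from N(0, 1) (so σ² = 1) should give a value near 4/5.

```python
import random

random.seed(4)
n, trials = 5, 200_000

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]   # sigma^2 = 1
    x_bar = sum(xs) / n
    total += sum((x - x_bar)**2 for x in xs) / n  # S^2: divide by n

print(total / trials)   # close to (n - 1)/n = 0.8, exhibiting the bias
```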
4.3.1 The n − 1 story

Early in the 20th century, a quality control employee of the Guinness brewery in Dublin named Gosset developed a new statistical procedure to estimate the quality of the business's brew, to remarkable success. His procedure was based on a combination of the sample mean and the variance defined in (3). His work was greatly appreciated by one of the main founders of modern classical statistics, the biologist Ronald Fisher. Fisher, however, was also enthralled by a way of classifying random quantities constructed from samples that has useful traits in the Gaussian case, but not otherwise. Gaussian models are the most commonly used in classical statistics, so this is not necessarily an odd choice, but it has no bearing on the particular problem that Gosset addressed. Nonetheless, Fisher judged that the proper estimator for σ² was not (3), but the unbiased correction

s² = (1/(n − 1)) Σ_{k=1}^n (x_k − x̄)²    (4)

which is now commonly called sample variance, with (3) oddly called population variance. Further, Fisher reworked Gosset's method using (4), rather than (3). Since Fisher wielded (and still wields) enormous authority, this terminology and his reworked method became the standard, but you should be aware that the dominant use of s² instead of S² is purely a historical accident, with no greater significance.

One obvious fact is that S² < s² (the sum of squares is divided by n instead of n − 1), so that, if you are using these as bare estimators for σ², s² is more "pessimistic"[16]. However, this is a rough procedure, and as far as the more careful procedures we will look at are concerned, it makes absolutely no difference (except in tweaking formulas in minor formal details) which one you use. This has not prevented some books from trying to justify Fisher's choice with bogus arguments. One absolutely galling one (in a widely used introductory statistics textbook) states outrageously that it is always true that S² < σ². This is ridiculous of course (what is worse, the "proof" of this statement is based on a small ad hoc example; I'll be happy to produce a similarly irrelevant example where S² > σ²). In fact, since both empirical variances are never negative but can, in principle, take large values, their distributions (which are practically identical) are not symmetric, and, generally, the median will be smaller than the expectation (this can be checked explicitly in the Gaussian case), so that both are more likely to underestimate the true variance[17].

[16] Of course, you will also notice that, as soon as we are looking at samples that are not very small, the difference between dividing by n or by n − 1 is almost irrelevant, especially when data comes from experiments where the precision is limited.

[17] Another, generally very careful, textbook states erroneously that the fact that s² is unbiased means that half the time it will underestimate and half the time overestimate the true variance. Confusing the median with the expectation is quite a slip for an otherwise carefully written text.
5 Other sampling methods
In our course, we will always assume that the observations constitute a simple random sample, but in real life other, formally more elaborate methods are also in use. None of the techniques we will look at applies directly to samples obtained in these other ways, some of which are actually useless for any rigorous analysis. To be sure, generating a proper simple random sample can be very challenging, if not almost impossible, but alternate methods face as many, if not more, challenges.
5.1 Simple random sampling in practice

Different experiments may require different approaches. We look at two typical examples.

5.1.1 Physical measurement

This is a straightforward case, only needing care in setting up the experiment. For example, repeatedly measuring the content of a chemical in a body of water can be thought of as producing independent identically distributed results, if it is reasonable to think that there is no reason for the concentration to be significantly higher in one spot or another, and if we make sure that our instrument is reset every time, so that it has no "memory" of the previous measurement. Since the variations in measurements may be ascribed to many small effects, it would be reasonable to assume that the measurements come from a normal distribution, and we may be looking at determining one or both parameters (µ and σ²) of this distribution.
5.1.2 Polling

This is more complex, even if it is the situation you are most likely to have heard about. Taking a simplified model, assume you are wondering about the opinion of a population, let's say of the United States, about something. You cannot interview every single individual in America (not to mention that there are continual changes as individuals depart, die, are born, immigrate[18]), so you choose a sample, that is, a limited number of individuals, trying to extrapolate the outcome of your inquiry to the whole population. In practice, serious polls will interview at least 500 to 1000 individuals. How could you choose the people to interview so that the simple random sample model could be reasonably applied?

[18] The United States has a census every 10 years, when the Census Bureau tries to count every individual living in the country on a given day. This is, obviously, not a simple random sample, and it presents its own specific statistical problems. The general census does not go beyond counting people, but, in order to gain more detailed information, a limited number of households are asked to complete a "long form", and this should be a random sample, or, possibly, a stratified sample, as described below.
The standard model for this, taken from lotteries, is to think of the population as a huge collection of balls, marked 1 for "I like" and 0 for "I don't like"[19]. You mix all the balls and take 500 or 1000 out of the 350 million+ balls. If the extraction is done properly (say, the balls have been mixed thoroughly), it is reasonable to assume that if the proportion of 1s is p, then there is a probability p of picking a 1. The total number of balls extracted is a tiny proportion of the whole, so that, even if you make sure not to risk picking the same ball twice, the fact that on the second, third... extraction the proportion of 1s and 0s has slightly changed (because of the balls you already took away) makes no measurable difference.

This implies that you can look at your total result as a binomial experiment with parameters n (500, 1000, or whatever your sample size is) and p. The sample mean, X̄, has expectation p and variance p(1 − p)/n, but, more importantly, due to the CLT, it will be approximately normally distributed (provided that p was not too close to 0 or 1 at the start).
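A minimal sketch of this urn model in code: the true proportion p below is made up, the poll is simulated, and the CLT-based variance p̂(1 − p̂)/n gives the familiar 1.96-standard-deviation interval that polls report as the margin of error.

```python
import math, random

random.seed(5)
p_true, n = 0.52, 1000   # assumed true proportion and poll size

# Each interview is a Bernoulli draw from the (well mixed) urn.
sample = [1 if random.random() < p_true else 0 for _ in range(n)]
p_hat = sum(sample) / n

# CLT: p_hat is approximately N(p, sqrt(p(1-p)/n)); 1.96 sigma covers ~95%.
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"{p_hat:.3f} +/- {margin:.3f}")   # roughly 0.52 +/- 0.03
```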
What problems do you face in applying this model? Many, and growing. First, you do not have 350+ million balls to shuffle, so you need to find a method for choosing people "at random", with randomness comparable to what you get by extracting balls from a well shuffled urn. The traditional method relied on the fact that the vast majority of households were listed in telephone directories, and methods were developed to pick numbers at random from these directories. It was not a perfect method, but it proved to be good enough for most cases. Recently, the massive switch to cell phone service has made this method less and less reliable (in particular, since having no land line is particularly frequent in the younger generations, sampling this way creates an unwanted bias towards the older population). This issue is very high on the mind of pollsters and is thought to be at least a factor behind some massive failings in forecasts in recent elections in many countries.

There are other issues with our model that are intrinsic, and not due to sociological trends. A big one is the fact that not everybody picked will answer: either they will not be home when called, or they just hang up right away. How to handle these "no shows" requires more math, and many polling businesses will adopt their own special method.

A more subtle problem is, of course, that people will lie, especially if the question is deemed to be sensitive. That's something that requires even more creativity to be taken into account. In political polls, an extra issue is that in countries like the US, where many people do not vote at all, sampling the "likely to vote" segment is another delicate problem, as relying on self-description as a likely voter is not necessarily a reliable method (another variation on the "lying" problem).

[19] Of course, most polls have more than two possible answers, but the extension is easy, only requiring a little more math.
5.2 Other methods

5.2.1 Stratified sampling

It may be more practical to divide the general sample into sub-samples by geography, social features, and so on. One reason may be practical: rather than extracting 1000 addresses out of 350 million, we might prefer to extract smaller numbers from more limited groups. Another concern is that sampling from the whole population may easily miss some groups (minorities, low income, special communities) altogether, though they may be numerous enough and specific enough to affect the overall results. This is not so much done to make the sample more "representative" (picking representative, "typical" samples is pre-modern statistics), but rather as an attempt to reduce the uncertainty intrinsic in sampling by controlling it, splitting it between groups (this sampling is connected to variance-reduction methods).

The difficulty in using this method is that to merge the separate polls, you need information on how the various sub-groups combine into the population; that is, you need precise sociological data on the various sub-groups.
5.2.2 Systematic Sampling, often called "every k-th"

In quality control on a production line, to check if the products live up to specs, one can start by choosing an integer k > 0, then pick a first item at random, and after that pick the k-th, 2k-th, ... items following it for inspection. This procedure will produce a reasonable candidate for a simple random sample only if some specific assumptions about the production line are satisfied.

Of course, if the products in the line have a quality that can be thought of as independent and identically distributed, it doesn't matter how you pick items (you could just as well choose k = 1). That would be quite a gutsy assumption. In general, to make this procedure work, you still have to assume a specific (even if fairly common) model for your line (in technical terms, it must look like a stationary random process with short-term correlations; we skip the rigorous definition). This can often be reasonable, but if the line exhibited, for example, a periodic creation of defects, this procedure would fail completely. In other words, its validity depends on a good understanding of how possible defects might enter the production process.
5.2.3 Block (Cluster) sampling

Again, in quality control, production items might be lumped in blocks, and one (or more) blocks chosen at random for quality testing. Once again, the validity of this procedure depends on strong assumptions on how these blocks are formed. Otherwise, this can easily become a special case of convenience sampling, as described momentarily.
5.2.4 Convenience sampling

This is not even a statistical sampling method, and should not be listed together with the previous methods (but it is in most textbooks, so here it is). It consists in choosing a sample that is right away available, instead of picking one at random. For example, you could walk to a mall and ask the people you come across there. Variations are self-selected samples, as in call-in or Internet polls, where the respondents volunteer their answers, which is a variation on convenience sampling. A famous example was the prediction that Dewey would win the 1948 presidential election over Truman. The poll was the result of responses by the readers of a high end magazine, a typical self-selected sample, taken from a specific small minority of the population. Any statement based on these procedures can be dismissed outright, since it holds no more content than your personal opinion would. In general, convenience sampling will practically always (not simply often) produce biased, hence useless, results.

It should be noted that some standard procedures, especially in the social and medical sciences, may lead, in a subtle way, to what is essentially convenience sampling. For example, many social and psychological studies rely on samples constructed from the student population of the institution(s) involved in the experiment. The fact that many of these experiments could not be reproduced has been ascribed (among other explanations) to different student populations, often from different countries, having significantly different responses to the circumstances of the experiment. Similarly, in medical trials, the individuals are volunteers, hence self-selected. While efforts are made to limit the bias that this may produce, there is always the possibility that failure to replicate the results of a trial may be due to the convenience-sampling risk that is implicit in these methods.