FRESHMAN SEMINAR
Professor Richard Wilson
Fall Term 2004
Notes on Statistics
You do not need to know much statistics to understand the matters in this course, but it is
important to have a very clear understanding of the following two fundamental concepts:
Randomness
Statistical independence
You might want to read the introductory chapter of any statistics textbook. Often
recommended is Probability and Statistics for Engineers and Scientists, by Walpole and Myers
(Macmillan). See also the Xeroxed notes, Yardley Beers: "Theory of Error".
I will merely put a few notes here on what you may need during the term. (We will not
need them all at once.) We also need to recognize the notation and jargon that statisticians use
among themselves. But please, statisticians: when talking to others, try to avoid the jargon.
Seminar example:
In order to give a feeling for how one can combine random quantities to get a (fairly)
definite result, we will start with a simple seminar exercise on random numbers.
Each member of the seminar will get a (different) sheet of 100 digits chosen at random
from the numbers 0 to 9. You will be asked to calculate the means of the rows, the columns, and
the total, and the standard deviation thereof. Then we will compare everyone's mean in class
and see whether the class mean is within the expected range.
This brings up at once the concept of independence. The fact that the class mean is
closer to the limit of 4.5 (than any one student's mean is likely to be) depends upon the fact that
everyone has a different list of random numbers, independent of the list of every other student.
If everyone had the same list, the class mean could well be further from 4.5.
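If you would rather let a computer deal the sheets first, here is a minimal sketch in Python (the class size of 15 and the seed are arbitrary assumptions; Python's random module stands in for the printed sheets of digits):

```python
import random
import statistics

# Simulate the exercise: each student gets a sheet of 100 random digits
# (0-9) and reports the sheet mean; we then look at the class mean.
random.seed(1)                # fixed seed so a run is reproducible
N_STUDENTS = 15               # arbitrary class size

sheet_means = []
for _ in range(N_STUDENTS):
    sheet = [random.randint(0, 9) for _ in range(100)]
    sheet_means.append(statistics.mean(sheet))

# A single digit has mean 4.5 and standard deviation sqrt(8.25) ~ 2.87,
# so one sheet mean scatters by ~0.29 and the class mean by ~0.07.
print([round(m, 2) for m in sheet_means])
print("class mean:", round(statistics.mean(sheet_means), 3))  # near 4.5
```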
Two years ago I made the mistake of getting the random numbers from the standard
EXCEL program. When we went through this exercise we found that they were not random!
(Although they appeared to be so superficially.) Anyone who has the enthusiasm can go to the
EXCEL program and prove this for him/herself. I now have an add-in for EXCEL which
does a better job.
Calculations of probability
The numerical estimates of probability are derived from the number of different ways of
choosing the group concerned from the total sample, e.g.:
A poker hand has 5 cards; what is the probability of getting 3 Jacks and 2 aces?
We address this by first addressing the partial question: what is the number of ways of
getting two out of the four aces? The total number of ways of arranging four aces is four
factorial, written as 4!. But the number of ways of arranging the two aces picked is 2!,
and the number of ways of arranging the other two aces is also 2!.
Then the number of ways of choosing 2 aces out of the 4 aces is given by the notation:

\binom{4}{2} = \frac{4!}{2!\,2!} = 6

Likewise, the number of ways of getting 3 Jacks out of 4 is

\binom{4}{3} = \frac{4!}{3!\,1!} = 4
Therefore, there are 6 × 4 = 24 ways of having 2 aces and 3 Jacks. The total number of 5-card
poker hands, each one of which is equally likely, is

N = \binom{52}{5} = \frac{52!}{5!\,47!} = 2{,}598{,}960

Therefore the probability of getting 2 aces and 3 Jacks is

\frac{24}{2{,}598{,}960} = 0.9 \times 10^{-5}
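The same count can be checked in a few lines of Python (math.comb supplies the binomial coefficients):

```python
from math import comb

ways = comb(4, 2) * comb(4, 3)   # 6 * 4 = 24 ways of 2 aces and 3 Jacks
hands = comb(52, 5)              # 2,598,960 equally likely 5-card hands
print(ways / hands)              # 9.23e-06, i.e. about 0.9 x 10^-5
```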
Algebraic Presentation
We sometimes express the results in a special algebraic notation or geometric diagram (a
Venn diagram).
An experiment (pulling candies out of a box) can lead to event A (mint candies, for
example) or event B (toffees). We sometimes use a notation:
P(A)        probability of an event A
P(B)        probability of an event B
P(A ∪ B)    union of the two events; A ∪ B is read "A or B"
P(A ∩ B)    intersection of the two events; A ∩ B is read "both A and B"
Geometric Presentation
Sometimes the overlap of probabilities is described in a Venn diagram. I won't draw
one here; transfer of files makes a mess of them.
Additivity:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
e.g. If the probability that a polluting industry locates in Munich is 0.7, the probability that
it will locate in Brussels is 0.4, and the probability that it will be in either area or both is 0.8, what
is the probability that it is in both? In neither?
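Working it through with the additivity rule gives:

P(A ∩ B) = P(A) + P(B) − P(A ∪ B) = 0.7 + 0.4 − 0.8 = 0.3
P(neither) = 1 − P(A ∪ B) = 1 − 0.8 = 0.2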
If A & B are independent there is no overlap in the Venn diagram and
P(AB) = 0.
Conditional probability
We often want to discuss a conditional probability: the probability of an event B given
that the event A occurred,

P(B|A) = P(A ∩ B) / P(A)    (P(A) ≠ 0)
e.g. Probability that the US Air Shuttle departs on time: P(D) = 0.83
Probability that the US Air Shuttle arrives on time: P(A) = 0.72
We get these from statistical records at each end. The probability of both on-time
departure and arrival is 0.78. Then

P(A|D) = P(D ∩ A) / P(D) = 0.78 / 0.83 = 0.94
Now in this case the event A is not independent of the event D. If an event A is
independent of an event B,

P(A|B) = P(A)
For example:
The probability of a person getting leukemia from working 10 years at an average
workplace benzene level of 1 ppm is 0.1%.
The probability of anyone getting leukemia in his lifetime is 0.7%.
If you know that a person died of leukemia and worked 10 years at 1 ppm of benzene in
the workplace, what is the likelihood that his leukemia was caused by benzene?
P(B|A) = 0.1% / 0.7% ≈ 0.14
This is sometimes called the Probability of Causation.
Distributions
[Figure: histogram of the probability function f(x), and the corresponding cumulative probability function F(x).]
f(x) is the empirical distribution, subject to:

\sum_x f(x) = 1

Then P(X = x) = f(x). We can go to a continuous distribution:

P(x < X < x + dx) = f(x)\,dx

P(a < X < b) = \int_a^b f(x)\,dx

\int_{-\infty}^{+\infty} f(x)\,dx = 1
The Cumulative Function

F(x) = \int_{-\infty}^{x} f(t)\,dt

f(x) = \frac{dF(x)}{dx}
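As an illustration, a minimal Python sketch that builds an empirical f(x) and its cumulative F(x) from a small, arbitrary data set:

```python
from collections import Counter
from itertools import accumulate

data = [2, 3, 3, 5, 5, 5, 7, 8, 8, 9]        # arbitrary sample values
counts = Counter(data)
xs = sorted(counts)

f = [counts[x] / len(data) for x in xs]       # empirical f(x); sums to 1
F = list(accumulate(f))                       # F(x): running sum of f up to x

for x, fx, Fx in zip(xs, f, F):
    print(x, round(fx, 2), round(Fx, 2))      # F reaches 1.0 at the last x
```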
Parameters of a distribution
We have a set of numbers x_i, where i varies from 1 to N. To each is attached a weighting
factor f_i (f_i > 0):

\sum_{i=1}^{N} f_i = N,  the total weight

\bar{x} = \sum_{i=1}^{N} f_i x_i / N,  the arithmetic mean

GM = \sqrt[N]{\prod_{i=1}^{N} (x_i)^{f_i}},  the geometric mean

(\prod_{1}^{N} is a symbol for the product of all quantities 1 to N.)

\ln(GM) = \sum_{i=1}^{N} f_i \ln(x_i) / N

[We note that ln(GM) is the arithmetic mean of the quantities ln x_i.]

mode = the value of x_i with the greatest weight f_i
median = for unweighted items, the value of x exceeded by 1/2 the x_i in the list.
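For the unweighted case (all f_i = 1), these parameters are one-liners in Python; a minimal sketch with arbitrary values:

```python
import math
import statistics

x = [1.0, 2.0, 2.0, 4.0, 8.0]                  # arbitrary sample, f_i = 1

print(statistics.mean(x))                      # arithmetic mean: 3.4
print(math.prod(x) ** (1 / len(x)))            # geometric mean: ~2.64
print(math.exp(statistics.fmean(math.log(v) for v in x)))  # same, via ln
print(statistics.mode(x))                      # mode: 2.0
print(statistics.median(x))                    # median: 2.0
```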
The variance describes the deviation from the expected value. In problem 1-1 we can
define the average of an infinite set of measurements to get the expected value E. We define
the variance of a set of numbers x_i as the sum of squares of the deviations of the individual
measurements about the average of an infinite set of such measurements, of which the x_i are a
small subset:

\sigma^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 / N

The root mean square (rms) deviation, or standard deviation, is the square root of this.
An estimate of the standard deviation can be obtained from the data. This is similar to
the rms deviation of a small set, but has N−1 in the denominator, not N. This is because we
cannot get a deviation from one measurement; we need at least two.

s = \sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 / (N-1)}

The difference between N−1 and N is in practice small, and most people (including me) are
sloppy about the distinction, but it is important methodologically.

Expanding (x_i - \bar{x})^2 and noting that \sum_{i=1}^{N} x_i = N\bar{x}, we find

s = \sqrt{\frac{\sum x_i^2 - N\bar{x}^2}{N-1}}
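A quick numerical check of the two forms of s, and of the N versus N−1 distinction (the measurement values are arbitrary):

```python
import statistics

x = [9.8, 10.1, 10.0, 9.9, 10.3]               # arbitrary measurements
N = len(x)
xbar = sum(x) / N

s_direct = (sum((xi - xbar) ** 2 for xi in x) / (N - 1)) ** 0.5
s_expanded = ((sum(xi * xi for xi in x) - N * xbar ** 2) / (N - 1)) ** 0.5
print(s_direct, s_expanded)        # the two forms agree (~0.192)

print(statistics.stdev(x))         # library estimate, N-1 denominator
print(statistics.pstdev(x))        # population form, N denominator
```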
Parameters of a distribution
We start with a set of numbers x_i where i has a range 1 to N. To each x_i we can assign a
weight f_i:

\sum_{i=1}^{N} f_i = N,  the total weight

If we consider a continuous distribution of numbers we can turn the sum into an integral and get

\int_{-\infty}^{+\infty} f(x)\,dx,  the total weight

We will discuss mostly continuous distributions. With f(x) normalized to unit total weight,

\mu = \bar{x} = E(x) = \int_{-\infty}^{+\infty} x\,f(x)\,dx,  the arithmetic mean
The mean of a function g(x) of x is
\mu_{g(x)} = E(g(x)) = \int_{-\infty}^{+\infty} g(x)\,f(x)\,dx
Joint Distribution
The probability of a number lying between x and x + dx AND between y and y + dy is

f(x, y)\,dx\,dy,  with  \int\!\!\int f(x, y)\,dx\,dy = 1

The marginal distributions are

g(x) = \int f(x, y)\,dy

h(y) = \int f(x, y)\,dx

A conditional probability distribution:

f(y|x) = \frac{f(x, y)}{g(x)}

If f(x|y) does not depend on y (the variables are independent), then

f(x|y) = g(x)  and  f(x, y) = g(x)\,h(y)
The normal, or Gaussian, distribution
If a large number of independent measurements are made of the same quantity, the
probability of an individual measurement lying between x and
x + dx is P(x) dx; if we also assume that P(x) is symmetric about x = 0, the distribution of these
measurements follows the "normal curve". This is sometimes called the Gaussian curve, in
honor of the 19th-century German mathematician, physicist and geodesist Carl Friedrich Gauss.
This curve, and a picture of Gauss, can be found on the German 10 mark banknote.
n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{\sigma}\,e^{-(x-\mu)^2 / 2\sigma^2}

Note that the factor \frac{1}{\sqrt{2\pi}} \cdot \frac{1}{\sigma} is in the equation. This enables the distribution to be
normalized to unity:

\int_{-\infty}^{+\infty} n(x)\,dx = 1

(\mu is called the expectation or mean of the distribution, and \sigma^2 is called the variance.)
Standard normal distribution
Put:

z = \frac{x - \mu}{\sqrt{\sigma^2}}

then

n(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}

This is the standard normal distribution. This, and in particular

\int_0^x n(z)\,dz,

is tabulated in various texts.
For the distribution of the sum of two quantities, each normally distributed, we also have
a normal distribution, with standard deviation given by:

\sigma^2 = \sigma_1^2 + \sigma_2^2

One might, for example, measure the distance between two points:

L = L_1 - L_2

The distribution in L will be normal with

P(L) = \frac{1}{\sqrt{2\pi}}\,\frac{1}{\sigma}\,\exp\!\left(-\frac{(L - \overline{L_1 - L_2})^2}{2\sigma^2}\right)

where

\sigma^2 = \sigma_1^2 + \sigma_2^2 \approx 2\sigma_1^2 \quad \text{if} \quad \sigma_1 \approx \sigma_2

and \sigma_1 is the standard deviation of the measurements of L_1.
[As an exercise, prove these last two statements.]
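A simulation is not the requested proof, but it makes the result plausible; a minimal sketch (the means and σ values are arbitrary):

```python
import random
import statistics

random.seed(3)
s1, s2 = 0.3, 0.4                 # standard deviations of L1 and L2
L = [random.gauss(5.0, s1) - random.gauss(2.0, s2) for _ in range(100_000)]

print(statistics.mean(L))         # ~3.0, the mean of L1 - L2
print(statistics.variance(L))     # ~0.25 = 0.3**2 + 0.4**2
```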
Attached is another set of notes on the behavior of log-normal distributions. Log-normal
distributions arise when the logarithm of a quantity is normally distributed. They are
very widely used in exposure and risk analysis.
Binomial distribution
If p is the probability of success in a trial, and q = 1 − p is the probability of failure, what is the
distribution of the number of successes x in n trials?

b(x; n, p) = \binom{n}{x} p^x q^{n-x}

Mean:  \mu = np

Variance:  \sigma^2 = npq = q\mu
Example: We have a bioassay with 100 rats. If p = 0.2 is the probability of getting cancer in a
lifetime, then

\mu = 20

\sigma^2 = 0.8 \times 20 = 16

The coefficients \binom{n}{x} are the coefficients in a binomial expansion. Hence the term Binomial
Distribution.
Limits as n \to \infty: we get the standard normal distribution with

z = \frac{x - np}{\sqrt{npq}}

[unless p = 0 or q = 0].
WE WILL USE THIS APPROXIMATION (until we go to a computer program)
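A minimal sketch comparing the exact binomial for the rat bioassay with this normal approximation (the x values printed are arbitrary):

```python
from math import comb, exp, pi, sqrt

n, p = 100, 0.2                   # the bioassay: 100 rats, p = 0.2
q = 1 - p
mu, var = n * p, n * p * q        # mu = 20, sigma^2 = 16

def b(x):                         # exact binomial probability
    return comb(n, x) * p**x * q**(n - x)

def approx(x):                    # normal curve with the same mu, sigma
    return exp(-(x - mu)**2 / (2 * var)) / sqrt(2 * pi * var)

for x in (12, 16, 20, 24, 28):
    print(x, round(b(x), 4), round(approx(x), 4))
```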
Poisson distribution
The probability distribution of x, the number of independent outcomes in a time t, is

P(x; \lambda t) = e^{-\lambda t} \frac{(\lambda t)^x}{x!}

where \lambda is the average number of outcomes in unit time.

Example: If the average rate of oil tankers entering NY Harbor is 10/day, yet we can only cope
with 15 in any one day, how often are we in trouble?

P(x > 15) = 1 - P(x \le 15) = 1 - \sum_{x=0}^{15} P(x; 10) = 1 - 0.9513 = 0.0487/day

P(x; 10) can be read from the tables in the books or calculated. (Calculate it once in
your lives!)
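Calculating it once, as instructed, takes a few lines of Python:

```python
from math import exp, factorial

lam_t = 10                        # average arrivals per day
P_up_to_15 = sum(exp(-lam_t) * lam_t**x / factorial(x) for x in range(16))

print(round(P_up_to_15, 4))       # 0.9513
print(round(1 - P_up_to_15, 4))   # 0.0487: in trouble about 1 day in 20
```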
Exercise:
1. Prove the mean of the distribution is \lambda t.
2. Prove the variance of the distribution is \lambda t.

If n \to \infty and p \to 0, keeping np = \mu constant, then

b(x; n, p) \to P(x; \mu)
Goodness of Fit
How often do we get a deviation greater than \xi\sigma? \xi is called the Critical Ratio.
For a normal distribution, the area above \xi is

\int_\xi^\infty e^{-z^2/2}\,dz

or, as a fractional area,

\int_\xi^\infty e^{-z^2/2}\,dz \Big/ \int_{-\infty}^{+\infty} e^{-z^2/2}\,dz

This is so important that there are tables in all books of "areas under the normal curve".
BUT BEWARE: are you asking for the area above \xi as a fraction of the whole distribution,

\int_{-\infty}^{+\infty},

or as a fraction of the half distribution,

\int_0^{\infty}?

It all depends on the problem. The distributions are called one-sided (one-tailed) versus
two-sided (two-tailed) distributions.

\xi    Fraction of area below \xi (one-sided)    Fraction within ±\xi (two-sided)
1      0.84                                      0.68
2      0.98                                      0.95
3      0.9987                                    0.9974
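The table can be reproduced from the error function; a minimal sketch using Python's math.erf:

```python
from math import erf, sqrt

def phi(z):                        # cumulative standard normal
    return 0.5 * (1 + erf(z / sqrt(2)))

for xi in (1, 2, 3):
    one_sided = phi(xi)            # fraction of area below xi
    two_sided = phi(xi) - phi(-xi) # fraction within +/- xi
    print(xi, round(one_sided, 4), round(two_sided, 4))
# 1 0.8413 0.6827
# 2 0.9772 0.9545
# 3 0.9987 0.9973
```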
We use words to describe the tail: "upper 95th percentile".
(\xi = 2 two-sided; \xi = 1.645 one-sided)
Exercise:
If you perform a large number of bioassays with 100 rats each and call them statistically
significant if P<0.05, how many times do you expect to be wrong?
Statistical Independence
The importance of the concept of statistical independence cannot be overstated. There are two
major ways in which mistakes are made: one I loosely call the Feynman trap; the other, the
Tippett trap.
Posing a problem for his undergraduate class, Richard Feynman, the Nobel physicist,
noted a car in the parking lot with a particular license plate, ARW357. One can easily assess
the probability of seeing this license plate by multiplying the independent probabilities of seeing
each number (1/10) and each letter (1/26). The answer is one in eighteen million. Yet
Feynman had just seen the license plate, so it had unity probability! Since Feynman asked the
question when he already knew the answer, the statistical calculation was invalid. This point
has been raised, less dramatically, by many others. See D.L. Goodstein, "Richard P. Feynman,
Teacher", Physics Today, pp. 70-75 (Feb. 1989).
We will discuss in the seminar various ways this appears in disguised forms in practice.
Tippett, a famous English statistician, pointed out that if one sets a level of significance
p < 0.05, and then looks at twenty separate studies, on average one will appear significant at
this level by chance alone.