Probabilities and distributions
Peter Shaw
Introduction




The study of probabilities goes back to a Renaissance dice game, when the Chevalier de Méré posed the following puzzle. Which is more likely: (1) rolling at least one six in four throws of a single die, or (2) rolling at least one double six in 24 throws of a pair of dice? The mathematician Fermat was eventually involved, and statistical analysis was born.
The key element here is the notion of randomness, inherent in the use of dice.
Latin ‘Alea’ = dice, gives French ‘Aléatoire’ = random.
(The answer is that getting at least one six in 4 throws is more likely, but only by a small margin: p = 0.5177 vs p = 0.4914.)
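The two probabilities can be checked with the complement rule: P(at least one success) = 1 - P(no success on any throw). The sketch below is only an illustration; the helper name `p_at_least_one` is invented here and is not part of the original slides.

```python
from fractions import Fraction


def p_at_least_one(p_single: Fraction, throws: int) -> Fraction:
    """Complement rule: 1 minus the chance of failing on every throw."""
    return 1 - (1 - p_single) ** throws


# (1) at least one six in 4 throws of one die: success chance 1/6 per throw
p_one_six = p_at_least_one(Fraction(1, 6), 4)
# (2) at least one double six in 24 throws of two dice: 1/36 per throw
p_double_six = p_at_least_one(Fraction(1, 36), 24)

print(f"at least one six in 4 throws:         {float(p_one_six):.4f}")
print(f"at least one double six in 24 throws: {float(p_double_six):.4f}")
```

Exact fractions are used so that no rounding creeps in before the final conversion to a decimal.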
You never get a straight answer…
The notion of probability is invoked in situations where outcomes are uncertain, or where measurements are subject to detectable levels of error.
In practice this is most situations most of
the time!
The media keep looking to scientists for
absolute answers:



Is beef absolutely safe?
Are we sure that the climate warming is
due to CO2?
Anyone who says “Yes” is not a scientist.
The correct answer is “very likely”. You
cannot get absolute answers, but you
can get estimates of likelihood =
probability.
Roll 2 dice..
There are 36 possible outcomes
Only 1 combination adds to 2, so
P(2) = 1/36
Total score for each combination (first die across, second die down):
       1    2    3    4    5    6
  1    2    3    4    5    6    7
  2    3    4    5    6    7    8
  3    4    5    6    7    8    9
  4    5    6    7    8    9   10
  5    6    7    8    9   10   11
  6    7    8    9   10   11   12
What is the most likely score, and why? P = ?
The distribution of dice scores
Note it is symmetrical and peaks at 7 with a probability of 6/36 = 1/6 = ?
(ignoring the rule about doubles that applies in backgammon)
[Figure: Likelihood of 2 dice score sums. x-axis: Score (2 to 12); y-axis: Number of ways (out of 36).]
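The distribution of two-dice scores can be rebuilt by listing all 36 combinations and counting how many give each total. This is a minimal sketch; the variable names are chosen here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 (die 1, die 2) combinations give each total.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
total = sum(ways.values())  # number of equally likely outcomes

for score in range(2, 13):
    p = Fraction(ways[score], total)
    print(f"{score:2d}  {'#' * ways[score]:<6}  {ways[score]} ways, P = {p}")

most_likely = max(ways, key=ways.get)
print("Most likely score:", most_likely)
```

The row of `#` characters gives a rough text version of the bar chart: it rises to a peak at 7 and falls away symmetrically.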
You rolled double 6 –
you must be cheating!


In real life we often have to decide whether an event is a random fluke, or indicates a genuine pattern.
If I rolled 6 sixes, would I have cheated? Actually it is very likely, as 6 sixes would occur by chance only 1 time in 6*6*6*6*6*6 = 46,656. But it COULD be due to chance.
We use probability as a
tool in decision making.




The field of inferential analysis relies on
finding an estimate of the probability for
statements being true.
Statement 1: “Soil 1 is more polluted than soil 2”.
Statement 2: “Soil 1 is exactly as polluted as soil 2; any observed differences are due to chance”.
If you find p(Statement 2) = 1 in a million,
you judge the 2 soils to differ.
A second example, from medicine:
Statement 1: “Patients treated with compound X have (e.g.) lower blood sugar levels than untreated patients.”
Statement 2: “Patients treated with compound X do not differ from untreated patients; while there may be measurable differences, these are due to chance alone.”
If you find p(Statement 2) = 1 in a million, you judge the
2 groups of patients to differ, implying that the
compound is having some detectable effect.
(Would this be absolute proof of its efficacy?)
Normal Distribution
Also known as the Gaussian distribution, after Carl Friedrich Gauss.
This is the expected distribution when many randomly distributed factors add together. It is found in distributions of body height/weight, chemical concentrations in soil/air/water, and many other situations.
[Figure: the symmetrical bell-shaped curve. x-axis: Size of value; y-axis: Number of observations. Mean and median are about the same.]
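The claim that a bell shape appears when many random factors add together can be tried out with a small simulation. The sketch below is an assumption-laden illustration, not part of the original slides: each observation is the sum of 12 independent uniform random numbers, and the seed, the number of factors and the number of observations are all arbitrary choices.

```python
import random
import statistics

random.seed(1)       # fixed seed so the sketch is repeatable
N_FACTORS = 12       # assumed number of independent factors per observation
N_OBS = 20_000       # assumed number of simulated observations

# Each observation is the sum of N_FACTORS uniform(0, 1) random factors.
values = [sum(random.random() for _ in range(N_FACTORS)) for _ in range(N_OBS)]

mean = statistics.fmean(values)
median = statistics.median(values)
sd = statistics.pstdev(values)
within_1sd = sum(abs(v - mean) <= sd for v in values) / N_OBS

print(f"mean = {mean:.2f}, median = {median:.2f}, sd = {sd:.2f}")
print(f"fraction within 1 sd of the mean = {within_1sd:.2f}")
```

Although each individual factor is flat (uniform) rather than bell-shaped, the sums should come out roughly symmetrical, with mean and median close together and about two thirds of the values within one sd of the mean.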
Carl Friedrich Gauss
30/4/1777 – 23/2/1855
The Gaussian distribution was one of the many deeply significant mathematical discoveries made by Carl Gauss, who was probably the greatest mathematician in history.
At the age of 7, when he started school, he was asked (by an exasperated tutor who wanted to put this little upstart in his place) to add up the numbers 1+2+3…+99+100. Little Carl promptly and contemptuously wrote down 5050 on his slate and threw it onto the teacher’s desk!
How we think he did it:
1 + 100 = 101
2 + 99 = 101
3 + 98 = 101
Etc
There are 50 such pairs: 50*101 = 5050
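The pairing trick generalises: the numbers 1 to n form n/2 pairs that each add up to n + 1, giving n(n + 1)/2. A short check, with the function name `gauss_sum` invented here for illustration:

```python
def gauss_sum(n: int) -> int:
    """Sum of 1..n by the pairing argument: n/2 pairs, each totalling n + 1."""
    return n * (n + 1) // 2


# Compare the shortcut with adding the numbers one by one.
brute_force = sum(range(1, 101))
print(gauss_sum(100), brute_force)
```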
You only need 2
numbers to define a
Normal curve:
The mean μ
The standard deviation σ
Any observation in a dataset
can be re-coded in terms of
how many standard
deviations away from the
mean it lies
A powerful universal principle:



The Normal distribution is immensely
useful because it is universal: The same
shape describes human height,
hardness of stones, strength of winds…
The way to convert any arbitrary set of
data into the universal distribution is to
recode as follows:
Convert each observation into a number
telling you how many s.d.s it is away
from the mean.


This is called a Z score (I don’t know
why):
Z_i = (X_i - μ) / σ
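As a sketch of the recoding step, the snippet below converts a handful of values into Z scores. The finger lengths are hypothetical numbers invented for illustration, and the data are treated as a whole population (so `statistics.pstdev` is used for σ); with a sample standing in for a larger population, `statistics.stdev` would be the usual choice.

```python
import statistics

# Hypothetical finger lengths in mm, invented purely for illustration.
lengths = [68, 72, 75, 70, 80]

mu = statistics.fmean(lengths)       # mean
sigma = statistics.pstdev(lengths)   # population standard deviation

# Z score: how many standard deviations each value lies from the mean.
z_scores = [(x - mu) / sigma for x in lengths]

for x, z in zip(lengths, z_scores):
    print(f"{x} mm -> Z = {z:+.2f}")
```

After recoding, the Z scores themselves have a mean of 0 and a standard deviation of 1, whatever units the original data were in.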
And the point of this?




Is that you can look up Z scores in
tables, confident in the knowledge that:
c. 68% of the points will lie between Z = -1.0 and Z = +1.0 (i.e. within 1 sd of the mean)
c. 95% of the points lie within ±2 sd of the mean
99.7% of the points lie within ±3 sd of the mean
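Instead of printed tables, the same areas can be obtained from Python's standard library. A minimal sketch using `statistics.NormalDist` (available from Python 3.8), whose default is the standard normal with mean 0 and sd 1:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1

# Fraction of a normal population lying within k standard deviations.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, frac in coverage.items():
    print(f"within ±{k} sd: {100 * frac:.1f}%")
```

This gives roughly 68.3%, 95.4% and 99.7% for 1, 2 and 3 standard deviations.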
We’ll try this out!



Measure the length of your left
index finger, in mm.
I’ll enter a subset into the PC, and
we’ll see whether a Gaussian
curve emerges.
Given the mean + sd, you work out
your own Z score!
You should know:
That the area under the standard normal curve corresponds to probability, specifically the probability of finding an observation less than a given Z value.
The total area under the curve, from -infinity to +infinity, = 1.0
You don’t need to know:
Equation of the curve is: Y = 1/√(2π) · exp(-½Z²)
Z = 0: area above Z = 0.5, i.e. half the curve lies below the mean
Z = 1.0: area above Z = 0.1587, i.e. about 84% of the data lie below (mean + 1 sd)
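To see that area really does behave as probability, the standard normal curve can be integrated numerically. The sketch below uses a simple trapezium rule; the limits of ±8 standing in for ±infinity and the number of steps are arbitrary assumptions, and the function names are invented for illustration.

```python
import math


def density(z: float) -> float:
    """Height of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)


def area(lo: float, hi: float, steps: int = 100_000) -> float:
    """Trapezium-rule estimate of the area under the curve from lo to hi."""
    h = (hi - lo) / steps
    inner = sum(density(lo + i * h) for i in range(1, steps))
    return h * (0.5 * density(lo) + inner + 0.5 * density(hi))


total = area(-8, 8)     # ±8 sd used as a stand-in for ±infinity
above_0 = area(0, 8)    # area above Z = 0
above_1 = area(1, 8)    # area above Z = 1.0

print(f"total = {total:.4f}, above Z=0: {above_0:.4f}, above Z=1: {above_1:.4f}")
```

The area above Z = 1.0 comes out near 0.1587, so the area below it is about 0.8413.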
Applied example:
A factory making widgets can only sell those whose length is between 98 and 101 mm.
The machine makes widgets with a mean length of 100 mm and an sd of 0.7 mm.
What % of widgets are rejected as unsaleable due to size?
Convert data into Z scores:
98: (98-100)/0.7 = -2.86
101: (101-100)/0.7 = 1.43
Upper tail of distribution: area above Z = 1.43 is 0.076
Lower tail of distribution: area below Z = -2.86 is 0.002
Acceptable area (purple) = 1 - (0.076 + 0.002) = 0.922
So about 8% of widgets are rejected.
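The widget calculation can be cross-checked without tables. This sketch assumes widget lengths are exactly normal with the stated mean of 100 mm and sd of 0.7 mm, and uses `statistics.NormalDist` from the standard library; the variable names are chosen here for illustration.

```python
from statistics import NormalDist

widgets = NormalDist(mu=100.0, sigma=0.7)  # assumed distribution of lengths, mm
LOWER, UPPER = 98.0, 101.0                 # saleable range, mm

z_lower = (LOWER - widgets.mean) / widgets.stdev
z_upper = (UPPER - widgets.mean) / widgets.stdev

lower_tail = widgets.cdf(LOWER)        # too short
upper_tail = 1 - widgets.cdf(UPPER)    # too long
accepted = 1 - (lower_tail + upper_tail)

print(f"Z scores: {z_lower:.2f} and {z_upper:.2f}")
print(f"lower tail = {lower_tail:.4f}, upper tail = {upper_tail:.4f}")
print(f"accepted = {accepted:.3f}, rejected = {1 - accepted:.3f}")
```

Under these assumptions the tails are roughly 0.002 and 0.077, so a little under 8% of widgets are rejected, nearly all of them for being too long.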
Often real data don’t follow the Normal curve but are skewed. Here: organic content (LOI) in heath soils.
[Figure: histogram of LOI, with values from about 5 to 90. Mean = 29.3, Std. Dev = 27.97, N = 69.]
Try log-transforming the data.
Here are the same data after calculating the log of the numbers: not perfect, but clearly more symmetrical.
[Figure: histogram of LOGLOI, with values from about 0.63 to 1.88. Mean = 1.26, Std. Dev = .44, N = 69.]
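The effect of a log transform on skewed data can be illustrated with synthetic numbers. The sketch below does NOT use the heath soil data: it draws an invented right-skewed sample from a lognormal distribution (seed and parameters are arbitrary assumptions), takes base-10 logs (the base is also an assumption, as the slides do not state it), and compares a simple skewness measure before and after.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed data, invented for illustration only.
raw = [random.lognormvariate(3.0, 1.0) for _ in range(1000)]
logged = [math.log10(x) for x in raw]


def skewness(xs):
    """Mean cubed Z score: about 0 for symmetrical data, > 0 for a long right tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


print(f"skewness of raw data:    {skewness(raw):.2f}")
print(f"skewness of logged data: {skewness(logged):.2f}")
```

The raw values should show a strongly positive skewness, while the logged values should sit close to zero.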
How to decide about normality?
Inspect histogram + fitted normal curve.
Inspect a cumulative “P-P plot” against the predicted normal distribution.
Run the Kolmogorov-Smirnov test.
[Figures: Normal P-P Plot of LOI; Normal P-P Plot of LOGLOI. x-axis: Observed Cum Prob, from 0.00 to 1.00.]
The Kolmogorov-Smirnov test examines whether data can be assumed to come from a chosen distribution, here the normal.

One-Sample Kolmogorov-Smirnov Test
                                              LOI       LOGLOI
N                                             69        69
Normal Parameters (a, b)    Mean              29.2806   1.2603
                            Std. Deviation    27.9695   .4409
Most Extreme Differences    Absolute          .217      .086
                            Positive          .217      .080
                            Negative          -.183     -.086
Kolmogorov-Smirnov Z                          1.804     .716
Asymp. Sig. (2-tailed)                        .003      .685
a. Test distribution is Normal.
b. Calculated from data.
LOI is almost certainly NOT normally distributed (p = 0.003).
LogLOI may or may not be normal, but the test tells us that deviations from normality at least as large as its own would occur about 7 times in 10 (p = 0.685) in randomly chosen normal data.
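The link between the Kolmogorov-Smirnov Z values and the significance values in the table can be sketched with the standard asymptotic Kolmogorov series, p = 2 Σ (-1)^(k-1) exp(-2k²Z²). That the statistics package used exactly this formula is an assumption, and the function name `kolmogorov_sig` is invented here; the Z values and N are taken from the table.

```python
import math


def kolmogorov_sig(z: float, terms: int = 100) -> float:
    """Asymptotic two-tailed significance for a Kolmogorov-Smirnov Z value."""
    if z <= 0:
        return 1.0
    series = sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                 for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * series))


N = 69
# Z is the largest absolute difference scaled by sqrt(N); small mismatches
# with the table arise because the printed differences are rounded.
z_from_d_loi = math.sqrt(N) * 0.217
z_from_d_logloi = math.sqrt(N) * 0.086

p_loi = kolmogorov_sig(1.804)      # Z for LOI, as reported in the table
p_logloi = kolmogorov_sig(0.716)   # Z for LOGLOI, as reported in the table

print(f"LOI:    Z from D = {z_from_d_loi:.3f}, p = {p_loi:.3f}")
print(f"LOGLOI: Z from D = {z_from_d_logloi:.3f}, p = {p_logloi:.3f}")
```

The results agree with the tabulated significance values to within rounding. Note that the mean and sd were estimated from the same data, which tends to make this form of the test conservative.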