Practice & Communication of Science
From Probability to Distributions
@UWE_KAR
Where Does Variation Come From?

Measurements in complex systems (eg a living
creature) vary…




Between organisms, and within (eg over time)
As a result of our measurement of it (eg error)
Or sometimes as a result of experiments we do
For example, aspects of lung function…

Forced Vital Capacity (FVC) – total volume of air
you can shift in one maximal, forced exhalation


dependent on size/elasticity of lungs/thorax
Forced Expiratory Volume at 1 second (FEV1) –
volume of air you shift in first second of above

dependent on airway resistance
Thought Experiment

Measure lung function of all 7 billion people on
the planet with a very, very, very sensitive
spirometer (sub-µl)…


7 billion different sets of readings?
Variation due to innumerable influences…


Age, height, weight, gender, blood pressure, blood
glucose, hormone levels, nutrition at age 5,
nutrition at age 6, gene variant tvc15, etc, etc
Sitting, standing, air pressure, temperature,
voltage, tubing angle, experimenter tiredness,
enthusiasm, hydration, position of Mars, etc, etc
Visualising the Variation

Here are some actual readings for FVC (litres)…

2.159, 2.065, 1.518, 2.227, 2.09, 2.451, 1.871,
2.571, 2.532, 2.545, 2.538, 2.795, 2.102, 1.804,
2.432, 2.704, 2.258, 2.282, 1.663, 2.795, 2.238,
1.953, 2.382, 2.344, 2.967, 2.68, 2.413, 2.444,
1.953, 2.314, 2.15, 2.634, 2.598, 2.09, 2.641, 2.92,
2.727, 2.307, 2.76, 2.439, 2.259, 2.111, 2.58,
2.602, 2.461, 3.128, 2.241, 2.602, 3.177
We could plot each value along an x-axis…

Note how they bunch in middle


(and if very close together get displaced upward)
Frequency Distributions


Divide the x-axis up into ‘bins’ of a given width
Put each reading in the appropriate ‘bin’



In each bin, points sit on top of each other
So height of stack represents ‘frequency’ of the
range of readings represented by the bin
This frequency-distribution has a curved shape


bunching in the middle
a few extreme values
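The binning described above can be sketched in a few lines of Python. This is only an illustration: it uses the FVC readings as they are listed in this transcript, and the bin width of 0.25 L and the names `BIN_WIDTH` and `bin_counts` are assumptions chosen for the example, not something the slides specify.

```python
from collections import Counter

# FVC readings (litres) as listed on the "Visualising the Variation" slide.
readings = [
    2.159, 2.065, 1.518, 2.227, 2.09, 2.451, 1.871,
    2.571, 2.532, 2.545, 2.538, 2.795, 2.102, 1.804,
    2.432, 2.704, 2.258, 2.282, 1.663, 2.795, 2.238,
    1.953, 2.382, 2.344, 2.967, 2.68, 2.413, 2.444,
    1.953, 2.314, 2.15, 2.634, 2.598, 2.09, 2.641, 2.92,
    2.727, 2.307, 2.76, 2.439, 2.259, 2.111, 2.58,
    2.602, 2.461, 3.128, 2.241, 2.602, 3.177,
]

BIN_WIDTH = 0.25  # litres; an arbitrary choice for illustration


def bin_counts(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    return Counter(int(v // width) for v in values)


counts = bin_counts(readings, BIN_WIDTH)

# Print a sideways histogram: one star per reading in the bin.
for k in sorted(counts):
    low = k * BIN_WIDTH
    print(f"{low:4.2f}-{low + BIN_WIDTH:4.2f} | {'*' * counts[k]}")
```

The tallest stacks of stars land in the middle bins and the shortest at the two ends, which is the 'bunching' the slide describes. A different bin width changes the detail but not the overall shape.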
Distribution of a Big Dataset

The 50 readings part of a larger set of data



around 2400 readings
each point represents up to 10 readings
famous ‘Bell Curve’ (an idealised prob density)
Origin of the Bell Curve

An individual’s FVC determined by lots of things

not all influences are equally influential, but…



each will have either a positive or negative effect on
the reading
So, an individual’s FVC determined by
concerted action of countless, seemingly
random, +ve and –ve ‘nudges’ throughout their
life
For a population, the influence of the large
number of ‘random’ +ve or –ve effects on each
individual → Bell Curve

aka Gaussian or Normal Distribution
Modelling Effect of +/- Nudges

Falling balls - a simple model (quincunx)

No obstacles; ball comes to rest directly below

No ‘nudges’ left or right on the way down

An obstacle placed in its path will deflect it L or R

‘random’

Layers of obstacles → cumulative effect

eg ++-+--+-++ = ++
governed by probability
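A quincunx is easy to imitate in code. The sketch below assumes each peg deflects the ball left (-1) or right (+1) with equal probability; the number of layers, the number of balls and the function names `drop_ball` and `run_quincunx` are made up for illustration.

```python
import random
from collections import Counter


def drop_ball(layers, rng):
    """Final position of one ball: +1 for each right nudge, -1 for each left."""
    return sum(rng.choice((-1, 1)) for _ in range(layers))


def run_quincunx(balls, layers, seed=0):
    """Drop many balls and count how many come to rest at each position."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return Counter(drop_ball(layers, rng) for _ in range(balls))


LAYERS = 10
landing = run_quincunx(balls=2000, layers=LAYERS)

# With an even number of layers only even positions are reachable.
for position in range(-LAYERS, LAYERS + 1, 2):
    print(f"{position:+3d} | {'*' * (landing[position] // 10)}")
```

With zero layers every ball stays at position 0, as on the slide. With ten layers most balls finish near the centre and very few reach the edges, because there are many more sequences of nudges that nearly cancel out than sequences that all point the same way.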
Rolling Dice → Same Outcome

One die…



outcome values are 1, 2, 3, 4, 5 or 6
each equally probable (1 in 6)
distribution is…

boring!
Rolling Dice

Two dice…

outcome values are 2,3,4,5,6,7,8,9,10,11,12


each not equally probable


36 ways of making these
only 1 way to get 2 (1+1), 3 ways to get 4, etc
distribution is…

slightly less boring!
Rolling Dice

Three dice…

outcome values are
3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18


each not equally probable


216 ways of making these
27 ways to throw a 10 or 11, only 1 to get a 3 or 18
distribution is…

starting to curve
Rolling Dice

Four dice…

outcome values are
4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,
22,23,24



1296 ways of making these
each not equally probable
distribution is…

looking familiar!

Rolling Dice

24 dice…

outcome values are 24 → 144

4.73838134 × 10^18 ways of making these!
each not equally probable
distribution is…



looking very familiar!
121 ‘discrete’ outcomes
approximates the Normal Distribution (exactly so in the limit of ‘infinite’ dice)
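The counts on the dice slides can be checked by brute force. The sketch below builds up the number of ways to reach each total one die at a time; `ways_to_roll` is a name invented for this example, and fair six-sided dice are assumed.

```python
from collections import Counter


def ways_to_roll(n_dice):
    """Map each possible total of n_dice six-sided dice to its number of ways."""
    ways = Counter({0: 1})  # before any die is rolled: one way to total 0
    for _ in range(n_dice):
        new_ways = Counter()
        for total, count in ways.items():
            for face in range(1, 7):
                new_ways[total + face] += count
        ways = new_ways
    return ways


for n in (1, 2, 3, 4, 24):
    ways = ways_to_roll(n)
    print(f"{n:2d} dice: {len(ways):3d} outcomes, {sum(ways.values())} ways")
```

The total number of ways is always 6 to the power of the number of dice, and the number of distinct totals is five times the number of dice, plus one. Printing `ways_to_roll(3)[10]` gives the 27 ways to throw a 10 mentioned on the three-dice slide.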
Why Do Balls & Dice Model FVC?!!!

The models generating the ND are uniform

‘pegs’ & layers are the same; balls hit pegs one at a
time, one layer at a time; all dice are the same

The factors influencing FVC are not like that!

so why does a simple model mirror complex reality?
The Central Limit Theorem

horribly technical, but basically…

any time you have a quantity which is bumped around by a large
number of random processes, you end up with a bell curve
distribution for that quantity. And it really doesn’t matter what those
random processes are. They themselves don’t have to follow the
Gaussian distribution. So long as there’s lots of them and they’re
small, the overall effect is Gaussian (http://scienceblogs.com/builtonfacts/2009/02/05/the-central-limit-theorem-made/)
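The claim that the individual processes need not be Gaussian can be tried out directly. In the sketch below each simulated 'individual' is the sum of 200 small nudges drawn from three deliberately non-Gaussian sources (flat, skewed and coin-flip). The mix of sources, the sample sizes and the name `nudged_value` are all assumptions made for illustration.

```python
import random
import statistics


def nudged_value(rng, n_nudges=200):
    """One 'individual': the sum of many small, non-Gaussian nudges."""
    total = 0.0
    for i in range(n_nudges):
        if i % 3 == 0:
            total += rng.uniform(-1, 1)          # flat nudge
        elif i % 3 == 1:
            total += rng.expovariate(1.0) - 1.0  # skewed nudge, centred on zero
        else:
            total += rng.choice((-1, 1))         # coin-flip nudge
    return total


rng = random.Random(42)  # fixed seed so the run is repeatable
population = [nudged_value(rng) for _ in range(5000)]

mean = statistics.fmean(population)
sd = statistics.stdev(population)
within_1sd = sum(abs(x - mean) <= sd for x in population) / len(population)
print(f"mean={mean:.2f}  sd={sd:.2f}  fraction within 1 SD={within_1sd:.3f}")
```

For a true normal distribution about 68% of values lie within one standard deviation of the mean, so a fraction close to 0.68 here is one rough sign that the sums are behaving like a bell curve even though none of the ingredients do.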
The CLT/ND is everywhere!



A picture of the steps in the old Magistrates
Court in Lampeter, Wales
Each footstep → minuscule erosion
Many factors influence position of each footstep

essentially ‘random’ and conform to the CLT → ND
Uses of The Normal Distribution


Many things in biosystems normally distributed
A normal distribution is fully defined by 2 things




a mean (average value)
a standard deviation (an indication of ‘spread’)
So, the 2445 FVC measurements can be
summarised as 3.53 ± 0.92 L (n=2445)
If we know our data fits a distribution, then do
an experiment, statistics can estimate prob that
distribution has (not) changed


ie an ‘independent’ test of whether our expt worked
so distributions/stats central to scientific method!
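As a sketch of the 'mean ± standard deviation' summary, the calculation can be run on the small set of readings listed earlier in this transcript. Note the assumption: these are only the listed subset, so the result describes that subset and will not reproduce the figure quoted for the full set of 2445 measurements.

```python
import statistics

# FVC readings (litres) as listed on the "Visualising the Variation" slide.
readings = [
    2.159, 2.065, 1.518, 2.227, 2.09, 2.451, 1.871,
    2.571, 2.532, 2.545, 2.538, 2.795, 2.102, 1.804,
    2.432, 2.704, 2.258, 2.282, 1.663, 2.795, 2.238,
    1.953, 2.382, 2.344, 2.967, 2.68, 2.413, 2.444,
    1.953, 2.314, 2.15, 2.634, 2.598, 2.09, 2.641, 2.92,
    2.727, 2.307, 2.76, 2.439, 2.259, 2.111, 2.58,
    2.602, 2.461, 3.128, 2.241, 2.602, 3.177,
]

mean = statistics.fmean(readings)
sd = statistics.stdev(readings)  # sample standard deviation (n - 1 denominator)

# Report in the same "mean ± SD (n=...)" form used on the slide.
print(f"{mean:.2f} ± {sd:.2f} L (n={len(readings)})")
```

`statistics.stdev` is used rather than `statistics.pstdev` because the readings are treated as a sample from a larger population, which is a choice made for this example.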
Distributions, Experiments & Stats

Expts ‘prod’ nature to see if s/he responds


need to compare ‘before’ and ‘after’ measurements
to decide if/how nature responded
starting point is nature didn’t respond



Null Hypothesis
ie distribution of observations identical before & after
Things like the CLT tell us how measurements
in our expts should distribute

statistics looks at the ‘overlap’ between before &
after distributions to estimate the probability that…


both belong to the same distribution (Null Hypo)
the two distributions are different (Alternative Hypo)

ie nature responded to whatever we did!
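One simple way to put a number on the 'overlap' idea is a permutation test, sketched below. This is a swapped-in illustration rather than the method used in the slides. The 'before' and 'after' data are hypothetical values drawn from a random number generator, and `permutation_p_value` is a name invented for the example.

```python
import random
import statistics


def permutation_p_value(before, after, n_shuffles=5000, seed=0):
    """Fraction of random relabellings whose difference in means is at
    least as large as the difference actually observed."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(after) - statistics.fmean(before))
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the before/after labels mean nothing
        diff = abs(statistics.fmean(pooled[n_before:])
                   - statistics.fmean(pooled[:n_before]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles


# Hypothetical data (assumed values, for illustration only).
data_rng = random.Random(1)
before = [data_rng.gauss(2.4, 0.35) for _ in range(30)]
no_effect = [data_rng.gauss(2.4, 0.35) for _ in range(30)]   # nature ignored us
big_effect = [data_rng.gauss(3.0, 0.35) for _ in range(30)]  # nature responded

p_none = permutation_p_value(before, no_effect)
p_big = permutation_p_value(before, big_effect)
print(f"no effect:  p = {p_none:.3f}")
print(f"big effect: p = {p_big:.3f}")
```

Shuffling the labels acts out the Null Hypothesis: if before and after really come from the same distribution, the labels are interchangeable. A small value means a gap as large as the observed one rarely turns up under that assumption, which is the evidence for the Alternative Hypothesis.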
Summary

Variation often arises due to complexity of
system studied (eg humans)

eg lung function varies since factors affecting it vary

Displaying data as frequency distributions often
→ Bell Curve/Normal Distribution
Simple systems (quincunx, dice) → Norm Dist
Complex systems also → Norm Dist

because of the Central Limit Theorem
If we know how we expect our data to
distribute, we can estimate probability that
distribution changes when we do experiments

ie we can do science!