Practice & Communication of Science
From Probability to Distributions
@UWE_KAR

Where Does Variation Come From?

Measurements in complex systems (eg living creature) vary…
  Between organisms, and within (eg over time)
  As a result of our measurement of it (eg error)
  Or sometimes as a result of experiments we do
For example, aspects of lung function…
  Forced Vital Capacity (FVC) – total volume of air you can shift in one maximal, forced exhalation; dependent on size/elasticity of lungs/thorax
  Forced Expiratory Volume at 1 second (FEV1) – volume of air you shift in first second of above; dependent on airway resistance

Thought Experiment

Measure lung function of all 7 billion people on the planet with a very, very, very sensitive spirometer (sub-µl)…
7 billion different sets of readings?
Variation due to innumerable influences…
  Age, height, weight, gender, blood pressure, blood glucose, hormone levels, nutrition at age 5, nutrition at age 6, gene variant tvc15, etc, etc
  Sitting, standing, air pressure, temperature, voltage, tubing angle, experimenter tiredness, enthusiasm, hydration, position of Mars, etc, etc

Visualising the Variation

Here are some actual readings for FVC (litres)…
  2.159, 2.065, 1.518, 2.227, 2.09, 2.451, 1.871, 2.571, 2.532, 2.545, 2.538, 2.795, 2.102, 1.804, 2.432, 2.704, 2.258, 2.282, 1.663, 2.795, 2.238, 1.953, 2.382, 2.344, 2.967, 2.68, 2.413, 2.444, 1.953, 2.314, 2.15, 2.634, 2.598, 2.09, 2.641, 2.92, 2.727, 2.307, 2.76, 2.439, 2.259, 2.111, 2.58, 2.602, 2.461, 3.128, 2.241, 2.602, 3.177
We could plot each value along an x-axis…
Note how they bunch in middle (and if very close together get displaced upward)

Frequency Distributions

Divide the x-axis up into ‘bins’ of a given width
Put each reading in the appropriate ‘bin’
In each bin, points sit on top of each other
So height of stack represents ‘frequency’ of the range of readings represented by the bin
This frequency-distribution has a curved shape
  bunching in the middle
  a few extreme values
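
The binning steps above can be sketched in a few lines of Python. This is only an illustration: the bin width of 0.25 L is an assumption (the slides do not state one), and the names FVC_LITRES, frequency_distribution and print_histogram are made up for this example. The readings are the 49 values listed above.

```python
from collections import Counter

# FVC readings (litres), copied from the list above (49 values).
FVC_LITRES = [
    2.159, 2.065, 1.518, 2.227, 2.09, 2.451, 1.871, 2.571, 2.532, 2.545,
    2.538, 2.795, 2.102, 1.804, 2.432, 2.704, 2.258, 2.282, 1.663, 2.795,
    2.238, 1.953, 2.382, 2.344, 2.967, 2.68, 2.413, 2.444, 1.953, 2.314,
    2.15, 2.634, 2.598, 2.09, 2.641, 2.92, 2.727, 2.307, 2.76, 2.439,
    2.259, 2.111, 2.58, 2.602, 2.461, 3.128, 2.241, 2.602, 3.177,
]


def frequency_distribution(readings, bin_width=0.25):
    """Return {bin_index: count}; bin i covers [i*width, (i+1)*width)."""
    counts = Counter(int(value // bin_width) for value in readings)
    # Include empty bins between the lowest and highest occupied bin.
    return {i: counts.get(i, 0) for i in range(min(counts), max(counts) + 1)}


def print_histogram(readings, bin_width=0.25):
    """Print one row per bin; each '*' is one reading stacked in that bin."""
    for index, count in frequency_distribution(readings, bin_width).items():
        low = index * bin_width
        print(f"{low:4.2f}-{low + bin_width:4.2f} L | {'*' * count}")


print_histogram(FVC_LITRES)
```

Each row of stars is one ‘stack’ of points; changing bin_width changes how smooth or lumpy the shape looks.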
Distribution of a Big Dataset

The 50 readings are part of a larger set of data
  around 2400 readings
  each point represents up to 10 readings
  famous ‘Bell Curve’ (an idealised prob density)

Origin of the Bell Curve

An individual’s FVC determined by lots of things
  not all influences are equally influential, but…
  each will have either a positive or a negative effect on the reading
So, an individual’s FVC determined by concerted action of countless, seemingly random, +ve and –ve ‘nudges’ throughout their life
For a population, the influence of the large number of ‘random’ +ve or –ve effects on each individual gives a Bell Curve
  aka Gaussian or Normal Distribution

Modelling Effect of +/- Nudges

Falling balls - a simple model (quincunx)
No obstacles; ball comes to rest directly below
  no ‘nudges’ left or right on the way down
An obstacle placed in its path will deflect it L or R
  ‘random’
Layers of obstacles
  cumulative effect, eg ++-+--+-++ = ++
  governed by probability

Rolling Dice: Same Outcome

One die…
  outcome values are 1, 2, 3, 4, 5 or 6
  each equally probable (1 in 6)
  distribution is… boring!

Two dice…
  outcome values are 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
  each not equally probable
  36 ways of making these
  only 1 way to get 2 (1+1), 3 ways to get 4, etc
  distribution is… slightly less boring!

Three dice…
  outcome values are 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18
  each not equally probable
  216 ways of making these
  27 ways each to throw a 10 or an 11, only 1 each to get a 3 or an 18
  distribution is… starting to curve

Four dice…
  outcome values are 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
  each not equally probable
  1296 ways of making these
  distribution is… looking familiar!

24 dice…
  outcome values are 24 to 144
  4.73838134 × 10^18 ways of making these!
  each not equally probable
  distribution is… looking very familiar!
  121 ‘discrete’ outcomes
  approximates the Normal Distribution (is, if ‘infinite’)

Why Do Balls & Dice Model FVC?!!!
The models generating the ND are uniform
  ‘pegs’ & layers are the same; balls hit pegs one at a time, one layer at a time; all dice are the same
The factors influencing FVC are not like that!
  so why does a simple model mirror complex reality?

The Central Limit Theorem

horribly technical, but basically…
any time you have a quantity which is bumped around by a large number of random processes, you end up with a bell curve distribution for that quantity. And it really doesn’t matter what those random processes are. They themselves don’t have to follow the Gaussian distribution. So long as there’s lots of them and they’re small, the overall effect is Gaussian
(http://scienceblogs.com/builtonfacts/2009/02/05/the-central-limit-theorem-made/)

The CLT/ND is everywhere!

[Picture: the steps in the old Magistrates Court in Lampeter, Wales]
Each footstep causes minuscule erosion
Many factors influence position of each footstep
  essentially ‘random’ and conform to the CLT, giving the ND

Uses of The Normal Distribution

Many things in biosystems normally distributed
A normal distribution is fully defined by 2 things
  a mean (average value)
  a standard deviation (an indication of ‘spread’)
So, the 2445 FVC measurements can be summarised as 3.53 ± 0.92 L (n=2445)
If we know our data fits a distribution, then do an experiment, statistics can estimate prob that distribution has (not) changed
  ie an ‘independent’ test of whether our expt worked
  so distributions/stats central to scientific method!
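
The dice counts from the earlier slides, and the claim that the ‘nudges’ need not be uniform, can be checked by brute-force counting. The Python sketch below is an illustration only: the function names are made up, and the loaded die (face 1 five times as likely as each other face) is a hypothetical example, not something from the slides. It also reports the share of throws lying within one standard deviation of the mean, which for a Normal Distribution is about 68%.

```python
def ways_to_sum(n_dice, face_weights=(1, 1, 1, 1, 1, 1)):
    """Return {total: weight} for the sum of n_dice identical dice.

    face_weights[i] is the relative weight of face i+1. With all weights
    equal to 1 (a fair die) each value is the number of ways of making
    that total.
    """
    dist = {0: 1}  # zero dice: exactly one way to total 0
    for _ in range(n_dice):
        new_dist = {}
        for total, ways in dist.items():
            for face, weight in enumerate(face_weights, start=1):
                new_dist[total + face] = new_dist.get(total + face, 0) + ways * weight
        dist = new_dist
    return dist


def fraction_within_one_sd(dist):
    """Share of the distribution lying within one SD of its mean."""
    total_weight = sum(dist.values())
    mean = sum(t * w for t, w in dist.items()) / total_weight
    variance = sum(w * (t - mean) ** 2 for t, w in dist.items()) / total_weight
    sd = variance ** 0.5
    inside = sum(w for t, w in dist.items() if abs(t - mean) <= sd)
    return inside / total_weight


for n in (1, 2, 3, 4, 24):
    dist = ways_to_sum(n)
    print(f"{n} dice: {len(dist)} outcomes, {sum(dist.values())} ways, "
          f"{fraction_within_one_sd(dist):.2f} within 1 SD")

# HYPOTHETICAL loaded die: face 1 is five times as likely as each other face.
loaded = ways_to_sum(24, face_weights=(5, 1, 1, 1, 1, 1))
print(f"24 loaded dice: {fraction_within_one_sd(loaded):.2f} within 1 SD")
```

Even though a single loaded die is far from bell-shaped, the total of 24 of them should land close to the same two-thirds share, which is the Central Limit Theorem at work.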
Distributions, Experiments & Stats

Expts ‘prod’ nature to see if s/he responds
  need to compare ‘before’ and ‘after’ measurements to decide if/how nature responded
  starting point is nature didn’t respond: the Null Hypothesis
  ie distribution of observations identical before & after
Things like the CLT tell us how measurements in our expts should distribute
  statistics looks at the ‘overlap’ between before & after distributions to estimate the probability that…
    both belong to the same distribution (Null Hypo)
    the two distributions are different (Alternative Hypo), ie nature responded to whatever we did!

Summary

Variation often arises due to complexity of system studied (eg humans)
Displaying data as frequency distributions often gives a Bell Curve/Normal Distribution
Simple systems (quincunx, dice) give a Norm Dist
Complex systems also give a Norm Dist
  eg lung function varies since factors affecting it vary
  because of the Central Limit Theorem
If we know how we expect our data to distribute, we can estimate probability that distribution changes when we do experiments
  ie we can do science!
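
The ‘before’ versus ‘after’ comparison described under Distributions, Experiments & Stats can be sketched with a permutation test. This is a swapped-in illustration, not the method used in the slides: it shuffles the before/after labels and counts how often the shuffled groups differ in mean by at least as much as the real groups do, which is what we would expect to see if the Null Hypothesis held. The function name, the number of shuffles and the ‘after’ values are all hypothetical; only the ten ‘before’ values come from the FVC readings listed earlier.

```python
import random
from statistics import mean


def permutation_p_value(before, after, n_shuffles=2000, seed=0):
    """Two-sided permutation test on the difference in means.

    Returns the estimated share of label shuffles whose mean difference
    is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(after) - mean(before))
    pooled = list(before) + list(after)
    n_before = len(before)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_before:]) - mean(pooled[:n_before]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# First ten FVC readings (litres) from the list earlier in the slides.
before = [2.159, 2.065, 1.518, 2.227, 2.09, 2.451, 1.871, 2.571, 2.532, 2.545]

# HYPOTHETICAL 'after' data, invented purely to exercise the test.
no_response = list(before)                   # nature did not respond at all
big_response = [x + 1.0 for x in before]     # every reading rises by 1 litre

print("no response: ", permutation_p_value(before, no_response))
print("big response:", permutation_p_value(before, big_response))
```

A value near 1 means the two sets of readings overlap completely, so there is no reason to drop the Null Hypothesis; a very small value means shuffled labels almost never reproduce the observed gap, which points towards the Alternative Hypothesis.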