Lecture 9
Moments of distributions

Body size distribution of European Collembola

The largest species (body weight estimated from average body length):

Species                                                       Length [mm]   Weight [mg]   ln weight
Tetrodontophora bielanensis (Waga 1842)                       7             13.471729     2.6006
Orchesella chiantica Frati & Szeptycki 1990                   7             13.471729     2.6006
Disparrhopalites tergestinus Fanciulli, Colla, Dallai 2005    6.875         12.924837     2.5592
Orchesella dallaii Frati & Szeptycki 1990                     6             9.4503028     2.2460
Seira pini Jordana & Arbea 1989                               6             9.4503028     2.2460
Isotomurus pentodon (Kos, 1937)                               5.3           7.1044808     1.9607
Heteromurus (V.) longicornis (Absolon 1900)                   5.3           7.1044808     1.9607
Pogonognathellus flavescens (Tullberg 1871)                   5.25          6.9512714     1.9389
Orchesella hoffmanni Stomp 1968                               5.25          6.9512714     1.9389
Heteromurus (H) constantinellus Lučić, Ćurčić & Mitić 2007    5.06          6.3862223     1.8541
Pogonognathellus longicornis (Müller 1776)                    5             6.2133935     1.8267
Orchesella devergens Handschin 1924                           5             6.2133935     1.8267
Orchesella flavescens (Bourlet 1839)                          5             6.2133935     1.8267
Orchesella quinquefasciata (Bourlet 1841)                     5             6.2133935     1.8267

Body weight follows from body length by an allometric relation:

$W = W_0 L^z, \qquad \ln W = \ln W_0 + z \ln L$

$W\,[\mathrm{mg}] = e^{-1.875}\, L\,[\mathrm{mm}]^{2.3}$

Spreadsheet formula: =JEŻELI(B86=0;0;EXP(-1.875+LN(B86)*2.3)) (JEŻELI is the Polish Excel name of IF).

The histogram of raw data

[Figure: histogram of the number of species per ln body weight class, Collembola. The mode is the class around -1.23 with 395 species.]

Three Collembolan weight classes (ln body weight)

Class 1: N = 25, mean = 1.8169079
  Values: 2.6005933, 2.5591508, 2.2460468 (×2), 1.9607257 (×2), 1.9389246 (×2), 1.8541429, 1.8267072 (×5), 1.584378 (×6), 1.5326904 (×2), 1.5064044, 1.4529137 (×2)
Class 2: N = 31, mean = 1.032923
  Values: 1.313477 (×5), 1.301948, 1.225568, 1.165038 (×4), 1.006355 (×10), 0.939683, 0.871022 (×2), 0.835906 (×2), 0.800247 (×2), 0.764026, 0.756712, 0.727225
Class 3: N = 43, mean = 0.531059
  Values listed: 0.651808 (×17), 0.613152, 0.573835 (×2), 0.533834, 0.493125 (×5), 0.489014, 0.451682 (×4), 0.409479

What is the average body weight?

Population mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Weighted mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i \bar{x}_i = \sum_{i=1}^{k} \bar{x}_i \frac{n_i}{n} = \sum_{i=1}^{k} \bar{x}_i f(i)$

For the three classes:

$\bar{x} = \frac{25}{99}\,1.817 + \frac{31}{99}\,1.033 + \frac{43}{99}\,0.531 = 1.013$

Discrete distributions

With class frequencies $f(x_i) = n_i / n$ the weighted mean is

$\bar{x} = \sum_{i=1}^{k} x_i \frac{n_i}{n} = \sum_{i=1}^{k} x_i f(x_i)$

ln weight     Number of   Frequency     x f(x)       (x - mean)^2 f(x)
class mean    species     f = n_i/n
-4.715110     7           0.004132231   -0.019484    0.043361
-4.018377     53          0.031286895   -0.125723    0.202268
-3.321643     133         0.078512397   -0.260790    0.267517
-2.624909     224         0.132231405   -0.347095    0.174620
-1.928176     353         0.208382527   -0.401798    0.042653
-1.231442     395         0.233175915   -0.287143    0.013918
-0.534708     325         0.191853601   -0.102586    0.169898
 0.162025     126         0.074380165    0.012051    0.199511
 0.858759     45          0.026564345    0.022812    0.144774
 1.555493     24          0.014167651    0.022038    0.130179
 2.252226     9           0.005312869    0.011966    0.073837
Sum           1694        1             -1.475751    1.462536

Spreadsheet formulas: frequency =B2/B14, mean term =A2*C2, variance term =(A2-D14)^2*C2. The first variance term (0.043361) appears only as a formula in the slide and is computed here from that formula.

Arithmetic mean of ln body weight = -1.4758, variance = 1.4625, standard deviation = 1.2094.

The average European springtail has a body weight of $e^{-1.476} = 0.23$ mg. Most often encountered is a weight around $e^{-1.23} = 0.29$ mg.

Continuous distributions

$\mu = \int_{\min}^{\max} x f(x)\,dx$

Why did we use log-transformed values?

[Figure: two histograms of the number of Collembola species. Left: log-transformed data (ln body weight class). Right: linear data (body weight class). On the linear scale the distribution is skewed.]

On the linear scale, with lb (log2) scaled weight classes:

Weight class   Number of   Frequency     w f(w)
mean [mg]      species
0.01           7           0.004132231   0.00003702
0.02           53          0.031286895   0.0005626
0.04           133         0.078512397   0.0028338
0.07           224         0.132231405   0.0095797
0.15           353         0.208382527   0.0303016
0.29           395         0.233175915   0.0680574
0.59           325         0.191853601   0.1123956
1.18           126         0.074380165   0.0874629
2.36           45          0.026564345   0.0626980
4.74           24          0.014167651   0.0671181
9.51           9           0.005312869   0.0505194
Sum            1694                      0.491566

Arithmetic mean of the weights: 0.4916 mg. Geometric mean: $e^{-1.4757512} = 0.2286$ mg.

Geometric mean

$\bar{x}_g = \sqrt[n]{\prod_{i=1}^{n} x_i} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln x_i}$

For strongly right-skewed data such as these, which become roughly symmetric after log transformation, we have to use the geometric mean. To make things easier we first log-transform our data.
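The frequency-based mean, variance, and geometric mean of the Collembola example can be recomputed with a few lines of Python. This is a minimal sketch: the class means and counts are taken from the lecture's table, while the function and variable names are illustrative choices and not part of the lecture.

```python
import math

# ln body weight class means and species counts from the lecture's table
ln_class_means = [-4.71511, -4.018377, -3.321643, -2.624909, -1.928176,
                  -1.231442, -0.534708, 0.162025, 0.858759, 1.555493, 2.252226]
species_counts = [7, 53, 133, 224, 353, 395, 325, 126, 45, 24, 9]


def frequency_moments(values, counts):
    """Return mean and variance of a discrete distribution given class counts."""
    n = sum(counts)
    freqs = [c / n for c in counts]            # f(x_i) = n_i / n
    mean = sum(x * f for x, f in zip(values, freqs))
    variance = sum((x - mean) ** 2 * f for x, f in zip(values, freqs))
    return mean, variance


mean_ln, var_ln = frequency_moments(ln_class_means, species_counts)
sd_ln = math.sqrt(var_ln)
# The geometric mean of the weights is the back-transformed mean of ln weights.
geometric_mean_mg = math.exp(mean_ln)

# Weighted mean of the three weight classes (class size, class mean)
class_sizes = [25, 31, 43]
class_means = [1.8169079, 1.032923, 0.531059]
weighted_mean = sum(n * m for n, m in zip(class_sizes, class_means)) / sum(class_sizes)

print(f"mean ln weight   = {mean_ln:.4f}")
print(f"variance         = {var_ln:.4f}")
print(f"std deviation    = {sd_ln:.4f}")
print(f"geometric mean   = {geometric_mean_mg:.3f} mg")
print(f"weighted mean    = {weighted_mean:.3f}")
```

The variance here divides by n (through the frequencies), which matches the spreadsheet column in the lecture; a sample variance with n - 1 would be marginally larger.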
Variance and standard deviation

[Figure: frequency distribution $f(x_i) = n_i/n$ of the ln body weight classes, Collembola, with the mean marked.]

For the Collembola ln body weight classes the mean is -1.4758, the variance 1.4625, and the standard deviation 1.2094.

Population variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$

Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, where $n - 1$ is the number of degrees of freedom

Variance from frequencies: $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 f(x_i)$

Continuous distributions: $s^2 = \int_{\min}^{\max} (x - \bar{x})^2 f(x)\,dx$

Standard deviation: $SD = s = \sqrt{s^2}$

The standard deviation is a measure of the width of the statistical distribution that has the same dimension as the mean.

The standard deviation as a measure of errors

Environmental pollution: NOx concentration [ppm] at 20 stations

Station   NOx [ppm]     Station   NOx [ppm]
1         8.49          11        8.51
2         1.12          12        9.62
3         9.11          13        3.35
4         7.75          14        7.74
5         0.75          15        2.03
6         8.23          16        5.06
7         0.97          17        7.61
8         6.06          18        0.99
9         8.48          19        2.55
10        5.88          20        8.91

Mean = 5.66, variance = 10.45, standard deviation = 3.23.

The precision of derived metrics should always match the precision of the raw data.

Distance   Average NOx     Standard
           concentration   deviation
1          9.53            1.70
2          7.37            1.18
3          5.24            0.86
4          3.15            0.26
5          2.17            0.18
6          1.05            0.09
7          0.84            0.14
8          0.63            0.10
9          0.32            0.03
10         0.21            0.02

± 1 standard deviation is the most often used estimator of error. For a normally distributed variable, about 68% of the values lie within ± 1 standard deviation of the mean, and about 95% lie within ± 2 standard deviations.
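The summary values for the 20 NOx stations can be reproduced with Python's standard `statistics` module. This is a small sketch using the station data listed in the lecture; the variable names are illustrative. Note that `statistics.variance` and `statistics.stdev` use the sample formula with n - 1 degrees of freedom.

```python
import statistics

# NOx concentrations [ppm] at the 20 stations from the lecture's example
nox_ppm = [8.49, 1.12, 9.11, 7.75, 0.75, 8.23, 0.97, 6.06, 8.48, 5.88,
           8.51, 9.62, 3.35, 7.74, 2.03, 5.06, 7.61, 0.99, 2.55, 8.91]

nox_mean = statistics.mean(nox_ppm)
nox_variance = statistics.variance(nox_ppm)   # divides by n - 1
nox_sd = statistics.stdev(nox_ppm)            # square root of the sample variance

# Report with the same precision as the raw data (two decimals)
print(f"mean = {nox_mean:.2f}, variance = {nox_variance:.2f}, SD = {nox_sd:.2f}")
```

Rounding the output to two decimals follows the lecture's rule that derived metrics should match the precision of the raw data.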
[Figure: NOx concentration against distance [km], with error bars of ± 1 standard deviation.]

Standard deviation and standard error

Slide values (Mean, Standard deviation): 5.44 4.15 4.49 5.29 5.55 3.39 5.56 3.13

The standard deviation does not change systematically with sample size. The precision of the estimate of the mean should increase with sample size n. The standard error is a measure of this precision.

$SE = \frac{SD}{\sqrt{n}}$

Environmental pollution: NOx concentrations [ppm] at 20 stations (the same data as in the previous example).

Distance   Average NOx     Standard    Standard error
           concentration   deviation   (n = 20)
1          9.53            3.32        0.74
2          7.37            2.45        0.55
3          5.24            1.24        0.28
4          3.15            0.67        0.15
5          2.17            0.87        0.19
6          1.05            0.34        0.08
7          0.84            0.14        0.03
8          0.63            0.10        0.02
9          0.32            0.03        0.01
10         0.21            0.02        0.01

[Figure: NOx concentration against distance [km], with error bars of ± 1 standard error.]

Central moments

$s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 f(x_i) = \sum_{i=1}^{n} x_i^2 f(x_i) - 2\bar{x}\sum_{i=1}^{n} x_i f(x_i) + \bar{x}^2 \sum_{i=1}^{n} f(x_i)$

Because $\sum x_i f(x_i) = \bar{x}$ and $\sum f(x_i) = 1$:

$s^2 = \sum_{i=1}^{n} x_i^2 f(x_i) - 2\bar{x}^2 + \bar{x}^2 = \sum_{i=1}^{n} x_i^2 f(x_i) - \bar{x}^2$

For a sample:

$s^2 = \frac{\sum_{i=1}^{n} x_i^2}{n-1} - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$

$\sigma^2 = E(x^2) - [E(x)]^2$

The variance is the difference between the mean of the squared values and the squared mean.

Mathematical expectation (first moment, the measure of central tendency): $E(X) = \mu$

k-th moment: $E(X^k) = \sum_{i=1}^{n} x_i^k f(x_i)$ (discrete), $E(X^k) = \int x^k f(x)\,dx$ (continuous)

k-th central moment: $E((X - \mu)^k)$

Second central moment (variance): $\sigma^2 = E((X - \mu)^2) = \sum_{i=1}^{n} (x_i - \mu)^2 f(x_i) = E(x^2) - [E(x)]^2$

Third central moment: $E((X - \mu)^3) = E(X^3) - 3\mu E(X^2) + 3\mu^2 E(X) - \mu^3 = E(X^3) - 3\mu E(X^2) + 2\mu^3$

Skewness

$\text{skewness} = \frac{E((X - \mu)^3)}{\sigma^3}$

[Figure: three densities f(x). Right skewed distribution: skewness > 0. Symmetric distribution: skewness = 0. Left skewed distribution: skewness < 0.]

Kurtosis

$\text{kurtosis} = \frac{E((X - \mu)^4)}{\sigma^4} - 3$

[Figure: densities f(x) with kurtosis > 0, = 0, and < 0.]

Lecture 10
Important statistical distributions

What is the probability that of 10 newborn babies at least 7 are boys?
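As a numerical cross-check on this question, the probability can be summed directly from binomial terms, assuming p(boy) = p(girl) = 0.5 and independent births. This is a minimal sketch; the function name is an illustrative choice.

```python
import math


def prob_at_least(k_min, n, p):
    """P(X >= k_min) for a binomial variate with n trials and success probability p."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))


# At least 7 boys among 10 newborns, each a boy with probability 0.5
p_at_least_7_boys = prob_at_least(7, 10, 0.5)
print(f"P(at least 7 boys) = {p_at_least_7_boys:.3f}")
```

The four terms for k = 7, 8, 9, 10 have binomial coefficients 120, 45, 10, and 1, so the sum is 176/1024.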
p(girl) = p(boy) = 0.5

Bernoulli distribution

$p(k) = \binom{n}{k} p^k q^{n-k}$

[Figure: p(X) against X for n = 10 and p = 0.5.]

$p(k > 6) = \binom{10}{7} 0.5^7\, 0.5^3 + \binom{10}{8} 0.5^8\, 0.5^2 + \binom{10}{9} 0.5^9\, 0.5^1 + \binom{10}{10} 0.5^{10}\, 0.5^0 = 0.172$

Bernoulli or binomial distribution

$p(k) = \binom{n}{k} p^k q^{n-k}$

$F(k) = p(x \le k) = \sum_{x=0}^{k} \binom{n}{x} p^x q^{n-x}$

$\mu = np, \qquad \sigma^2 = npq$

[Figure: the distribution $p(k) = \binom{10}{k} 0.2^k\, 0.8^{10-k}$.]

The Bernoulli or binomial distribution comes from the binomial expansion (binomial theorem):

$(p + q)^n = \sum_{i=0}^{n} \binom{n}{i} p^i q^{n-i} = \sum_{i=0}^{n} \binom{n}{i} p^i (1 - p)^{n-i} = 1$

Assume the probability to find a certain disease in a tree population is 0.01. A biomonitoring program surveys 10 stands of trees and takes in each case a random sample of 100 trees. How large is the probability that in these stands 1, 2, 3, and more than 3 cases of this disease will occur?

$p(1) = \binom{1000}{1} 0.01 \cdot 0.99^{999} = 0.0004$

$p(2) = \binom{1000}{2} 0.01^2 \cdot 0.99^{998} = 0.0022$

$p(3) = \binom{1000}{3} 0.01^3 \cdot 0.99^{997} = 0.0074$

Mean, variance, standard deviation

$\mu = 1000 \cdot 0.01 = 10$

$\sigma^2 = 1000 \cdot 0.01 \cdot 0.99 = 9.9$

$\sigma = \sqrt{9.9} = 3.146$

$p(k > 3) = 1 - p(k \le 3) = 1 - \sum_{i=0}^{3} \binom{1000}{i} 0.01^i\, 0.99^{1000-i}$

$p(k > 3) = 1 - \left[ \binom{1000}{0} 0.01^0\, 0.99^{1000} + \binom{1000}{1} 0.01^1\, 0.99^{999} + \binom{1000}{2} 0.01^2\, 0.99^{998} + \binom{1000}{3} 0.01^3\, 0.99^{997} \right] = 0.99$

Poisson distribution

$p(k) = \binom{n}{k} p^k q^{n-k}$

What happens if the number of trials n becomes larger and larger and the event probability p becomes smaller and smaller?

Write the distribution with mean $\lambda$ and a parameter $r$:

$p = \frac{\lambda}{r + \lambda}, \qquad q = 1 - p = \frac{r}{r + \lambda}$

$p(X = k) = \frac{(r + k - 1)!}{k!\,(r - 1)!}\, \frac{\lambda^k}{(r + \lambda)^k}\, \frac{r^r}{(r + \lambda)^r} = \frac{\lambda^k}{k!}\, \frac{(r + k - 1)!}{(r - 1)!\,(r + \lambda)^k}\, \frac{1}{\left(1 + \frac{\lambda}{r}\right)^r}$

$\lim_{r \to \infty} \frac{1}{\left(1 + \frac{\lambda}{r}\right)^r} = e^{-\lambda}, \qquad \lim_{r \to \infty} \frac{(r + k - 1)!}{(r - 1)!\,(r + \lambda)^k} = 1$

$p(X = k) = \frac{\lambda^k}{k!}\, e^{-\lambda}$

The Poisson distribution is the distribution of rare events.

For the tree disease example (probability 0.01, 10 stands of 100 trees each), the Poisson and Bernoulli solutions compare as follows, with $\lambda = 1000 \cdot 0.01 = 10$:

          Poisson solution                              Bernoulli solution
p(1)      $\frac{10^1}{1!} e^{-10} = 0.00045$           0.0004
p(2)      $\frac{10^2}{2!} e^{-10} = 0.0023$            0.0022
p(3)      $\frac{10^3}{3!} e^{-10} = 0.0076$            0.0074

The probability that no infected tree will be detected:

$p(0) = \frac{10^0}{0!} e^{-10} = e^{-10} = 0.000045, \qquad \text{in general } p(0) = e^{-\lambda}$

The probability of more than three infected trees:

Poisson solution: $p(0) + p(1) + p(2) + p(3) = 0.000045 + 0.00045 + 0.0023 + 0.0076 = 0.0104$, so $p(k > 3) = 1 - 0.0104 = 0.990$

Bernoulli solution: $p(k > 3) = 0.99$

[Figure: Poisson distributions p(k) for k = 0 to 13 with $\lambda$ = 1, 2, 3, 4, and 6.]

Variance and mean: $\sigma^2 = \mu = \lambda$

Skewness: $\frac{1}{\sqrt{\lambda}}$

What is the probability in Duży Lotek of three consecutive rollovers (no jackpot winner three times in a row) if the first time 14 000 000 people bet, the second time 20 000 000, and the third time 30 000 000?

The probability to win is

$p(6) = \frac{6!\,43!}{49!} \approx \frac{1}{14\,000\,000}$

$\lambda_1 = 14\,000\,000 \cdot \frac{1}{14\,000\,000} = 1$

$\lambda_2 = 20\,000\,000 \cdot \frac{1}{14\,000\,000} = 1.428571$

$\lambda_3 = 30\,000\,000 \cdot \frac{1}{14\,000\,000} = 2.142857$

The zero term of the Poisson distribution gives the probability of no event:

$p_1 = \frac{1^0}{0!} e^{-1} = 0.368$

$p_2 = \frac{1.428571^0}{0!} e^{-1.428571} = 0.239$

$p_3 = \frac{2.142857^0}{0!} e^{-2.142857} = 0.117$

The probability of at least one event is $p(k \ge 1) = 1 - e^{-\lambda}$.

The events are independent:

$p_{1,2,3} = 0.368 \cdot 0.239 \cdot 0.117 = 0.01$

The normal distribution

A pile model generates the binomial. If the number of steps is very, very large the binomial becomes smooth.

Abraham de Moivre (1667-1754)

The normal distribution is the continuous equivalent to the discrete Bernoulli distribution.

$f(x) = C e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \qquad C = \frac{1}{\sigma\sqrt{2\pi}}$

The central limit theorem

[Figure: frequency histograms of X, becoming increasingly bell shaped from panel to panel.]

If we have a series of independent random variates $X_n$ with finite variance, a new random variate $Y_n$ that is the sum of all $X_n$ will for n→∞ be a variate that is asymptotically normally distributed.
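The central limit theorem can be illustrated without random numbers by computing the exact distribution of a sum of variates and watching its skewness and kurtosis approach the normal values of zero. The sketch below uses the sum of fair six-sided dice, which is an example chosen here for illustration and not taken from the lecture; all function names are hypothetical.

```python
def convolve(p, q):
    """Distribution of the sum of two independent discrete variates."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def pmf_of_sum(single, n):
    """Exact probability mass function of the sum of n independent copies."""
    pmf = [1.0]
    for _ in range(n):
        pmf = convolve(pmf, single)
    return pmf


def shape_moments(pmf, offset):
    """Mean, variance, skewness, and (excess) kurtosis; pmf[i] belongs to value offset + i."""
    xs = [offset + i for i in range(len(pmf))]
    mean = sum(x * f for x, f in zip(xs, pmf))
    var = sum((x - mean) ** 2 * f for x, f in zip(xs, pmf))
    skew = sum((x - mean) ** 3 * f for x, f in zip(xs, pmf)) / var ** 1.5
    kurt = sum((x - mean) ** 4 * f for x, f in zip(xs, pmf)) / var ** 2 - 3
    return mean, var, skew, kurt


die = [1 / 6] * 6                     # index 0 is face 1, so a sum of n dice starts at n
results = {}
for n in (1, 2, 5, 20):
    results[n] = shape_moments(pmf_of_sum(die, n), offset=n)
    mean, var, skew, kurt = results[n]
    print(f"n={n:2d}  mean={mean:6.2f}  var={var:6.2f}  skew={skew:+.4f}  kurt={kurt:+.4f}")
```

The kurtosis of the sum shrinks in proportion to 1/n, so the distribution of the sum moves toward the normal shape as more variates are added.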
[Figure: frequency histograms of X approaching the normal curve.]

The normal or Gaussian distribution

Density function:

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

Cumulative distribution function:

$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(v - \mu)^2}{2\sigma^2}}\,dv$

Mean: $\mu$
Variance: $\sigma^2$

[Figure: density f(x) and cumulative distribution F(x) of a normal distribution.]

Important features of the normal distribution

• The function is defined for every real x.
• The frequency at $x = \mu$ is given by $p(x = \mu) = \frac{1}{\sigma\sqrt{2\pi}} \approx \frac{0.4}{\sigma}$
• The distribution is symmetrical around $\mu$.
• The points of inflection are given by the second derivative. Setting this to zero gives $x = \mu \pm \sigma$.

[Figure: normal density with the intervals $\mu \pm \sigma$ (area 0.68) and $\mu \pm 2\sigma$ (area 0.95) marked.]

With $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$:

$\int_{\mu - \sigma}^{\mu + \sigma} f(x)\,dx = 0.68$

$\int_{\mu - 2\sigma}^{\mu + 2\sigma} f(x)\,dx = 0.95$

$\int_{-\infty}^{\mu} f(x)\,dx = 0.5$

$\int_{-\infty}^{\mu + 2\sigma} f(x)\,dx = 0.975$

Many statistical tests compare observed values with those of the standard normal distribution and assign the respective probabilities to H1.

The Z-transform

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \qquad Z = \frac{x - \mu}{\sigma}$

The standard normal:

$f(Z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} Z^2}$

The variate Z has a mean of 0 and a variance of 1. A Z-transform standardizes every statistical distribution to mean 0 and variance 1; it does not change the shape of the distribution. Tables of statistical distributions are usually given for Z-transformed variates.
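The Z-transform and the areas under the standard normal curve can be checked numerically with the error function from Python's `math` module. This is a minimal sketch: the identity $\Phi(z) = \frac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$ is standard, while the function names and the example values (mean 5, standard deviation 2, observation 9) are hypothetical and chosen only for illustration.

```python
import math


def z_transform(x, mu, sigma):
    """Standardize x: subtract the mean and divide by the standard deviation."""
    return (x - mu) / sigma


def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_within(k):
    """P(mu - k*sigma < X < mu + k*sigma) for a normal variate."""
    return standard_normal_cdf(k) - standard_normal_cdf(-k)


# Hypothetical example: an observation of 9 from a distribution with mean 5 and SD 2
z = z_transform(9.0, mu=5.0, sigma=2.0)
print(f"z = {z:.2f}")

for k in (1.0, 1.65, 1.96, 2.58, 3.29):
    print(f"P(|Z| < {k:.2f}) = {prob_within(k):.4f}")
```

Because every normal distribution maps onto the same standard normal after the Z-transform, one function (or one table) is enough for all means and standard deviations.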
The 95% confidence limit

[Figure: three frequency distributions on different X scales (0 to 10, 0 to 18, 0 to 48).]

The Z-transformed (standardized) normal distribution

[Figure: normal density with the intervals $\mu \pm \sigma$ (area 0.68) and $\mu \pm 2\sigma$ (area 0.95) marked.]

The Fisherian significance levels:

P(μ - σ < X < μ + σ) = 68%
P(μ - 1.65σ < X < μ + 1.65σ) = 90%
P(μ - 1.96σ < X < μ + 1.96σ) = 95%
P(μ - 2.58σ < X < μ + 2.58σ) = 99%
P(μ - 3.29σ < X < μ + 3.29σ) = 99.9%

The estimation of the population mean from a series of samples

Each sample yields its own mean and standard deviation $(\bar{x}, s)$; the population has $(\mu, \sigma)$.

$Z = \frac{\sum_{i=1}^{n} x_i - n\mu}{\sigma\sqrt{n}} = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

The n samples come from an additive random variate. Z is asymptotically normally distributed.

[Figure: distributions f(x) of sample estimates for n = 10, n = 20, and n = 50.]

Standard error: $\frac{\sigma}{\sqrt{n}}$

Confidence limit of the estimate of a mean from a series of samples:

$\mu = \bar{x} \pm Z_{\alpha} \frac{\sigma}{\sqrt{n}}$

$\alpha$ is the desired probability level.

How to apply the normal distribution

Intelligence is approximately normally distributed with a mean of 100 (by definition) and a standard deviation of 16 (in North America). For an intelligence study we need 100 persons with an IQ above 130. How many persons do we have to test to find this number if we take random samples (and do not test university students only)?

$F(x > 130) = \frac{1}{\sigma\sqrt{2\pi}} \int_{130}^{\infty} e^{-\frac{(v - \mu)^2}{2\sigma^2}}\,dv = 1 - \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{130} e^{-\frac{(v - \mu)^2}{2\sigma^2}}\,dv$

[Figure: density f(IQ) for IQ from 40 to 160, with the areas IQ < 130 and IQ > 130 marked.]

One and two sided tests

We measure blood sugar concentrations and know that our method estimates the concentration with an error of about 3%. What is the probability that our measurement deviates from the real value by more than 5%?

Albinos are rare in human populations. Assume their frequency is 1 per 100000 persons. What is the probability to find 15 albinos among 1000000 persons?

$p(X = 15) = \binom{1000000}{15} (0.00001)^{15} (0.99999)^{999985}$

=KOMBINACJE(1000000,15)*0.00001^15*(1-0.00001)^999985 = 0.0347 (KOMBINACJE is the Polish Excel name of COMBIN.)

$\mu = np, \qquad \sigma^2 = npq$
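Two of the applications above, the IQ threshold and the albino count, can be worked through numerically. The sketch below is hedged: it assumes the stated mean of 100 and standard deviation of 16 for IQ, treats 100 divided by the tail probability as the expected number of persons to test, and adds a Poisson approximation with λ = np as a cross-check on the binomial value. The function names are illustrative.

```python
import math


def normal_upper_tail(x, mu, sigma):
    """P(X > x) for a normal variate, via the complementary error function."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson variate with mean lam."""
    return lam**k / math.factorial(k) * math.exp(-lam)


# IQ example: share of persons with IQ above 130 and expected number to test
p_iq_above_130 = normal_upper_tail(130, mu=100, sigma=16)
persons_to_test = math.ceil(100 / p_iq_above_130)
print(f"P(IQ > 130) = {p_iq_above_130:.4f}, expected persons to test = {persons_to_test}")

# Albino example: exactly 15 albinos among 1 000 000 persons, frequency 1 per 100 000
n_persons, p_albino = 1_000_000, 0.00001
p_15_binomial = binomial_pmf(15, n_persons, p_albino)
p_15_poisson = poisson_pmf(15, n_persons * p_albino)
print(f"binomial: {p_15_binomial:.4f}, Poisson approximation: {p_15_poisson:.4f}")
```

With n large and p small the binomial and Poisson values agree closely, which is the situation the Poisson distribution was introduced for.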