Lecture 9
Moments of distributions
Body size distribution of European Collembola
Species    Body weight [mg]
Tetrodontophora bielanensis (Waga 1842)    13.471729
Orchesella chiantica Frati & Szeptycki 1990    13.471729
Disparrhopalites tergestinus Fanciulli, Colla, Dallai 2005    12.924837
Orchesella dallaii Frati & Szeptycki 1990    9.4503028
Seira pini Jordana & Arbea 1989    9.4503028
Isotomurus pentodon (Kos, 1937)    7.1044808
Heteromurus (V.) longicornis (Absolon 1900)    7.1044808
Pogonognathellus flavescens (Tullberg 1871)    6.9512714
Orchesella hoffmanni Stomp 1968    6.9512714
Heteromurus (H) constantinellus Lučić, Ćurčić & Mitić 2007    6.3862223
Pogonognathellus longicornis (Müller 1776)    6.2133935
Orchesella devergens Handschin 1924    6.2133935
Orchesella flavescens (Bourlet 1839)    6.2133935
Orchesella quinquefasciata (Bourlet 1841)    6.2133935
The histogram of raw data
[Figure: number of Collembola species (0 to 500) per ln body weight class (-4.72 to 2.25); the mode (Modus) is marked.]
ln body weight [mg] of the 14 heaviest species: 2.6006, 2.6006, 2.5592, 2.246, 2.246, 1.9607, 1.9607, 1.9389, 1.9389, 1.8541, 1.8267, 1.8267, 1.8267, 1.8267
ln body weight class mean    Number of species
-4.71511    7
-4.018377    53
-3.321643    133
-2.624909    224
-1.928176    353
-1.231442    395
-0.534708    325
0.162025    126
0.858759    45
1.555493    24
2.252226    9
Three Collembolan weight classes
Class 1: N = 25, Mean = 1.8169079
Values: 2.6005933, 2.5591508, 2.2460468 (2 times), 1.9607257 (2 times), 1.9389246 (2 times), 1.8541429, 1.8267072 (5 times), 1.584378 (6 times), 1.5326904 (2 times), 1.5064044, 1.4529137 (2 times)
Class 2: N = 31, Mean = 1.032923
Values: 1.313477 (5 times), 1.301948, 1.225568, 1.165038 (4 times), 1.006355 (10 times), 0.939683, 0.871022 (2 times), 0.835906 (2 times), 0.800247 (2 times), 0.764026, 0.756712, 0.727225
Class 3: N = 43, Mean = 0.531059
Values listed: 0.651808 (17 times), 0.613152, 0.573835 (2 times), 0.533834, 0.493125 (5 times), 0.489014, 0.451682 (4 times), 0.409479
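What is the average body weight?
Population mean: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Weighted mean of the three weight classes:
$\bar{x} = \frac{25}{99}\cdot 1.817 + \frac{31}{99}\cdot 1.033 + \frac{43}{99}\cdot 0.531 = 1.013$
In general, for k classes with class means $x_i$ and class sizes $n_i$:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i n_i = \sum_{i=1}^{k} x_i \frac{n_i}{n} = \sum_{i=1}^{k} x_i f(i)$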
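The weighted mean of the three classes can be checked with a few lines of Python. This is a minimal sketch: the class sizes and class means are copied from the slide, while the function name `weighted_mean` is only an illustrative choice.

```python
# Class sizes (N) and class means of the three Collembolan weight classes.
class_sizes = [25, 31, 43]
class_means = [1.8169079, 1.032923, 0.531059]


def weighted_mean(values, weights):
    """Return sum(x_i * n_i) / sum(n_i), the weighted mean."""
    total = sum(weights)
    return sum(x * w for x, w in zip(values, weights)) / total


overall_mean = weighted_mean(class_means, class_sizes)
print(round(overall_mean, 3))
```

Dividing each class size by the total (99) gives the relative frequencies f(i), so the same number results from summing class mean times frequency.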
Weighted mean
Discrete distributions
$f(x_i) = \frac{n_i}{n}$
$\bar{x} = \sum_{i=1}^{k} \frac{n_i x_i}{n} = \sum_{i=1}^{k} x_i f(x_i)$
[Figure: relative frequency (0 to 0.25) of Collembola species per ln body weight class (-4.72 to 2.25).]
ln body weight [mg] class mean (A)    Number of species (B)    Frequency (C)    Arithmetic mean (D)    Variance (E)
-4.72    7    =B2/B14    =A2*C2    =(A2-D14)^2*C2
-4.02    53    0.031286895    -0.125723    0.202268085
-3.32    133    0.078512397    -0.26079    0.267516588
-2.62    224    0.132231405    -0.347095    0.174619987
-1.93    353    0.208382527    -0.401798    0.042653444
-1.23    395    0.233175915    -0.287143    0.013917567
-0.53    325    0.191853601    -0.102586    0.169898317
0.16    126    0.074380165    0.0120514    0.199510727
0.86    45    0.026564345    0.0228124    0.144774029
1.56    24    0.014167651    0.0220377    0.130178627
2.25    9    0.005312869    0.0119658    0.073837264
Sum: 1694 species; arithmetic mean = -1.475751; variance = 1.462535979; StDev = 1.209353538
The average European springtail has a body weight of e^-1.476 = 0.23 mg.
The most frequently encountered weight is around e^-1.23 = 0.29 mg.
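The spreadsheet calculation of the mean and variance of a discrete distribution can be reproduced as follows. This is a sketch that assumes the unrounded ln class means from the histogram table; all variable names are illustrative.

```python
import math

# ln body weight class means and species counts from the histogram table.
ln_class_means = [-4.71511, -4.018377, -3.321643, -2.624909, -1.928176,
                  -1.231442, -0.534708, 0.162025, 0.858759, 1.555493, 2.252226]
species_counts = [7, 53, 133, 224, 353, 395, 325, 126, 45, 24, 9]

n = sum(species_counts)
# Relative frequency f(x_i) = n_i / n (spreadsheet column "Frequency").
freq = [count / n for count in species_counts]

# Mean = sum of x_i * f(x_i); variance = sum of (x_i - mean)^2 * f(x_i).
mean_ln = sum(x * f for x, f in zip(ln_class_means, freq))
variance_ln = sum((x - mean_ln) ** 2 * f for x, f in zip(ln_class_means, freq))
st_dev_ln = math.sqrt(variance_ln)

print(n, round(mean_ln, 3), round(variance_ln, 3), round(st_dev_ln, 3))
# Back-transforming the mean of the logarithms gives the weight in mg.
print(round(math.exp(mean_ln), 2))
```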
Continuous distributions
$\mu = \int_{\min}^{\max} x f(x)\,dx$
Why did we use log transformed values?
Species    Average body length [mm]    Body weight [mg]
Tetrodontophora bielanensis (Waga 1842)    7    13.472
Orchesella chiantica Frati & Szeptycki 1990    7    13.472
Disparrhopalites tergestinus Fanciulli, Colla, Dallai 2005    6.875    12.925
Orchesella dallaii Frati & Szeptycki 1990    6    9.4503
Seira pini Jordana & Arbea 1989    6    9.4503
Isotomurus pentodon (Kos, 1937)    5.3    7.1045
Heteromurus (V.) longicornis (Absolon 1900)    5.3    7.1045
Pogonognathellus flavescens (Tullberg 1871)    5.25    6.9513
Orchesella hoffmanni Stomp 1968    5.25    6.9513
Heteromurus (H) constantinellus Lučić, Ćurčić & Mitić 2007    5.06    6.3862
Pogonognathellus longicornis (Müller 1776)    5    6.2134
Orchesella devergens Handschin 1924    5    6.2134
Orchesella flavescens (Bourlet 1839)    5    6.2134
Orchesella quinquefasciata (Bourlet 1841)    5    6.2134
Body weight W is calculated from body length L with a power function:
$W = W_0 L^z$
$\ln W = \ln W_0 + z \ln L$
$W\,[\mathrm{mg}] = e^{-1.875}\, L\,[\mathrm{mm}]^{2.3}$
Spreadsheet formula (JEŻELI is the Polish Excel name of IF): =JEŻELI(B86=0;0;EXP(-1.875+LN(B86)*2.3))
[Figure: Linear data. Number of Collembola species (0 to 500) per body weight class (0 to 10 mg). The distribution is skewed.]
[Figure: Log transformed data. Number of Collembola species (0 to 500) per ln body weight class (-6 to 4).]
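The length-to-weight conversion in the spreadsheet formula can be written as a small Python function. This is a sketch: the constants -1.875 and 2.3 come from the slide, and the function name `body_weight_mg` is an illustrative assumption.

```python
import math

LN_W0 = -1.875  # ln of the constant W0 in W = W0 * L^z (from the slide)
Z = 2.3         # exponent z (from the slide)


def body_weight_mg(length_mm):
    """Body weight in mg from body length in mm, W = exp(LN_W0 + Z * ln L).

    Mirrors the spreadsheet formula, which returns 0 for a length of 0.
    """
    if length_mm == 0:
        return 0.0
    return math.exp(LN_W0 + Z * math.log(length_mm))


for length in (7, 6.875, 6, 5):
    print(length, round(body_weight_mg(length), 3))
```

Taking logarithms turns the power function into a straight line, ln W = ln W0 + z ln L, which is why the weight classes are handled on a ln scale.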
$\bar{x}_g = \sqrt[n]{\prod_{i=1}^{n} x_i} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln x_i}$
Body weight [mg] class mean, Exp() of the ln scaled weight classes    Number of species    Frequency    Arithmetic mean (weight × frequency)    Geometric mean (ln weight × frequency)
0.01    7    0.004132231    3.702E-05    -0.019483926
0.02    53    0.031286895    0.0005626    -0.125722539
0.04    133    0.078512397    0.0028338    -0.260790153
0.07    224    0.132231405    0.0095797    -0.347095405
0.15    353    0.208382527    0.0303016    -0.401798187
0.29    395    0.233175915    0.0680574    -0.287142615
0.59    325    0.191853601    0.1123956    -0.102585655
1.18    126    0.074380165    0.0874629    0.012051446
2.36    45    0.026564345    0.062698    0.02281237
4.74    24    0.014167651    0.0671181    0.022037681
9.51    9    0.005312869    0.0505194    0.011965782
Sum: 1694 species; arithmetic mean = 0.491566 mg; mean of ln weights = -1.4757512; Exp() = 0.228606933 mg (geometric mean)
The average European springtail has a body weight of e^-1.476 = 0.23 mg.
Geometric mean
In the case of exponentially distributed data we have to use the geometric mean.
To make things easier we first log-transform our data.
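The difference between the two means can be illustrated with the class table. This is a sketch under the assumption that each class is represented by its ln class mean; names such as `geometric_mean` are illustrative.

```python
import math

# ln body weight class means and species counts from the class table.
ln_class_means = [-4.71511, -4.018377, -3.321643, -2.624909, -1.928176,
                  -1.231442, -0.534708, 0.162025, 0.858759, 1.555493, 2.252226]
species_counts = [7, 53, 133, 224, 353, 395, 325, 126, 45, 24, 9]
n = sum(species_counts)

# Geometric mean: average the logarithms, then back-transform with exp().
mean_of_logs = sum(x * c for x, c in zip(ln_class_means, species_counts)) / n
geometric_mean = math.exp(mean_of_logs)

# Arithmetic mean: back-transform each class first, then average.
arithmetic_mean = sum(math.exp(x) * c
                      for x, c in zip(ln_class_means, species_counts)) / n

print(round(geometric_mean, 3), round(arithmetic_mean, 3))
```

For right-skewed data the arithmetic mean is pulled upwards by the few heavy species, so it comes out roughly twice as large as the geometric mean here.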
Variance
Population variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$
Sample variance (n - 1 degrees of freedom): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$
Discrete distributions, with $f(x_i) = \frac{n_i}{n}$: $s^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 f(x_i)$
Continuous distributions: $s^2 = \int_{\min}^{\max}(x-\bar{x})^2 f(x)\,dx$
Standard deviation: $s = \sqrt{s^2}$
[Figure: relative frequency (0 to 0.25) of Collembola species per ln body weight class (-4.72 to 2.25), with the mean and ± 1 SD marked.]
The standard deviation is a measure of the width of the statistical distribution that has the same dimension as the mean.
Mean    Variance    Standard deviation
5.66    10.45    3.23
The standard deviation as a measure of errors
Distance    Average NOx concentration    Standard deviation
1    9.53    1.70
2    7.37    1.18
3    5.24    0.86
4    3.15    0.26
5    2.17    0.18
6    1.05    0.09
7    0.84    0.14
8    0.63    0.10
9    0.32    0.03
10    0.21    0.02
The precision of
derived metrics
should always
match the
precision of the
raw data
Environmental pollution
Station    NOx [ppm]
1    8.49
2    1.12
3    9.11
4    7.75
5    0.75
6    8.23
7    0.97
8    6.06
9    8.48
10    5.88
11    8.51
12    9.62
13    3.35
14    7.74
15    2.03
16    5.06
17    7.61
18    0.99
19    2.55
20    8.91
± 1 standard deviation is the most often used estimator of error.
The probability that the true mean is within ± 1 standard deviation is approximately 68%.
The probability that the true mean is within ± 2 standard deviations is approximately 95%.
[Figure: average NOx concentration (0 to 14) against distance [km] (1 to 10), with error bars of ± 1 standard deviation.]
Standard deviation and standard error
Mean
Standard
deviation
5.44
4.15
4.49
5.29
5.55
3.39
5.56
3.13
The standard deviation is constant irrespective of
sample size.
The precision of the estimate of the mean should
increase with sample size n.
The standard error is a measure of precision.
$SE = \frac{SD}{\sqrt{n}}$
Distance [km]    Average NOx concentration    Standard deviation    Standard error (n=20)
1    9.53    3.32    0.74
2    7.37    2.45    0.55
3    5.24    1.24    0.28
4    3.15    0.67    0.15
5    2.17    0.87    0.19
6    1.05    0.34    0.08
7    0.84    0.14    0.03
8    0.63    0.10    0.02
9    0.32    0.03    0.01
10    0.21    0.02    0.01
[Figure: Environmental pollution. NOx concentration (0 to 12) against distance [km] (1 to 10).]
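Mean, variance, standard deviation and standard error can be computed for the 20 station measurements with the Python standard library. This is a sketch: it assumes the sample (n - 1) form of the variance, which reproduces the slide values of mean 5.66, variance 10.45 and standard deviation 3.23.

```python
import math
import statistics

# NOx concentrations [ppm] of the 20 stations from the slide.
nox_ppm = [8.49, 1.12, 9.11, 7.75, 0.75, 8.23, 0.97, 6.06, 8.48, 5.88,
           8.51, 9.62, 3.35, 7.74, 2.03, 5.06, 7.61, 0.99, 2.55, 8.91]

mean = statistics.mean(nox_ppm)
variance = statistics.variance(nox_ppm)   # sample variance, divides by n - 1
sd = statistics.stdev(nox_ppm)            # square root of the sample variance
se = sd / math.sqrt(len(nox_ppm))         # standard error of the mean

print(round(mean, 2), round(variance, 2), round(sd, 2), round(se, 2))
```

The standard deviation describes the spread of the single measurements, while the standard error shrinks with the square root of n and describes the precision of the mean.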
Central moments
$s^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 f(x_i) = \sum_{i=1}^{n} x_i^2 f(x_i) - 2\sum_{i=1}^{n} x_i \bar{x} f(x_i) + \sum_{i=1}^{n} \bar{x}^2 f(x_i)$
$s^2 = \sum_{i=1}^{n} x_i^2 f(x_i) - 2\bar{x}\bar{x} + \bar{x}^2 \cdot 1 = \sum_{i=1}^{n} x_i^2 f(x_i) - \bar{x}^2$
For a sample of n values: $s^2 = \frac{\sum_{i=1}^{n} x_i^2}{n-1} - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n(n-1)}$, where the first term corresponds to $E(x^2)$ and the second to $[E(x)]^2$.
The variance is the difference between the mean of the squared values and the squared mean:
$\sigma^2 = E(x^2) - [E(x)]^2$
Mathematical expectation (first moment, a measure of central tendency):
$\mu = E(X) = \sum_{i=1}^{n} x_i f(x_i)$
k-th moment:
$E(X^k) = \sum_{i=1}^{n} x_i^k f(x_i)$ (discrete), $E(X^k) = \int_{-\infty}^{\infty} x^k f(x)\,dx$ (continuous)
Second central moment (variance):
$\sigma^2 = \sum_{i=1}^{n}(X_i-\mu)^2 f(X_i) = E((X-\mu)^2)$
Third central moment
$E((X-\mu)^3) = E(X^3) - 3\mu E(X^2) + 3\mu^2 E(X) - \mu^3 = E(X^3) - 3\mu E(X^2) + 2\mu^3$
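The moment identities can be verified numerically for any discrete distribution. The sketch below uses a small made-up frequency table (the values and frequencies are hypothetical, chosen only for illustration) and the helper names `raw_moment` and `central_moment` are assumptions.

```python
def raw_moment(values, freqs, k):
    """k-th moment E(X^k) = sum of x_i^k * f(x_i)."""
    return sum(x ** k * f for x, f in zip(values, freqs))


def central_moment(values, freqs, k):
    """k-th central moment E((X - mu)^k)."""
    mu = raw_moment(values, freqs, 1)
    return sum((x - mu) ** k * f for x, f in zip(values, freqs))


# Hypothetical right-skewed discrete distribution (frequencies sum to 1).
values = [0, 1, 2, 3, 4]
freqs = [0.4, 0.3, 0.15, 0.1, 0.05]

mu = raw_moment(values, freqs, 1)
variance = central_moment(values, freqs, 2)
third = central_moment(values, freqs, 3)

# Variance as mean of squares minus squared mean.
variance_from_raw = raw_moment(values, freqs, 2) - mu ** 2
# Third central moment expressed through raw moments.
third_from_raw = (raw_moment(values, freqs, 3)
                  - 3 * mu * raw_moment(values, freqs, 2) + 2 * mu ** 3)

# Standardized third and fourth moments.
skewness = third / variance ** 1.5
excess_kurtosis = central_moment(values, freqs, 4) / variance ** 2 - 3
print(round(skewness, 3), round(excess_kurtosis, 3))
```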
Skewness
$\text{skewness} = \frac{E((X-\mu)^3)}{\sigma^3}$
Left skewed distribution: skewness < 0. Symmetric distribution: skewness = 0. Right skewed distribution: skewness > 0.
Kurtosis
$\text{kurtosis} = E\left(\frac{(X-\mu)^4}{\sigma^4}\right) - 3$
[Figure: example densities f(x) against x for a left skewed, a symmetric and a right skewed distribution, and for kurtosis = 0, > 0 and < 0.]
Lecture 10
Important statistical distributions
What is the probability that of 10 newborn babies at least 7 are boys?
p(girl) = p(boy) = 0.5
Bernoulli distribution
$p(k) = \binom{n}{k} p^k q^{n-k}$
$\sum_{i=0}^{n} p_i = 1$
[Figure: p(X) (0 to 0.3) against X (0 to 10).]
$p(k > 6) = \binom{10}{7} 0.5^7\, 0.5^3 + \binom{10}{8} 0.5^8\, 0.5^2 + \binom{10}{9} 0.5^9\, 0.5^1 + \binom{10}{10} 0.5^{10}\, 0.5^0 = 0.172$
Bernoulli or binomial distribution
$p(k) = \binom{n}{k} p^k q^{n-k}$
$F(k) = p(x \le k) = \sum_{x=0}^{k} \binom{n}{x} p^x q^{n-x}$
$\mu = np$
$\sigma^2 = npq$
[Figure: probabilities (0 to 0.35) for k = 0 to 10 of the distribution $p(k) = \binom{10}{k} 0.2^k\, 0.8^{10-k}$.]
The Bernoulli or binomial distribution comes from the Taylor expansion of the binomial
$(p+q)^n = \sum_{i=0}^{n} \binom{n}{i} p^i q^{n-i} = \sum_{i=0}^{n} \binom{n}{i} p^i (1-p)^{n-i}$
Assume the probability to find a certain disease in a tree population is 0.01. A biomonitoring program surveys 10 stands of trees and takes in each case a random sample of 100 trees. How large is the probability that in these stands 1, 2, 3, and more than 3 cases of this disease will occur?
$p(1) = \binom{1000}{1} 0.01 \cdot 0.99^{999} = 0.0004$
$p(2) = \binom{1000}{2} 0.01^2 \cdot 0.99^{998} = 0.0022$
$p(3) = \binom{1000}{3} 0.01^3 \cdot 0.99^{997} = 0.0074$
Mean, variance, standard deviation
$\mu = 1000 \cdot 0.01 = 10$
$\sigma^2 = 1000 \cdot 0.01 \cdot 0.99 = 9.9$
$\sigma = \sqrt{9.9} = 3.146$
$p(k > 3) = 1 - p(k \le 3) = 1 - \sum_{i=0}^{3} \binom{1000}{i} 0.01^i\, 0.99^{n-i}$
$= 1 - \left[\binom{1000}{0} 0.01^0\, 0.99^{1000} + \binom{1000}{1} 0.01^1\, 0.99^{999} + \binom{1000}{2} 0.01^2\, 0.99^{998} + \binom{1000}{3} 0.01^3\, 0.99^{997}\right] = 0.99$
$p(k) = \binom{n}{k} p^k q^{n-k}$
What happens if the number of trials n becomes larger and larger and the event probability p becomes smaller and smaller?
$\lambda = \frac{rp}{1-p} \Rightarrow p = \frac{\lambda}{r+\lambda}, \quad q = 1-p = \frac{r}{r+\lambda}$
$p(X=k) = \frac{(r+k-1)!}{k!\,(r-1)!}\,\frac{\lambda^k}{(r+\lambda)^k}\,\frac{r^r}{(r+\lambda)^r} = \frac{\lambda^k}{k!}\,\frac{1}{\left(1+\frac{\lambda}{r}\right)^r}\,\frac{(r+k-1)!}{(r-1)!\,(r+\lambda)^k}$
$\lim_{r\to\infty} \frac{1}{\left(1+\frac{\lambda}{r}\right)^r} = e^{-\lambda}$
$\lim_{r\to\infty} \frac{(r+k-1)!}{(r-1)!\,(r+\lambda)^k} = 1$
$p(X=k) = \frac{\lambda^k}{k!} e^{-\lambda}$
Poisson distribution
The distribution of rare events
The tree disease example again (p = 0.01, 10 stands with a random sample of 100 trees each): how large is the probability of 1, 2, 3, and more than 3 cases?
Poisson solution
$\lambda = 1000 \cdot 0.01 = 10$
$p(1) = \frac{10^1}{1!} e^{-10} = 0.00045$
$p(2) = \frac{10^2}{2!} e^{-10} = 0.0023$
$p(3) = \frac{10^3}{3!} e^{-10} = 0.0076$
Bernoulli solution
$p(1) = 0.0004$
$p(2) = 0.0022$
$p(3) = 0.0074$
The probability that no infected tree will be detected
$p(0) = \frac{10^0}{0!} e^{-10} = e^{-10} = 0.000045$
$p(0) = e^{-\lambda}$
The probability of more than three infected trees
$p(0) + p(1) + p(2) + p(3) = 0.000045 + 0.00045 + 0.0023 + 0.0076 = 0.0104$
$p(k > 3) = 1 - 0.0104 = 0.990$
Bernoulli solution
$p(k > 3) = 0.99$
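Both solutions of the tree disease example can be compared side by side. This is a minimal sketch with illustrative function names; n = 1000 and p = 0.01 are taken from the example.

```python
from math import comb, exp, factorial

n, p = 1000, 0.01
lam = n * p  # Poisson parameter, lambda = n * p


def binomial_pmf(k, n, p):
    """Exact binomial probability of k cases among n trees."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Poisson probability lambda^k / k! * e^(-lambda)."""
    return lam ** k / factorial(k) * exp(-lam)


binom = [binomial_pmf(k, n, p) for k in range(4)]
pois = [poisson_pmf(k, lam) for k in range(4)]

for k in range(4):
    print(k, f"{binom[k]:.6f}", f"{pois[k]:.6f}")

# More than three cases: complement of p(0) + p(1) + p(2) + p(3).
binom_tail = 1 - sum(binom)
pois_tail = 1 - sum(pois)
print(round(binom_tail, 3), round(pois_tail, 3))
```

With n = 1000 and p = 0.01 both tails agree to about three decimals, which is why the Poisson form is a convenient stand-in for the binomial when events are rare.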
[Figure: Poisson distributions p(k) (0 to 0.4) for k = 0 to 13 with λ = 1, 2, 3, 4 and 6.]
Variance, mean: $\sigma^2 = \mu = \lambda$
Skewness: $\frac{1}{\sqrt{\lambda}}$
What is the probability in Duży Lotek of a cumulation (no jackpot winner) three times in a row if the first time 14 000 000 people bet, the second time 20 000 000, and the third time 30 000 000?
The probability to win is
$p(6) = \frac{6!\,43!}{49!} \approx \frac{1}{14000000}$
$\lambda_1 = 14000000 \cdot \frac{1}{14000000} = 1$
$\lambda_2 = 20000000 \cdot \frac{1}{14000000} = 1.428571$
$\lambda_3 = 30000000 \cdot \frac{1}{14000000} = 2.142857$
The probability of at least one event:
$p(k \ge 1) = 1 - e^{-\lambda}$
The zero term of the Poisson distribution gives the probability of no event
$p_1 = \frac{1^0}{0!} e^{-1} = 0.368$
$p_2 = \frac{1.428571^0}{0!} e^{-1.428571} = 0.239$
$p_3 = \frac{2.142857^0}{0!} e^{-2.142857} = 0.117$
The events are independent:
$p_{1,2,3} = 0.368 \cdot 0.239 \cdot 0.117 = 0.01$
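The lottery calculation can be sketched in Python. The sketch follows the slide in rounding the winning probability to 1/14 000 000 (the exact value is 1/C(49, 6)) and in treating the three draws as independent; variable names are illustrative.

```python
from math import comb, exp, prod

bets_per_draw = [14_000_000, 20_000_000, 30_000_000]

exact_combinations = comb(49, 6)   # number of possible 6-out-of-49 tickets
p_win = 1 / 14_000_000             # rounded value used on the slide

# Expected number of winners per draw, lambda = bets * p_win.
lambdas = [bets * p_win for bets in bets_per_draw]

# Zero term of the Poisson distribution: probability of no winner.
p_no_winner = [exp(-lam) for lam in lambdas]

# Independent draws: multiply the three probabilities.
p_three_cumulations = prod(p_no_winner)

print(exact_combinations)
print([round(p, 3) for p in p_no_winner], round(p_three_cumulations, 3))
```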
A pile model to generate the binomial.
If the number of steps is very, very large the binomial becomes smooth.
Abraham de Moivre (1667-1754)
$f(x) = C e^{-x^2}$
The normal distribution is the continuous equivalent to the discrete Bernoulli distribution
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
[Figure: several frequency distributions of X (from -2 to 2).]
The central limit theorem
If we have a series of random variates Xn, a new random variate Yn that is the sum of all Xn
will for n→∞ be a variate that is asymptotically normally distributed.
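A small simulation illustrates the theorem. This is a sketch under arbitrary assumptions: each Y is the sum of 12 uniform random numbers between 0 and 1 (expected mean 6, expected variance 1), the seed is fixed for reproducibility, and all names are illustrative.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

N_TERMS = 12       # number of uniform variates X summed into one Y
N_SAMPLES = 20000  # number of simulated sums


def sum_of_uniforms(n_terms):
    """One realization of Y = X_1 + ... + X_n with X uniform on [0, 1)."""
    return sum(random.random() for _ in range(n_terms))


samples = [sum_of_uniforms(N_TERMS) for _ in range(N_SAMPLES)]
mean = statistics.fmean(samples)
sd = statistics.pstdev(samples)

# Share of sums within 1 and 2 standard deviations of the mean;
# for a normal distribution these are about 68% and 95%.
within_1sd = sum(abs(y - mean) < sd for y in samples) / N_SAMPLES
within_2sd = sum(abs(y - mean) < 2 * sd for y in samples) / N_SAMPLES

print(round(mean, 2), round(sd, 2), round(within_1sd, 2), round(within_2sd, 2))
```

Although a single uniform variate is flat, the sums already show the shares expected for a normal distribution.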
[Figure: frequency distributions of X (from -2 to 2) in several panels.]
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
[Figure: f(x) (0 to 0.06) against X (0 to 5).]
The normal or Gaussian distribution
$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(v-\mu)^2}{2\sigma^2}}\,dv$
Mean: $\mu$
Variance: $\sigma^2$
[Figure: f(x) (0 to 1.2) against X (0 to 5).]
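Important features of the normal distribution
• The function is defined for every real x.
• The frequency at x = μ is given by $p(x=\mu) = \frac{1}{\sigma\sqrt{2\pi}} \approx \frac{0.4}{\sigma}$
• The distribution is symmetrical around μ.
• The points of inflection are given by the second derivative. Setting this to zero gives $(x-\mu) = \pm\sigma \rightarrow x = \mu \pm \sigma$
[Figure: f(x) (0 to 0.06) against X (0 to 5); the area between μ - σ and μ + σ is 0.68, the area between μ - 2σ and μ + 2σ is 0.95.]
$F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(v-\mu)^2}{2\sigma^2}}\,dv$
$\frac{1}{\sigma\sqrt{2\pi}} \int_{\mu-\sigma}^{\mu+\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \approx 0.68$
$\frac{1}{\sigma\sqrt{2\pi}} \int_{\mu-2\sigma}^{\mu+2\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \approx 0.95$
$\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\mu} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = 0.5$
$\frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\mu+2\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \approx 0.975$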
Many statistical tests compare observed values with
those of the standard normal distribution and assign
the respective probabilities to H1.
The Z-transform
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
$Z = \frac{x-\mu}{\sigma}$
The standard normal
$f(Z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} Z^2}$
The variate Z has a mean of 0 and a variance of 1.
A Z-transform standardizes every statistical distribution to a mean of 0 and a variance of 1.
Tables of statistical distributions are always given as Z-transforms.
The Z-transformed (standardized) normal distribution
[Figure: f(x) (0 to 0.06) against X (0 to 5) with the ranges μ ± σ (0.68) and μ ± 2σ (0.95) marked; three smaller panels show distributions before standardization.]
The 95% confidence limit
P(μ - σ < X < μ + σ) = 68%
P(μ - 1.65σ < X < μ + 1.65σ) = 90%
P(μ - 1.96σ < X < μ + 1.96σ) = 95%
P(μ - 2.58σ < X < μ + 2.58σ) = 99%
P(μ - 3.29σ < X < μ + 3.29σ) = 99.9%
The Fisherian significance levels
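The listed probabilities follow from the standard normal distribution and can be reproduced with the error function from the Python standard library. This is a sketch; the function name `coverage` is illustrative.

```python
from math import erf, sqrt


def coverage(z):
    """P(mu - z*sigma < X < mu + z*sigma) for a normal variate X."""
    return erf(z / sqrt(2))


for z in (1.0, 1.65, 1.96, 2.58, 3.29):
    print(z, f"{100 * coverage(z):.1f}%")
```

Because of the Z-transform the result depends only on z, the distance from the mean in units of the standard deviation, not on μ or σ themselves.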
The estimation of the population mean from a series of samples
Each of the n samples has its own mean and standard deviation (x̄, s).
$\mu \approx \frac{1}{n}\sum_{i=1}^{n} \bar{x}_i$
$\sigma^2 \approx \frac{1}{n}\sum_{i=1}^{n} s_i^2$
The n samples come from an additive random variate.
Z is asymptotically normally distributed.
$Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}$
[Figure: distributions f(x) of the sample means for n=10, n=20 and n=50.]
Standard error: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
Confidence limit of the estimate of a mean from a series of samples:
$\mu = \bar{x} \pm z_\alpha \frac{\sigma}{\sqrt{n}}$
α is the desired probability level.
[Figure: f(x) (0 to 0.06) against X (0 to 5) with the ranges ± 1 standard error (0.68) and ± 2 standard errors (0.95) marked.]
How to apply the normal distribution
Intelligence is approximately normally distributed with a mean of 100 (by definition)
and a standard deviation of 16 (in North America). For an intelligence study we need
100 persons with an IQ above 130. How many persons do we have to test to find this
number if we take random samples (and do not test university students only)?
$F(x \ge 130) = \frac{1}{\sigma\sqrt{2\pi}} \int_{130}^{\infty} e^{-\frac{(v-\mu)^2}{2\sigma^2}}\,dv = 1 - \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{130} e^{-\frac{(v-\mu)^2}{2\sigma^2}}\,dv$
$\Phi(z) = \Phi\left(\frac{a-\mu}{\sigma}\right) = F(x \le a)$
[Figure: f(IQ) (0 to 0.03) against IQ (40 to 160), divided into the areas IQ<130 and IQ>130.]
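The IQ question can be answered numerically with the complementary error function. This is a sketch assuming exactly the stated mean of 100 and standard deviation of 16; the function name `upper_tail` is illustrative.

```python
from math import ceil, erfc, sqrt

MU = 100.0     # mean IQ
SIGMA = 16.0   # standard deviation of IQ


def upper_tail(x, mu, sigma):
    """P(X > x) for a normal variate, via the Z-transform z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return 0.5 * erfc(z / sqrt(2))


p_above_130 = upper_tail(130, MU, SIGMA)

# Expected number of persons to test to find 100 with an IQ above 130.
persons_to_test = ceil(100 / p_above_130)
print(round(p_above_130, 4), persons_to_test)
```

Only about 3% of a random sample exceeds an IQ of 130, so roughly 3300 persons have to be tested.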
One and two sided tests
We measure blood sugar concentrations and know that our method estimates the
concentration with an error of about 3%. What is the probability that our
measurement deviates from the real value by more than 5%?
Albinos are rare in human populations.
Assume their frequency is 1 per 100000 persons. What is the probability to find 15
albinos among 1000000 persons?
1000000 
15
999985
p ( X  15)  
(0.00001)
(0.99999)

 15 
=KOMBINACJE(1000000,15)*0.00001^15*(1-0.00001)^999985 = 0.0347
  np
 2  npq
Related documents
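The albino example can be reproduced in Python, both with the exact binomial term used in the spreadsheet formula and with the Poisson approximation (λ = np = 10). This is a sketch with illustrative variable names.

```python
from math import comb, exp, factorial

n = 1_000_000   # persons
p = 0.00001     # frequency of albinos, 1 per 100000
k = 15          # number of albinos asked for

# Exact binomial probability, as in the spreadsheet formula.
binomial = comb(n, k) * p ** k * (1 - p) ** (n - k)

# Poisson approximation with lambda = n * p.
lam = n * p
poisson = lam ** k / factorial(k) * exp(-lam)

print(round(binomial, 4), round(poisson, 4))
```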