Example of empirical statistics of a set of data BA_FSM 2014/2015
Data of body heights of n = 46 girls, students of the University of Finance and
Administration
Tab. 1.
No.   Body height, cm      No.   Body height, cm
89    151                  90    167
34    157                  1     168
51    158                  45    168
94    158                  40    168
32    160                  82    168
41    161                  92    168
83    162                  95    170
31    163                  2     170
81    163                  85    170
4     164                  35    170
33    164                  80    170
37    164                  50    171
87    164                  36    172
88    164                  6     173
7     165                  46    173
3     165                  47    173
39    165                  38    175
84    165                  43    176
96    165                  93    176
49    166                  86    176
44    167                  42    177
91    167                  5     180
48    167                  97    185
Range of the set
$R = x_{\max} - x_{\min} = 185 - 151 = 34$ cm
Quantiles
specifically the median and the quartiles
$x_{25} = 164$ cm
$x_{50} = 167$ cm (median)
$x_{75} = 171{,}5$ cm
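The range and the quartiles can be reproduced with a short Python sketch. The text does not state which quantile convention it uses; the helper `midpoint_quantile` below is a hypothetical name and assumes the 1-based position p·(n + 1), with a fractional position resolved by averaging the two neighbouring ordered values. Other common conventions give a slightly different upper quartile for these data.

```python
import math

# Body heights in cm from Tab. 1, stored as {height: number of students}.
HEIGHT_COUNTS = {
    151: 1, 157: 1, 158: 2, 160: 1, 161: 1, 162: 1, 163: 2, 164: 5,
    165: 5, 166: 1, 167: 4, 168: 5, 170: 5, 171: 1, 172: 1, 173: 3,
    175: 1, 176: 3, 177: 1, 180: 1, 185: 1,
}
# Expand the counts into the sorted list of 46 individual heights.
heights = sorted(h for h, c in HEIGHT_COUNTS.items() for _ in range(c))


def midpoint_quantile(sorted_values, p):
    """Hypothetical helper: quantile at 1-based position p * (n + 1).

    Assumed convention (not stated in the text): a fractional position
    is resolved by averaging the two neighbouring ordered values.
    """
    pos = p * (len(sorted_values) + 1)
    lower = sorted_values[math.floor(pos) - 1]
    upper = sorted_values[math.ceil(pos) - 1]
    return (lower + upper) / 2


data_range = heights[-1] - heights[0]
q25, q50, q75 = (midpoint_quantile(heights, p) for p in (0.25, 0.50, 0.75))
print(data_range, q25, q50, q75)  # 34 164.0 167.0 171.5
```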
Arithmetic average (mean value)
$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
= \frac{151 + 157 + 2\cdot158 + 160 + 161 + 162 + 2\cdot163 + 5\cdot164 + 5\cdot165 + 166 + 4\cdot167
+ 5\cdot168 + 5\cdot170 + 171 + 172 + 3\cdot173 + 175 + 3\cdot176 + 177 + 180 + 185}{46}
= 167{,}59 \approx 167{,}6 \text{ cm}
$$
This calculation and the following procedures can be simplified by dividing the data into a
smaller number of groups (intervals). Sturges' rule, which gives the approximate number of
intervals k, has the form
$$k = 1 + 3{,}3 \log_{10} n$$
In our case $k = 1 + 3{,}3 \log_{10} 46 = 6{,}49 \approx 6$. We build 6 intervals.
Using the range R of the set we estimate the width of one interval as 34/6 = 5,67 ≈ 5 cm
(5 cm is preferred to 6 cm because of the distant values on both ends, which shift the mean
value to higher values).
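As a quick check of the arithmetic, Sturges' rule and the width estimate can be evaluated in Python. This is only a sketch: the variable names are illustrative, and the base-10 logarithm with the factor 3,3 is taken from the formula in the text.

```python
import math

n = 46                   # number of students
data_range = 185 - 151   # range R in cm, from the text

# Sturges' rule in the form used in the text: k = 1 + 3.3 * log10(n)
k_exact = 1 + 3.3 * math.log10(n)
k = round(k_exact)       # nearest whole number of intervals

# Raw estimate of the interval width; the text then chooses 5 cm.
width_estimate = data_range / k
print(round(k_exact, 2), k, round(width_estimate, 2))  # 6.49 6 5.67
```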
1. interval up to 157 cm
2. interval 158 – 162 cm
3. interval 163 – 167 cm
4. interval 168 – 172 cm
5. interval 173 – 177 cm
6. interval 178 cm and more
As the representative value of body height in each interval we choose the center of the
interval. For the two open-ended outer intervals we keep the same spacing of 5 cm, which
gives the centers 155 cm and 180 cm.
Tab. 2.
Interval       Center of interval, cm   xi   ni   ni / n   Σ ni / n
up to 157      155                      1    2    0,043    0,043
158-162        160                      2    5    0,109    0,152
163-167        165                      3    17   0,370    0,522
168-172        170                      4    12   0,261    0,783
173-177        175                      5    8    0,174    0,957
178 and more   180                      6    2    0,043    1,000
∑                                            46   1
column marked xi – scale elements
column marked ni – absolute frequencies of scale elements
column marked ni / n – relative frequencies of scale elements
column marked Σ (ni / n) – cumulative frequencies
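The frequency columns of Tab. 2 can be rebuilt from the raw heights of Tab. 1. The following Python sketch assumes the interval bounds listed above; `UPPER_BOUNDS` and `scale_element` are illustrative names, not part of the original text.

```python
from itertools import accumulate

# Body heights in cm from Tab. 1, stored as {height: number of students}.
HEIGHT_COUNTS = {
    151: 1, 157: 1, 158: 2, 160: 1, 161: 1, 162: 1, 163: 2, 164: 5,
    165: 5, 166: 1, 167: 4, 168: 5, 170: 5, 171: 1, 172: 1, 173: 3,
    175: 1, 176: 3, 177: 1, 180: 1, 185: 1,
}
# Inclusive upper bounds (cm) of intervals 1..5; interval 6 is open-ended.
UPPER_BOUNDS = [157, 162, 167, 172, 177]


def scale_element(height):
    """Return the interval number 1..6 for a body height in cm."""
    for i, bound in enumerate(UPPER_BOUNDS, start=1):
        if height <= bound:
            return i
    return len(UPPER_BOUNDS) + 1


n = sum(HEIGHT_COUNTS.values())
freq = [0] * 6                      # absolute frequencies n_i
for height, count in HEIGHT_COUNTS.items():
    freq[scale_element(height) - 1] += count
rel = [f / n for f in freq]         # relative frequencies n_i / n
cum = list(accumulate(rel))         # cumulative relative frequencies

for i, (f, r, s) in enumerate(zip(freq, rel, cum), start=1):
    print(i, f, f"{r:.3f}", f"{s:.3f}")
```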
Fig. 1. Polygon of the relative frequencies (vertical axis: ni/n, from 0,00 to 0,40; horizontal axis: scale elements xi = 1 to 6)
Fig. 2. Polygon of the cumulative relative frequencies (vertical axis: ∑ni/n, from 0,00 to 1,00; horizontal axis: scale elements xi = 1 to 6)
Arithmetic average (mean value)
(using the representative values of the respective intervals)
$$
\bar{x} = \frac{1}{46}\sum_{i=1}^{6} n_i x_i
= \frac{2\cdot155 + 5\cdot160 + 17\cdot165 + 12\cdot170 + 8\cdot175 + 2\cdot180}{46}
= 167{,}72 \approx 167{,}7 \text{ cm}
$$
or alternatively in the values of the new scale
$$
\bar{x} = \frac{1}{46}\sum_{i=1}^{6} n_i x_i
= \frac{1\cdot2 + 2\cdot5 + 3\cdot17 + 4\cdot12 + 5\cdot8 + 6\cdot2}{46}
= 3{,}543
$$
The mean value of the raw data (167,6 cm) and the mean value of the grouped data (167,7 cm)
differ slightly because we used the interval centers instead of the mean values of the
individual intervals.
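The three mean values can be compared side by side. The sketch below is a minimal Python check under the assumption that the raw heights are those of Tab. 1 and the grouped frequencies are those of Tab. 2; all names are illustrative.

```python
# Raw body heights in cm from Tab. 1, stored as {height: number of students}.
HEIGHT_COUNTS = {
    151: 1, 157: 1, 158: 2, 160: 1, 161: 1, 162: 1, 163: 2, 164: 5,
    165: 5, 166: 1, 167: 4, 168: 5, 170: 5, 171: 1, 172: 1, 173: 3,
    175: 1, 176: 3, 177: 1, 180: 1, 185: 1,
}
CENTERS_CM = [155, 160, 165, 170, 175, 180]   # interval centers from Tab. 2
FREQ = [2, 5, 17, 12, 8, 2]                   # absolute frequencies n_i

n = sum(FREQ)

# Mean of the raw data.
raw_mean_cm = sum(h * c for h, c in HEIGHT_COUNTS.items()) / n
# Mean of the grouped data, using the interval centers in cm.
grouped_mean_cm = sum(f * c for f, c in zip(FREQ, CENTERS_CM)) / n
# Mean of the grouped data on the scale 1 to 6.
grouped_mean_scale = sum(f * x for x, f in enumerate(FREQ, start=1)) / n

print(round(raw_mean_cm, 2), round(grouped_mean_cm, 2), round(grouped_mean_scale, 3))
# 167.59 167.72 3.543
```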
Variance $S_x^2$ and standard deviation $S_x$ (using interval centers)
$$
S_x^2 = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{x})^2
= \frac{2(155 - 167{,}7)^2 + 5(160 - 167{,}7)^2 + 17(165 - 167{,}7)^2 + 12(170 - 167{,}7)^2
+ 8(175 - 167{,}7)^2 + 2(180 - 167{,}7)^2}{46}
= 33{,}38 \text{ cm}^2
$$
where k = 6 is the number of intervals.
alternatively, in the values of the scale 1 to 6,
$$
S_x^2 = \frac{2(1 - 3{,}543)^2 + 5(2 - 3{,}543)^2 + 17(3 - 3{,}543)^2 + 12(4 - 3{,}543)^2
+ 8(5 - 3{,}543)^2 + 2(6 - 3{,}543)^2}{46}
= 1{,}335
$$
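Standard deviation
$S_x = \sqrt{S_x^2} = 5{,}777$ cm, alternatively $S_x = 1{,}155$ in the values of the scale 1 to 6.
The standard deviation shows how much weight the arithmetic average carries: if the standard
deviation is large, the values are widely scattered and the average is a less representative
summary of the data; if it is small, the average is more representative.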
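The variance and standard deviation of the grouped data can be checked in both units with a few lines of Python. This is a sketch only: `weighted_variance` is a hypothetical helper, and it divides by n (not n − 1), as the formula in the text does.

```python
import math

CENTERS_CM = [155, 160, 165, 170, 175, 180]   # interval centers from Tab. 2
SCALE = [1, 2, 3, 4, 5, 6]                    # scale elements x_i
FREQ = [2, 5, 17, 12, 8, 2]                   # absolute frequencies n_i


def weighted_variance(values, weights):
    """Population variance of values weighted by frequencies (divides by n)."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    return sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total


var_cm = weighted_variance(CENTERS_CM, FREQ)    # in cm^2
var_scale = weighted_variance(SCALE, FREQ)      # in squared scale units
std_cm = math.sqrt(var_cm)
std_scale = math.sqrt(var_scale)

print(round(var_cm, 2), round(std_cm, 3))       # 33.38 5.777
print(round(var_scale, 3), round(std_scale, 3)) # 1.335 1.155
```

Because the interval centers are spaced 5 cm apart, the standard deviation in cm is five times the standard deviation on the scale 1 to 6.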
Moments
We calculate the moments of our set of data (empirical moments) and compare them with the
corresponding moments of the standard normal (Gaussian) distribution. To calculate these
moments we use the scale elements and the frequencies of the scale elements shown in Tab. 2.
We distinguish:
General moments, i.e. parameter of position (location of the distribution),
Central moments, i.e. parameter of variance (width of the distribution),
Standardized moments, i.e. parameters of skewness and of kurtosis.
In order to calculate the general moments we form Tab. 3, which contains the following
columns:
1. column contains $x_i$
2. column contains $n_i$
3. column contains the products $n_i x_i$
4. column contains the products $n_i x_i^2$
5. column contains the products $n_i x_i^3$
6. column contains the products $n_i x_i^4$
Tab. 3.
xi   ni   ni·xi   ni·xi²   ni·xi³   ni·xi⁴
1    2    2       2        2        2
2    5    10      20       40       80
3    17   51      153      459      1377
4    12   48      192      768      3072
5    8    40      200      1000     5000
6    2    12      72       432      2592
∑    46   163     639      2701     12123
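The rows and column sums of Tab. 3 follow directly from the frequencies. A minimal Python sketch, with illustrative variable names:

```python
SCALE = [1, 2, 3, 4, 5, 6]      # scale elements x_i
FREQ = [2, 5, 17, 12, 8, 2]     # absolute frequencies n_i

# One row per scale element: x_i, n_i, n_i*x_i, n_i*x_i^2, n_i*x_i^3, n_i*x_i^4
rows = [(x, f, f * x, f * x**2, f * x**3, f * x**4) for x, f in zip(SCALE, FREQ)]

# Column sums, skipping the first column (the scale elements themselves).
totals = [sum(column) for column in zip(*rows)][1:]

for row in rows:
    print(*row)
print("sum", *totals)  # sum 46 163 639 2701 12123
```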
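General moments
$$O_1 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i = \frac{163}{46} = 3{,}543$$
$$O_2 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i^2 = \frac{639}{46} = 13{,}89$$
$$O_3 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i^3 = \frac{2701}{46} = 58{,}72$$
$$O_4 = \frac{1}{n}\sum_{i=1}^{6} n_i x_i^4 = \frac{12123}{46} = 263{,}5$$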
The general moment of the first order O1 = 3,543 is in fact the arithmetic average $\bar{x}$
expressed in elements of the scale 1 to 6.
It is simple to transform this value to cm: (a) 3,543 belongs to the 3rd interval with central
value 165 cm, (b) the remaining 0,543 is the corresponding part of the interval width, i.e.
0,543 · 5 cm = 2,7 cm, (c) adding (a) and (b) gives the average value 167,7 cm.
The general moment of the first order O1 is the parameter of position. The body heights lie
around the mean value 167,7 cm.
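The general moments and the conversion of O1 from scale units to cm can be sketched in Python as follows. The names `general_moment`, `WIDTH_CM` and `CENTER_OF_INTERVAL_3_CM` are illustrative; the interval width of 5 cm and the center 165 cm are taken from Tab. 2.

```python
FREQ = [2, 5, 17, 12, 8, 2]      # absolute frequencies n_i for x_i = 1..6
n = sum(FREQ)


def general_moment(order):
    """General (raw) moment of the given order on the scale 1 to 6."""
    return sum(f * x**order for x, f in enumerate(FREQ, start=1)) / n


O1, O2, O3, O4 = (general_moment(r) for r in (1, 2, 3, 4))

# Convert the mean from scale units to cm: scale element 3 corresponds to
# the interval center 165 cm, and one scale step is one interval width.
WIDTH_CM = 5
CENTER_OF_INTERVAL_3_CM = 165
mean_cm = CENTER_OF_INTERVAL_3_CM + (O1 - 3) * WIDTH_CM

print(round(O1, 3), round(O2, 2), round(O3, 2), round(O4, 1))  # 3.543 13.89 58.72 263.5
print(round(mean_cm, 1))                                       # 167.7
```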
The central moments can be simply calculated using the general moments.
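Central moments
$$C_1 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^1 = O_1 - O_1 = 0$$
$$C_2 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^2 = O_2 - O_1^2 = 1{,}335$$
$$C_3 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^3 = O_3 - 3 O_2 O_1 + 2 O_1^3 = 0{,}0323$$
$$C_4 = \frac{1}{n}\sum_{i=1}^{6} n_i (x_i - \bar{x})^4 = O_4 - 4 O_3 O_1 + 6 O_2 O_1^2 - 3 O_1^4 = 4{,}846$$
The numerical values are evaluated with the unrounded general moments (O1 = 163/46,
O2 = 639/46, O3 = 2701/46, O4 = 12123/46). These expressions subtract large, nearly equal
terms, so inserting general moments rounded to three or four digits changes the last digits
of the results.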
Central moments are calculated with respect to the central value $\bar{x}$ (the arithmetic
average). The sum of the deviations from $\bar{x}$ towards higher values (positive) and the sum
of the deviations from $\bar{x}$ towards lower values (negative) have the same absolute value
but opposite signs, so the central moment of the first order, C1, is always equal to zero.
The central moment of the second order, C2, is the variance $S_x^2$ and is the parameter of
width. $S_x = \sqrt{C_2}$ is called the standard deviation.
In our case
$S_x = \sqrt{C_2} = 1{,}155$. This value is again expressed in elements of the scale 1 to 6.
To transform this value to cm we calculate $S_x$ (cm) = 1,155 · 5 cm ≈ 5,78 cm. This
agrees with the value obtained above from the interval centers.
The central moments of the third and fourth order are used to calculate further empirical
parameters.
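The relations between central and general moments can be cross-checked against the direct definition. The sketch below uses Python's `fractions.Fraction` so that no rounding enters. This matters because the formulas subtract large, nearly equal terms and are sensitive to rounded inputs. The function names are illustrative.

```python
from fractions import Fraction

FREQ = [2, 5, 17, 12, 8, 2]      # absolute frequencies n_i for x_i = 1..6
n = sum(FREQ)


def general_moment(order):
    """Exact general moment of the given order on the scale 1 to 6."""
    return Fraction(sum(f * x**order for x, f in enumerate(FREQ, start=1)), n)


def central_moment(order):
    """Exact central moment computed directly from the deviations."""
    mean = general_moment(1)
    return sum(f * (x - mean) ** order for x, f in enumerate(FREQ, start=1)) / n


O1, O2, O3, O4 = (general_moment(r) for r in (1, 2, 3, 4))

# Central moments expressed through the general moments.
C2 = O2 - O1**2
C3 = O3 - 3 * O2 * O1 + 2 * O1**3
C4 = O4 - 4 * O3 * O1 + 6 * O2 * O1**2 - 3 * O1**4

# Both routes must agree exactly.
assert central_moment(1) == 0
assert (C2, C3, C4) == (central_moment(2), central_moment(3), central_moment(4))

print(round(float(C2), 3), round(float(C3), 4), round(float(C4), 3))
# 1.335 0.0323 4.846
```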
Standardized moments
The parameter of skewness is calculated using the standardized moment of the third order $N_3$
and is called the coefficient of skewness.
$$N_3 = \frac{C_3}{C_2 \sqrt{C_2}} = 0{,}021$$
If the coefficient of skewness is positive, the scale elements to the left of the average
have higher frequencies (positively skewed distribution of frequencies, i.e. a higher
concentration of smaller scale elements), and vice versa for negative $N_3$.
The present distribution is slightly positively skewed, which means that there are more girls
below the average body height of 167,7 cm than above it (see the input data).
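A short Python sketch of the skewness calculation, together with a count of the raw heights below and above the mean. It assumes the grouped frequencies of Tab. 2 for the moments and the raw data of Tab. 1 for the count; the names are illustrative.

```python
# Raw body heights in cm from Tab. 1, stored as {height: number of students}.
HEIGHT_COUNTS = {
    151: 1, 157: 1, 158: 2, 160: 1, 161: 1, 162: 1, 163: 2, 164: 5,
    165: 5, 166: 1, 167: 4, 168: 5, 170: 5, 171: 1, 172: 1, 173: 3,
    175: 1, 176: 3, 177: 1, 180: 1, 185: 1,
}
FREQ = [2, 5, 17, 12, 8, 2]      # absolute frequencies n_i for x_i = 1..6
n = sum(FREQ)
mean_scale = sum(f * x for x, f in enumerate(FREQ, start=1)) / n


def central_moment(order):
    """Central moment of the given order on the scale 1 to 6."""
    return sum(f * (x - mean_scale) ** order
               for x, f in enumerate(FREQ, start=1)) / n


C2 = central_moment(2)
C3 = central_moment(3)
N3 = C3 / (C2 * C2 ** 0.5)       # coefficient of skewness

MEAN_CM = 167.7                  # grouped mean in cm, as used in the text
below = sum(c for h, c in HEIGHT_COUNTS.items() if h < MEAN_CM)
above = sum(c for h, c in HEIGHT_COUNTS.items() if h > MEAN_CM)

print(round(N3, 3), below, above)  # 0.021 24 22
```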
The parameter of pointedness is usually determined using the standardized moment of the fourth
order $N_4$ and is called the coefficient of kurtosis.
$$N_4 = \frac{C_4}{C_2^2} = 2{,}719$$
The higher $N_4$ is, the more pointed is the distribution of frequencies at a given variance.
The distribution of frequencies is usually compared with the standardized normal distribution.
The standardized moments show in which features our data differ from the standardized
normal (Gaussian) distribution, for which $N_3 = 0$ and $N_4 = 3$. The quantity "excess" is also
used, defined by the relation $E_x = N_4 - 3$. If $E_x$ is positive, the studied empirical
distribution is more pointed than the standardized normal distribution; if $E_x$ is negative, it
is flatter. In our case $E_x = N_4 - 3 = -0{,}281$, which means that our distribution of
frequencies is flatter than the standardized normal distribution.
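The kurtosis and the excess can be computed the same way. This is a minimal Python sketch on the grouped frequencies of Tab. 2, evaluated without intermediate rounding; the names are illustrative.

```python
FREQ = [2, 5, 17, 12, 8, 2]      # absolute frequencies n_i for x_i = 1..6
n = sum(FREQ)
mean_scale = sum(f * x for x, f in enumerate(FREQ, start=1)) / n


def central_moment(order):
    """Central moment of the given order on the scale 1 to 6."""
    return sum(f * (x - mean_scale) ** order
               for x, f in enumerate(FREQ, start=1)) / n


C2 = central_moment(2)
C4 = central_moment(4)

N4 = C4 / C2**2                  # coefficient of kurtosis
excess = N4 - 3                  # 0 for a normal distribution

print(round(N4, 3), round(excess, 3))  # 2.719 -0.281
```

A negative excess indicates a distribution flatter than the normal one; the standardized moments are dimensionless, so the same values result whether the heights are expressed in scale units or in cm.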
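The meaning of the standard deviation for the theoretical normal distribution
the interval $\langle \bar{x} - S_x;\ \bar{x} + S_x \rangle$ contains about 68 % of all values x
the interval $\langle \bar{x} - 2S_x;\ \bar{x} + 2S_x \rangle$ contains about 95 % of all values x
the interval $\langle \bar{x} - 3S_x;\ \bar{x} + 3S_x \rangle$ contains about 99,7 % of all values x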
In our case:
167,7 ± 5,8 cm. The body heights of 33 girls lie in this interval, i.e. 71,7 % of the girls.
167,7 ± 11,6 cm. The body heights of 43 girls lie in this interval, i.e. 93,5 % of the girls.
167,7 ± 17,4 cm. The body heights of all 46 girls lie in this interval, i.e. 100 % of the girls.
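The counts in the three intervals can be verified against the raw data of Tab. 1. The Python sketch below uses the rounded values 167,7 cm and 5,8 cm from the text; the variable names are illustrative.

```python
# Raw body heights in cm from Tab. 1, stored as {height: number of students}.
HEIGHT_COUNTS = {
    151: 1, 157: 1, 158: 2, 160: 1, 161: 1, 162: 1, 163: 2, 164: 5,
    165: 5, 166: 1, 167: 4, 168: 5, 170: 5, 171: 1, 172: 1, 173: 3,
    175: 1, 176: 3, 177: 1, 180: 1, 185: 1,
}
MEAN_CM = 167.7   # grouped mean, rounded as in the text
STD_CM = 5.8      # standard deviation, rounded as in the text

n = sum(HEIGHT_COUNTS.values())
counts = []
for multiple in (1, 2, 3):
    low = MEAN_CM - multiple * STD_CM
    high = MEAN_CM + multiple * STD_CM
    # Number of students whose height lies within mean +/- multiple * std.
    inside = sum(c for h, c in HEIGHT_COUNTS.items() if low <= h <= high)
    counts.append(inside)
    print(multiple, inside, f"{100 * inside / n:.1f} %")
```

For comparison, the theoretical normal distribution gives about 68 %, 95 % and 99,7 % for the same three intervals.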
Conclusion: The empirical statistics of our data lead to the conclusion that the data are
close to the theoretical normal distribution.