Statistical Description (1)
Xiaojin Yu
Introductory biostatistics
http://www.hstathome.com/tjziyuan/Introductory%20Biostatistics%20Le%20C.T.%20%20(Wiley,%202003)(T)(551s).pdf
Introductory biostatistics for the health science
http://faculty.ksu.edu.sa/hisham/Documents/eBooks/Introductory_Biostatistics_for_the_Health.pdf
Review
What is Medical statistics about?
key terms in Statistics
1.2 key words in Statistics
Population (individual) & sample
Variation & random variable
Random Variable & data
Statistic & parameter
Sampling error
Probability
Framework of statistical analysis
Population (individuals, variation) -> random sampling -> sample (representative, with sampling error)
Parameter (population, unknown) <- statistical inference, based on probability <- statistic (sample, known)
Statistical description summarizes the sample.
Statistical Description
CONTENTS
For quantitative(numerical) data
Frequency distribution
Measures of central tendency
Measures of dispersion
For qualitative(categorical) data
Raw Data (quantitative)
Example: 120 values of height (cm) for 12-year-old boys in 1997:
142.3  134.4  150.3  141.9  143.5  138.1  142.9  140.9  134.7  141.2
135.5  140.2  156.6  148.8  133.1  140.7  139.2  140.2  134.9  141.4
138.5  148.9  144.4  145.4  142.7  137.9  142.7  141.2  144.7  137.4
143.6  160.9  138.9  154.0  143.4  142.4  145.7  151.3  143.9  141.5
139.3  145.1  142.3  154.2  137.7  147.7  137.4  148.9  138.2  140.8
151.1  148.8  141.9  145.8  125.9  137.9  138.5  152.3  143.6  146.7
141.6  149.8  144.0  140.1  147.8  147.9  132.7  139.9  139.6  146.6
150.0  139.2  142.5  145.2  145.4  150.6  140.5  150.8  152.9  149.7
143.5  132.1  143.3  139.6  130.5  141.8  146.2  139.5  138.9  144.5
147.9  147.5  142.9  145.9  146.5  142.4  134.5  146.8  143.3  146.4
134.7  137.1  141.8  136.9  129.4  146.7  149.0  138.7  148.8  135.1
156.3  143.8  147.3  147.1  141.4  148.1  142.5  144.0  142.1  139.9
Data Summary
For continuous variable data
Numerical methods
Description of central tendency
Description of dispersion
Tabular and graphical methods
Tabular & Graphical Methods
Frequency table
Histogram
FREQUENCY TABLE
Class Interval     Frequency   Relative
for Height (cm)    (f)         Frequency
124~                 1         0.0083
128~                 2         0.0167
132~                10         0.0833
136~                22         0.1833
140~                37         0.3083
144~                26         0.2167
148~                15         0.1250
152~                 4         0.0333
156~                 2         0.0167
160~                 1         0.0083
Total              120         1.0000
SOLUTION TO EXAMPLE
1. Choose the number of intervals: k = 10
2. Calculate the width:
R = Xmax - Xmin = 160.9 - 125.9 = 35
w = R/k = 35/10 = 3.5, rounded up to 4 for convenience
3. Form the intervals
4. Count the frequency in each interval
A recommended further step is to present the proportion, or relative frequency.
Class intervals
Class Interval
for Height (cm)
124~
128~
132~
136~
140~
144~
148~
152~
156~
160~
Total
Tally and Counting
Class Interval     Tally mark           Frequency
for Height (cm)                         (f)
124~               一                      1
128~               T                       2
132~               正正                    10
136~               正正正正 T              22
140~               正正正正正正正 T        37
144~               正正正正正 一           26
148~               正正正                  15
152~               (4 strokes)             4
156~               T                       2
160~               一                      1
Total                                    120
(Each complete 正 stands for five observations; 一 is one stroke and T is two.)
Final Frequency Table
Class Interval     Frequency   Relative    Cumulative   Cumulative
for Height (cm)    (f)         Frequency   frequency    relative frequency
124~                 1         0.0083        1          0.0083
128~                 2         0.0167        3          0.0250
132~                10         0.0833       13          0.1083
136~                22         0.1833       35          0.2917
140~                37         0.3083       72          0.6000
144~                26         0.2167       98          0.8167
148~                15         0.1250      113          0.9417
152~                 4         0.0333      117          0.9750
156~                 2         0.0167      119          0.9917
160~                 1         0.0083      120          1.0000
Total              120         1.0000
Relative frequency = (frequency within a certain interval) / (total number of observations), multiplied by 100 when expressed as a percentage.
A recommended step is to present the proportion, or relative frequency.
Basic Steps to Form
Frequency Table
Step 1: Determine the number of intervals (5 to 15).
Step 2: Calculate the width of the intervals.
Step 3: Form the intervals, each covering a certain range of values.
Step 4: Count the number of observations within each interval.
The final table consists of the intervals and their frequencies.
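The four steps can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function name `frequency_table` and its parameters are hypothetical, and the lower limit 124 and class width 4 are simply the values used in the height table of these slides.

```python
# Heights (cm) of the 120 twelve-year-old boys from the raw-data slide.
HEIGHTS = [
    142.3, 134.4, 150.3, 141.9, 143.5, 138.1, 142.9, 140.9, 134.7, 141.2,
    135.5, 140.2, 156.6, 148.8, 133.1, 140.7, 139.2, 140.2, 134.9, 141.4,
    138.5, 148.9, 144.4, 145.4, 142.7, 137.9, 142.7, 141.2, 144.7, 137.4,
    143.6, 160.9, 138.9, 154.0, 143.4, 142.4, 145.7, 151.3, 143.9, 141.5,
    139.3, 145.1, 142.3, 154.2, 137.7, 147.7, 137.4, 148.9, 138.2, 140.8,
    151.1, 148.8, 141.9, 145.8, 125.9, 137.9, 138.5, 152.3, 143.6, 146.7,
    141.6, 149.8, 144.0, 140.1, 147.8, 147.9, 132.7, 139.9, 139.6, 146.6,
    150.0, 139.2, 142.5, 145.2, 145.4, 150.6, 140.5, 150.8, 152.9, 149.7,
    143.5, 132.1, 143.3, 139.6, 130.5, 141.8, 146.2, 139.5, 138.9, 144.5,
    147.9, 147.5, 142.9, 145.9, 146.5, 142.4, 134.5, 146.8, 143.3, 146.4,
    134.7, 137.1, 141.8, 136.9, 129.4, 146.7, 149.0, 138.7, 148.8, 135.1,
    156.3, 143.8, 147.3, 147.1, 141.4, 148.1, 142.5, 144.0, 142.1, 139.9,
]


def frequency_table(values, lower=124.0, width=4.0, k=10):
    """Return one row per class interval [lower + i*width, lower + (i+1)*width).

    Each row is (interval start, frequency, relative frequency,
    cumulative frequency, cumulative relative frequency).
    """
    counts = [0] * k
    for x in values:
        i = int((x - lower) // width)  # index of the interval containing x
        if not 0 <= i < k:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[i] += 1

    n = len(values)
    rows = []
    cumulative = 0
    for i, f in enumerate(counts):
        cumulative += f
        rows.append((lower + i * width, f, f / n, cumulative, cumulative / n))
    return rows


if __name__ == "__main__":
    for start, f, rel, cum, cum_rel in frequency_table(HEIGHTS):
        print(f"{start:.0f}~  {f:3d}  {rel:.4f}  {cum:3d}  {cum_rel:.4f}")
```

Relative frequencies are computed from the exact counts, so each one is rounded independently when printed.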
[Histogram: frequency (0 to 40) against height in cm (124 to 164)]
Figure 2.1 Distribution of heights of 120 boys from China, 1997
Present data graphically
presenting data
visually
intuitively
easy to read and understand
self-explanatory
stand alone from text
Statistical tables and graphs are intended to communicate information, so they should be easy to read and understand.
The shape of the distribution is a characteristic of the variable.
Application
The shape of a distribution can lead to a research question, for example one that concerns whether the distribution is unimodal and symmetric.
Shape of frequency
distribution
Distribution
Unimodal/bimodal
Symmetry/skew
Unimodal/bimodal
Unimodal: homogeneous population
Bimodal: heterogeneous population
A unimodal distribution suggests that the definition of the population, or the classification, is appropriate.
SYMMETRY & SKEWNESS
Symmetric means the distribution has the same
shape on both sides of the peak location.
Skewness means the lack of symmetry in a
probability distribution.
(The Cambridge Dictionary of Statistics in the Medical
Sciences.)
An asymmetric distribution is called skew.
(Armitage: Statistical Methods in Medical Research.)
Figure 2.2 Symmetric And Asymmetric Distributions (panels: negative skewness, positive skewness)
Positive & Negative
Skewness
A distribution is said to have positive
skewness when it has a long thin tail at the
right, and to have negative skewness when it
has a long thin tail to the left.
A distribution in which the upper tail is longer
than the lower tail is called positively skewed.
[Histogram: frequency against Hg (umol/kg)]
Fig. The distribution of Hg (mercury) in the hair of 237 adults
[Histogram: frequency against QOL score (0 to 100)]
Fig. The distribution of QOL (quality of life) scores of 892 senior citizens
[Histogram: frequency against survival time (month)]
Fig. The distribution of survival times for 102 malignant melanoma patients
[Histogram: frequency against age at death (year)]
Fig. The distribution of ages at death of males in 1990~1992
Numerical methods
Central tendency: arithmetic mean, median, geometric mean
Dispersion: range, interquartile range, standard deviation, variance, coefficient of variation
Mean
Concept and notation
Calculation
Application
CONCEPT OF MEAN
Arithmetic mean, mean
Population mean μ
The sample mean is denoted by $\bar{x}$ ("x-bar").
CALCULATION OF MEAN
Given a data set of size n, {x1, x2, …, xn}, the mean is computed by summing all the x's and dividing the sum by n. Symbolically,
$\bar{x} = \frac{\sum x}{n}$
GROUPED DATA
The mean can be approximated using the formula
$\bar{x} = \frac{\sum f m}{n}$
where f denotes the frequency, m the interval midpoint, and the summation is across the intervals.
Midpoint
The midpoint for an interval is obtained by calculating the average of the interval's lower true boundary and upper true boundary.
For the intervals 124~, 128~, 132~:
The midpoint for the first interval is $\frac{124 + 128}{2} = 126$
The midpoint for the second interval is $\frac{128 + 132}{2} = 130$
Example 1
Class Interval     Frequency   m      fm
for Height (cm)    (f)
124~                 1         126     126
128~                 2         130     260
132~                10         134    1340
136~                22         138    3036
140~                37         142    5254
144~                26         146    3796
148~                15         150    2250
152~                 4         154     616
156~                 2         158     316
160~                 1         162     162
Total              120               17156
$\bar{x} = \frac{\sum f m}{n} = \frac{17156}{120} \approx 142.97$ (cm)
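The grouped-data calculation can be checked with a short Python sketch. The function name `grouped_mean` is hypothetical; the lower limits, the width of 4 and the frequencies are those of the height table.

```python
def grouped_mean(lower_limits, width, frequencies):
    """Approximate the mean from a frequency table as sum(f * m) / n,
    where m is the midpoint of each class interval."""
    n = sum(frequencies)
    total = 0.0
    for lower, f in zip(lower_limits, frequencies):
        midpoint = lower + width / 2  # average of lower and upper boundary
        total += f * midpoint
    return total / n


LOWER_LIMITS = [124, 128, 132, 136, 140, 144, 148, 152, 156, 160]
FREQUENCIES = [1, 2, 10, 22, 37, 26, 15, 4, 2, 1]

if __name__ == "__main__":
    print(round(grouped_mean(LOWER_LIMITS, 4, FREQUENCIES), 2))
```

Because every observation in an interval is replaced by the midpoint, the result is an approximation to the mean of the raw data, not the exact value.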
Average: Limitation in
describing data
It has been said that a fellow with
one leg frozen in ice and the other leg
in boiling water is comfortable
ON AVERAGE !
Geometric Mean: notation
The geometric mean is defined as the nth root of the product of n numbers, i.e., for a set of numbers {x1, x2, …, xn} it is the nth root of their product.
It is denoted by G or GM.
Geometric Mean: calculation
Following the definition, the expression is
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
For example, the G for 2, 4, 8 (n = 3) is
$G = \sqrt[3]{2 \times 4 \times 8} = 4$
Geometric mean:
$G = \sqrt[n]{x_1 x_2 \cdots x_n}$
With the transformation $X = \ln x$,
$\ln G = \frac{1}{n} \sum_{i=1}^{n} \ln x_i$
$G = \ln^{-1}(\ln G) = e^{\ln G}$
Geometric Mean: calculation
Example1_geo: a data set consists of the survival times to relapse, in weeks, of 21 acute leukemia patients who received some drug:
1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23 (n = 21)
The mean is 8.67 weeks.
With $X = \ln x$,
$\ln G = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} \ln x_i = 1.826$
$G = e^{1.826} \approx 6.21$ weeks
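The log-transform route to the geometric mean can be sketched in Python as follows. The function name `geometric_mean` is hypothetical; the data are the 21 survival times above. Computed without intermediate rounding, the result is about 6.2 weeks, clearly below the arithmetic mean of 8.67 weeks.

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms: exp of the mean of ln(x)."""
    if any(x <= 0 for x in values):
        raise ValueError("the geometric mean requires positive values")
    mean_log = sum(math.log(x) for x in values) / len(values)
    return math.exp(mean_log)


# Survival times to relapse (weeks) of the 21 leukemia patients.
WEEKS = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23]

if __name__ == "__main__":
    print(round(geometric_mean([2, 4, 8]), 6))    # small check case from the slides
    print(round(sum(WEEKS) / len(WEEKS), 2))      # arithmetic mean
    print(round(geometric_mean(WEEKS), 2))        # geometric mean
```

Working on the log scale avoids forming the product of all n values, which can overflow for large data sets.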
Ex. Serum HI (hemagglutination inhibition) antibody dilution from 107 subjects after measles vaccination
$G = \lg^{-1}\left(\frac{\sum f_i \lg X_i}{n}\right) = \lg^{-1}\left(\frac{165.2654}{107}\right) = 35.04$
Application
Positively skewed data, if a log transformation makes the distribution symmetric and unimodal.
Data that form a geometric series.
Median
Concept of median
Calculation
Application: advantages & disadvantages
Concept of Median
If the data are arranged in increasing or decreasing order, the median is the middle value, which divides the set into two equal halves.
M: sample median
cch:   17  19  31  39  48  56  68  73  73  75  80
rank:   1   2   3   4   5   6   7   8   9  10  11
Calculation: how do we get it?
Example 1: n = 11
cch:   17  19  31  39  48  56  68  73  73  75  80
rank:   1   2   3   4   5   6   7   8   9  10  11
a. When n is odd,
$M = X_{\frac{n+1}{2}}$
Here M = X_6 = 56.
Calculation: how do we get it?
Example 2: n = 12
cch:   17  19  31  39  48  56  68  73  73  75  80  122
rank:   1   2   3   4   5   6   7   8   9  10  11   12
b. When n is even,
$M = \frac{1}{2}\left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right)$
Here M = (56 + 68)/2 = 62.
Application: advantage
The median is robust to extreme values.
cch:   17  19  31  39  48  56  68  73  73  75  80  122     Mean = 58.42, Median = 62
cch:   17  19  31  39  48  56  68  73  73  75  80  1220    Mean = 149.9, Median = 62
rank:   1   2   3   4   5   6   7   8   9  10  11   12
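The two rules for the median, and its robustness to an extreme value, can be illustrated with a short Python sketch. The function name `median` and the variable names are hypothetical; the data are the cch values from these slides. For the 12 ordered values the two middle observations are 56 and 68.

```python
def median(values):
    """Middle value of the ordered data; for even n, the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # X_((n+1)/2) in 1-based ranks
    return (ordered[mid - 1] + ordered[mid]) / 2  # (X_(n/2) + X_(n/2+1)) / 2


cch_11 = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80]
cch_12 = cch_11 + [122]
cch_outlier = cch_11 + [1220]  # last value replaced by an extreme one

if __name__ == "__main__":
    print(median(cch_11))
    for data in (cch_12, cch_outlier):
        mean = sum(data) / len(data)
        print(round(mean, 2), median(data))
```

Replacing 122 by 1220 moves the mean from about 58.4 to about 149.9, while the median stays where it was.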
Application: when is it used?
Fig. A skewed distribution.
Data described by Median
Skew data
Normal distribution data
Ordinal data!!
For a normal distribution
Figure 3. The average height of basketball players.
Disadvantage of median
The precise magnitudes of most of the observations are not taken into account.
If two groups of observations are pooled, the median of the combined group cannot be expressed in terms of the medians of the two components, whereas the mean of the combined group can:
$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$
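A small Python sketch shows the contrast. The two groups below are hypothetical values invented for illustration, and `pooled_mean` is a hypothetical helper name; only the formula comes from the slide.

```python
import statistics


def pooled_mean(n1, mean1, n2, mean2):
    """Mean of the combined group from the two group sizes and group means."""
    return (n1 * mean1 + n2 * mean2) / (n1 + n2)


# Hypothetical data, for illustration only.
group_1 = [1, 2, 3]
group_2 = [4, 5, 6, 7, 100]
combined = group_1 + group_2

if __name__ == "__main__":
    m1, m2 = statistics.mean(group_1), statistics.mean(group_2)
    print(pooled_mean(len(group_1), m1, len(group_2), m2))  # from summaries only
    print(statistics.mean(combined))                        # from the raw data
    # The component medians alone do not determine the combined median.
    print(statistics.median(group_1), statistics.median(group_2),
          statistics.median(combined))
```

The pooled mean needs only the two sizes and the two means, whereas the combined median has to be recomputed from all the observations.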
Summary: Choosing the most appropriate measure
Symmetric, unimodal: mean
If a log transformation creates a symmetric, unimodal distribution: geometric mean
Distribution-free or uncertain data: median
Outliers or skewed data: median
Ordinal data: median
Measure of Dispersion
range,
interquartile range,
variance & standard deviation,
coefficient of variation
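Percentile (quantile)
The percentile $P_X$ divides the ordered data so that X% of the observations lie below it and (100-X)% lie above it.
Quartiles:
Lower (first) quartile: 25%, $Q_L = P_{25}$
Second quartile: 50%, the median
Upper (third) quartile: 75%, $Q_U = P_{75}$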
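Quartiles can be computed with Python's standard `statistics.quantiles`. Several interpolation conventions exist and the slides do not say which one is intended, so the `method="inclusive"` choice below is an assumption; other conventions can give slightly different quartiles for small samples. The data are the 11 cch values used in the median examples.

```python
import statistics

cch = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80]

# n=4 asks for the three cut points that split the data into quarters.
# method="inclusive" is an assumed convention, not one stated in the slides.
q_lower, q_median, q_upper = statistics.quantiles(cch, n=4, method="inclusive")
iqr = q_upper - q_lower  # interquartile range, Q_U - Q_L

if __name__ == "__main__":
    print(q_lower, q_median, q_upper, iqr)
```

The second quartile returned here coincides with the median of the same data.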
Measures Of Dispersion
Group A:  26  28  30  32  34
Group B:  24  27  30  33  36
Group C:  26  29  30  31  34
Range & Inter-quartile Range
$R = x_{max} - x_{min}$
$Q_U - Q_L = P_{75} - P_{25}$
[Diagram: percentile scale marking 0% ($P_0$), 25% ($Q_L = P_{25}$ = 8.51), 50% ($M = P_{50}$), 75% ($Q_U = P_{75}$ = 19.45) and 100% ($P_{100}$)]
The range and the interquartile range are simple and easy to explain. However, there are a few difficulties with the use of the range.
1. The value of the range is determined by only two of the original observations.
2. The interpretation of the range depends on the number of observations in a complicated way, which is an undesirable feature.
Variance $s^2$
An alternative approach is to make use of the deviations from the mean, $x - \bar{x}$; the greater the variation in the data set, the larger the magnitude of these deviations will tend to be.
From these deviations, the variance $s^2$ is computed by squaring each deviation, adding the squares, and dividing their sum by one less than n:
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
n - 1: degrees of freedom, df
Variance
A population variance is denoted by $\sigma^2$:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
A sample variance is denoted by $s^2$:
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
The following should be noted
It would be no use to take the mean of the deviations, because
$\sum (x - \bar{x}) = 0$
Taking the mean of the absolute values,
$\frac{\sum |x - \bar{x}|}{n}$
for example, is a possibility. However, this measure has the drawback of being difficult to handle mathematically.
Standard deviation, SD
The variance $s^2$ has units that are the square of the original units. For example, if x is the time in seconds, the variance is measured in seconds squared (sec²). So it is convenient to have a measure of variation expressed in the same units as the original data, and this can be done by taking the square root of the variance. This quantity is the standard deviation,
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Formula for Calculation
In general, the calculation using the mean is likely to cause some trouble. If the mean is not a round number, say 10/3, it will need to be rounded off, and errors arise in the subtraction of this figure from each x. This difficulty can be overcome by using the following shortcut formula for the variance or SD:
$s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1}$
Solution to calculation of s
$\sum x_i = 746.1$, $\sum x_i^2 = 50689.33$, n = 11
$s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}} = \sqrt{\frac{50689.33 - (746.1)^2 / 11}{10}} = 2.89$
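Both forms of the variance can be compared in a short Python sketch. The function names are hypothetical; Groups A and B are the example groups from the dispersion slides, and the sums 746.1 and 50689.33 with n = 11 are those of the worked solution.

```python
import math


def sample_variance(values):
    """Definitional form: sum of squared deviations from the mean over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def shortcut_variance(sum_x, sum_x2, n):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


group_a = [26, 28, 30, 32, 34]
group_b = [24, 27, 30, 33, 36]

if __name__ == "__main__":
    for name, g in (("A", group_a), ("B", group_b)):
        v_def = sample_variance(g)
        v_short = shortcut_variance(sum(g), sum(x * x for x in g), len(g))
        print(name, v_def, v_short, round(math.sqrt(v_def), 2))

    # Worked solution: only the two sums and n are needed.
    s = math.sqrt(shortcut_variance(746.1, 50689.33, 11))
    print(round(s, 2))
```

The shortcut form avoids subtracting a rounded mean from each observation, which was the concern in hand calculation; in floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the definitional form is often preferred in software.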
Example:
Group A:  26  28  30  32  34
Group B:  24  27  30  33  36
Group C:  26  29  30  31  34

           range   variance   sd     mean
Group A:    8      10.0       3.16   30
Group B:   12      22.5       4.74   30
Group C:    8       8.5       2.92   30
Coefficient Of Variation, CV
$CV = \frac{s}{\bar{X}} \times 100\%$
Requires a nonzero mean.
Used to compare the dispersion of different distributions:
for variables with different scales or units;
for variables with markedly different means.
Example: Comparing The Dispersion Of Two Variables
           mean          sd
Height:    166.06 (cm)   4.95 (cm)
Weight:    53.72 (kg)    4.96 (kg)
Height: $CV = \frac{4.95}{166.06} \times 100\% = 2.98\%$
Weight: $CV = \frac{4.96}{53.72} \times 100\% = 9.23\%$
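The comparison can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is hypothetical; the means and standard deviations are those given for height and weight.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: sd divided by the mean, times 100."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return sd / mean * 100


if __name__ == "__main__":
    print(round(coefficient_of_variation(4.95, 166.06), 2))  # height
    print(round(coefficient_of_variation(4.96, 53.72), 2))   # weight
```

Although the two standard deviations are almost equal numerically, they are in different units; relative to the mean, weight is about three times as variable as height.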
What do the variance and
SD tell us?
Large variance (or SD) means:
more variable, wider range,
lower degree of representativeness of mean.
Small variance (or SD) means:
less variable, narrower range,
higher degree of representativeness of mean.
Which measure should be
used?
sd, variance: for unimodal, symmetric distributions.
CV: for different units; for markedly different means.
Range: for any distribution; wasteful of information.
Interquartile range: for any distribution; robust; wasteful of information.
The subjects should be homogeneous!
Summary of Average and
dispersion
Mean±sd(min,max)
Median±interquartile range(min,max)
Using both average and dispersion.
SUMMARY
Each variable has its own distribution;
Description:
Using graphs
Using statistics
Average: mean, G, M
Dispersion: sd, variance, Q, CV, R
Choosing an appropriate measure;
Using average with dispersion.
DATA SUMMARIZATION
Tabular and graphical methods
Frequency table
histogram
Numerical methods -Using statistics
measures of location: arithmetic mean, median, geometric mean
measures of dispersion: range, interquartile range (IQR), standard deviation, variance, coefficient of variation