Statistical Description (1)
Xiaojin Yu
Introductory biostatistics
http://www.hstathome.com/tjziyuan/Introductory%20Biostatistics%20Le%20C.T.%20%20(Wiley,%202003)(T)(551s).pdf
Introductory biostatistics for the health science
http://faculty.ksu.edu.sa/hisham/Documents/eBooks/Introductory_Biostatistics_for_the_Health.pdf
Review


What is medical statistics about?
Key terms in statistics

1.2 Key words in statistics
Population (individual) & sample
Variation & random variable
Random variable & data
Statistic & parameter
Sampling error
Probability
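Framework of statistical analysis
Population: individuals, variation; parameters unknown
Random sampling: the sample should be representative; it is subject to sampling error
Sample: statistics known
Statistical description summarizes the sample
Statistical inference, based on probability, goes from the sample statistics to the population parameters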
Statistical Description
CONTENTS
For quantitative (numerical) data
Frequency distribution
Measures of central tendency
Measures of dispersion
For qualitative (categorical) data
Raw Data (quantitative)
Example: 120 values of height (cm) for 12-year-old boys in 1997:
142.3  134.4  150.3  141.9  143.5  138.1  142.9  140.9  134.7  141.2
135.5  140.2  156.6  148.8  133.1  140.7  139.2  140.2  134.9  141.4
138.5  148.9  144.4  145.4  142.7  137.9  142.7  141.2  144.7  137.4
143.6  160.9  138.9  154.0  143.4  142.4  145.7  151.3  143.9  141.5
139.3  145.1  142.3  154.2  137.7  147.7  137.4  148.9  138.2  140.8
151.1  148.8  141.9  145.8  125.9  137.9  138.5  152.3  143.6  146.7
141.6  149.8  144.0  140.1  147.8  147.9  132.7  139.9  139.6  146.6
150.0  139.2  142.5  145.2  145.4  150.6  140.5  150.8  152.9  149.7
143.5  132.1  143.3  139.6  130.5  141.8  146.2  139.5  138.9  144.5
147.9  147.5  142.9  145.9  146.5  142.4  134.5  146.8  143.3  146.4
134.7  137.1  141.8  136.9  129.4  146.7  149.0  138.7  148.8  135.1
156.3  143.8  147.3  147.1  141.4  148.1  142.5  144.0  142.1  139.9
Data Summary
For continuous variable data
Numerical methods
Description of central tendency
Description of dispersion
Tabular and graphical methods
Tabular & Graphical Methods


Frequency table
Histogram
FREQUENCY TABLE
Class interval    Frequency   Relative
for height (cm)   (f)         frequency
124~                1         0.0083
128~                2         0.0167
132~               10         0.0833
136~               22         0.1834
140~               37         0.3083
144~               26         0.2167
148~               15         0.1250
152~                4         0.0333
156~                2         0.0167
160~                1         0.0083
Total             120         1.0000
SOLUTION TO EXAMPLE
1. Choose the number of intervals: k = 10
2. Calculate the width:
R = Xmax - Xmin = 160.9 - 125.9 = 35
w = R/k = 35/10 = 3.5, rounded up to a convenient width of 4 cm
3. Form the intervals
4. Count the frequencies
A recommended step is to present the proportion or relative frequency.
Class intervals
Class intervals for height (cm): 124~, 128~, 132~, 136~, 140~, 144~, 148~, 152~, 156~, 160~
(124~ denotes the interval from 124 up to, but not including, 128.)
Tally and Counting
(Tally marks: each 正 stands for five strokes; 一 is one stroke and T is two strokes.)
Class interval    Tally mark          Frequency
for height (cm)                       (f)
124~              一                    1
128~              T                     2
132~              正正                  10
136~              正正正正T             22
140~              正正正正正正正T       37
144~              正正正正正一          26
148~              正正正                15
152~              (four strokes)        4
156~              T                     2
160~              一                    1
Total                                 120
Final Frequency Table
Class interval    Frequency   Relative    Cumulative   Cumulative
for height (cm)   (f)         frequency   frequency    relative frequency
124~                1         0.0083        1          0.0083
128~                2         0.0167        3          0.0250
132~               10         0.0833       13          0.1083
136~               22         0.1834       35          0.2917
140~               37         0.3083       72          0.6000
144~               26         0.2167       98          0.8167
148~               15         0.1250      113          0.9416
152~                4         0.0333      117          0.9750
156~                2         0.0167      119          0.9916
160~                1         0.0083      120          1.0000
Total             120         1.0000
A recommended step is to present the proportion or relative frequency:
$\text{relative frequency} = \frac{\text{frequency within a certain interval}}{\text{total number of observations}} \times 100\%$
Basic Steps to Form a Frequency Table
Step 1: determine the number of intervals (usually 5-15)
Step 2: calculate the width of the intervals
Step 3: form the intervals, each covering a certain range of values
Step 4: count the number of observations within each interval
The final table consists of the intervals and the frequencies.
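The four steps can be sketched in Python using the 120 heights from the raw-data slide. This is a minimal illustration, not the lecturer's own procedure: the variable names are hypothetical, and the interval width of 4 cm and the starting point of 124 cm are taken from the frequency table rather than computed.

```python
# Heights (cm) of the 120 boys, copied from the raw-data slide.
heights = [
    142.3, 134.4, 150.3, 141.9, 143.5, 138.1, 142.9, 140.9, 134.7, 141.2,
    135.5, 140.2, 156.6, 148.8, 133.1, 140.7, 139.2, 140.2, 134.9, 141.4,
    138.5, 148.9, 144.4, 145.4, 142.7, 137.9, 142.7, 141.2, 144.7, 137.4,
    143.6, 160.9, 138.9, 154.0, 143.4, 142.4, 145.7, 151.3, 143.9, 141.5,
    139.3, 145.1, 142.3, 154.2, 137.7, 147.7, 137.4, 148.9, 138.2, 140.8,
    151.1, 148.8, 141.9, 145.8, 125.9, 137.9, 138.5, 152.3, 143.6, 146.7,
    141.6, 149.8, 144.0, 140.1, 147.8, 147.9, 132.7, 139.9, 139.6, 146.6,
    150.0, 139.2, 142.5, 145.2, 145.4, 150.6, 140.5, 150.8, 152.9, 149.7,
    143.5, 132.1, 143.3, 139.6, 130.5, 141.8, 146.2, 139.5, 138.9, 144.5,
    147.9, 147.5, 142.9, 145.9, 146.5, 142.4, 134.5, 146.8, 143.3, 146.4,
    134.7, 137.1, 141.8, 136.9, 129.4, 146.7, 149.0, 138.7, 148.8, 135.1,
    156.3, 143.8, 147.3, 147.1, 141.4, 148.1, 142.5, 144.0, 142.1, 139.9,
]

k = 10                                   # step 1: number of intervals
data_range = max(heights) - min(heights)  # step 2: R = Xmax - Xmin
raw_width = data_range / k                # about 3.5 before rounding
width, start = 4, 124                     # assumption: values used in the table

# Steps 3 and 4: interval i covers [start + i*width, start + (i+1)*width).
counts = [0] * k
for x in heights:
    counts[int((x - start) // width)] += 1

# Relative and cumulative columns of the final table.
n = len(heights)
cumulative = 0
for i, f in enumerate(counts):
    cumulative += f
    print(f"{start + i * width}~  f={f:3d}  rel={f / n:.4f}  "
          f"cum={cumulative:3d}  cum_rel={cumulative / n:.4f}")
```

Each relative frequency here is rounded independently, so 22/120 prints as 0.1833, whereas the table shows 0.1834 so that the column sums to 1.0000.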
Figure 2.1 Distribution of heights of 120 boys from China, 1997 (histogram: frequency, 0 to 40, against height, 124 to 164 cm)
Present data graphically
Presenting data:
visually
intuitively
easy to read and understand
self-explanatory
standing alone from the text
Statistical tables and graphs are intended to communicate information, so they should be easy to read and understand.
The shape of the distribution is a characteristic of the variable.
Application
A frequency distribution can lead to a research question.
It shows whether the distribution is unimodal and symmetric.
Shape of frequency distribution
Unimodal / bimodal
Symmetry / skew
Unimodal/bimodal
Homogeneous / heterogeneous
The definition of the population or the classification is appropriate.
SYMMETRY & SKEWNESS
Symmetric means the distribution has the same shape on both sides of the peak.
Skewness means the lack of symmetry in a probability distribution.
(The Cambridge Dictionary of Statistics in the Medical Sciences.)
An asymmetric distribution is called skew.
(Armitage: Statistical Methods in Medical Research.)
Figure 2.2 Symmetric and asymmetric distributions (panels: negative skewness, positive skewness)
Positive & Negative Skewness
A distribution is said to have positive skewness when it has a long thin tail at the right, and to have negative skewness when it has a long thin tail at the left.
A distribution in which the upper tail is longer than the lower would be called positively skew.
Fig. The distribution of Hg (mercury) content in the hair of 237 adults (histogram: frequency, 0 to 70, against Hg in umol/kg, 1 to 21)
Fig. The distribution of QOL (quality of life) scores of 892 senior citizens (histogram: frequency, 0 to 400, against QOL score, 0 to 100)
Fig. The distribution of survival times for 102 malignant melanoma patients (histogram: frequency, 0 to 40, against survival time in months, 1 to 45)
Fig. The distribution of ages at death of males in 1990~1992 (histogram: frequency, 0 to 2500, against age at death in years, 0 to 85)
Numerical methods
Central tendency: arithmetic mean, median, geometric mean
Dispersion: range, interquartile range, standard deviation, variance, coefficient of variation
Mean



Concept and notation
Calculation
Application
CONCEPT OF MEAN
Arithmetic mean, or simply the mean
The population mean is denoted by μ.
The sample mean is denoted by $\bar{x}$ ("x-bar").
CALCULATION OF MEAN
Given a data set of size n, {x1, x2, …, xn}, the mean is computed by summing all the x's and dividing the sum by n. Symbolically,
$\bar{x} = \frac{\sum x}{n}$
GROUPED DATA
The mean can be approximated using the formula
$\bar{x} = \frac{\sum fm}{n}$
where f denotes the frequency, m the interval midpoint, and the summation is across the intervals.
Midpoint
The midpoint of an interval is obtained by calculating the average of the interval's lower true boundary and upper true boundary.
The midpoint of the first interval (124~) is
$\frac{124 + 128}{2} = 126$
The midpoint of the second interval (128~) is
$\frac{128 + 132}{2} = 130$
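Example 1
Class interval    Frequency   Midpoint
for height (cm)   (f)         (m)        fm
124~                1         126         126
128~                2         130         260
132~               10         134        1340
136~               22         138        3036
140~               37         142        5254
144~               26         146        3796
148~               15         150        2250
152~                4         154         616
156~                2         158         316
160~                1         162         162
Total             120                   17156
$\bar{x} = \frac{\sum fm}{n} = \frac{17156}{120} = 142.97$ (cm)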
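The grouped-data approximation for Example 1 can be checked with a few lines of Python. This is only a sketch; the list names are hypothetical, and the frequencies and midpoints are those of the table.

```python
# Frequencies and interval midpoints from the Example 1 table.
freqs = [1, 2, 10, 22, 37, 26, 15, 4, 2, 1]
midpoints = [126, 130, 134, 138, 142, 146, 150, 154, 158, 162]

# fm column: frequency times midpoint for each interval.
fm = [f * m for f, m in zip(freqs, midpoints)]

n = sum(freqs)                 # total number of observations
grouped_mean = sum(fm) / n     # approximate mean from grouped data

print("fm column:", fm)
print("sum of fm:", sum(fm))
print("grouped mean (cm):", round(grouped_mean, 2))
```

The result is an approximation because every observation in an interval is treated as if it sat at the interval midpoint.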
Average: Limitation in
describing data
It has been said that a fellow with
one leg frozen in ice and the other leg
in boiling water is comfortable
ON AVERAGE !
Geometric Mean-notation
The geometric mean is defined as the nth root of the product of n numbers, i.e., of a set of numbers x1, x2, …, xn.
It is denoted by G or GM.
Geometric Mean-calculation
Following the definition, the expression is
$G = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$
For example, the G for 2, 4, 8 (n=3) is
$G = \sqrt[3]{2 \times 4 \times 8} = 4$
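Geometric mean:
$G = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$
Taking logarithms, the mean of the $\ln x_i$ is
$\bar{X}_{\ln x} = \ln G = \frac{1}{n}\sum_{i=1}^{n} \ln x_i$
$G = \ln^{-1}(\ln G) = e^{\ln G}$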
Geometric Mean-calculation
Example1_geo: Given a data set consisting of survival times to relapse, in weeks, of 21 acute leukemia patients who received some drug:
1,1,2,2,3,4,4,5,5,8,8,8,8,11,11,12,12,15,17,22,23 (n=21)
The mean is 8.67 weeks.
$\bar{X}_{\ln x} = \ln G = \frac{\sum \ln x}{n} = 1.826$
$G = e^{1.826} = 6.21$
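The log-transform route to the geometric mean can be sketched in Python with the 21 survival times listed above. The variable names are hypothetical; only the standard `math` module is used.

```python
import math

# Survival times to relapse (weeks) for the 21 leukemia patients.
weeks = [1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8,
         11, 11, 12, 12, 15, 17, 22, 23]

n = len(weeks)
arith_mean = sum(weeks) / n                      # ordinary mean

# Mean of the natural logs, then back-transform with exp().
mean_log = sum(math.log(x) for x in weeks) / n   # ln G
geo_mean = math.exp(mean_log)                    # G = e^(ln G)

print("arithmetic mean:", round(arith_mean, 2))
print("mean of ln x:", round(mean_log, 3))
print("geometric mean:", round(geo_mean, 2))
```

The geometric mean comes out at roughly 6.2 weeks, well below the arithmetic mean of 8.67 weeks, which is what one expects for positively skewed data.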
Ex. Serum HI (hemagglutination inhibition) antibody dilution from 107 subjects after measles vaccination
$G = \lg^{-1}\left(\frac{\sum f_i \lg X_i}{n}\right) = \lg^{-1}\left(\frac{165.2654}{107}\right) = 35.04$
Application
Positively skewed data, if a log transformation makes the distribution symmetric and unimodal
Geometric series
Median
Concept of median
Calculation
Application: disadvantages & advantages
Concept of Median
If the data are arranged in increasing or decreasing order, the median is the middle value, which divides the set into two equal halves.
M: sample median
cch    17  19  31  39  48  56  68  73  73  75  80
rank    1   2   3   4   5   6   7   8   9  10  11
Calculation-how do we get it?
Example1 n=11
cch    17  19  31  39  48  56  68  73  73  75  80
rank    1   2   3   4   5   6   7   8   9  10  11
M=56
a. When n is odd,
$M = X_{\frac{n+1}{2}}$
Calculation-how do we get it?
Example2 n=12
cch    17  19  31  39  48  56  68  73  73  75  80  122
rank    1   2   3   4   5   6   7   8   9  10  11   12
M=(56+68)/2=62
b. When n is even,
$M = \frac{1}{2}\left(X_{\frac{n}{2}} + X_{\frac{n}{2}+1}\right)$
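The odd and even rules can be written as one small Python function. This is a sketch with a hypothetical function name, `median_by_rank`; ranks in the formulas are 1-based while Python indexes from 0. For the 12 listed values the 6th and 7th ordered observations are 56 and 68, so the rule gives 62.

```python
def median_by_rank(values):
    """Median via the rank rules: middle value if n is odd,
    mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Rank (n+1)/2 corresponds to zero-based index n//2.
        return ordered[mid]
    # Ranks n/2 and n/2+1 correspond to indices mid-1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


cch_11 = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80]   # Example1, n=11
cch_12 = cch_11 + [122]                                  # Example2, n=12

print(median_by_rank(cch_11))
print(median_by_rank(cch_12))
```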
Application-Advantage
The median is robust to extreme values.
cch    17  19  31  39  48  56  68  73  73  75  80  1220
rank    1   2   3   4   5   6   7   8   9  10  11    12
With the largest value 122: Mean=58.42
With the largest value 1220: mean=149.9
Median=62 in both cases
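The effect of one extreme value on the mean and on the median can be checked with Python's standard `statistics` module. The two lists below are the 12 values from the slides, once with 122 and once with 1220 as the largest observation; the variable names are hypothetical.

```python
from statistics import mean, median

# The same 12 observations, differing only in the largest value.
with_122 = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 122]
with_1220 = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80, 1220]

for label, data in (("122", with_122), ("1220", with_1220)):
    print(f"largest value {label}: "
          f"mean={mean(data):.2f}, median={median(data)}")
```

The mean moves from about 58.42 to about 149.92, while the median, which depends only on the 6th and 7th ordered values (56 and 68), does not change.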
Application-when is it used?
Fig. A skew distribution.
Data described by Median
Skew data
Normal distribution data
Ordinal data!!
For normal distribution
Figure 3 The average height of basketball players.
Disadvantage of median
The precise magnitude of most of the observations is not taken into account.
If two groups of observations are pooled, the median of the combined group cannot be expressed in terms of the medians of the two component groups. The mean of the combined group, by contrast, can:
$\bar{X} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$
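A small Python check of this point, using two made-up groups (the numbers are hypothetical and chosen only for illustration): the pooled mean is recovered from the group sizes and group means, whereas the pooled median cannot be obtained from the two group medians alone.

```python
from statistics import mean, median

# Hypothetical groups, for illustration only.
group1 = [1, 2, 3]
group2 = [10, 20, 30, 40, 50]
n1, n2 = len(group1), len(group2)

# Pooled mean from the summary values n1, n2 and the two group means.
pooled_mean = (n1 * mean(group1) + n2 * mean(group2)) / (n1 + n2)

combined = group1 + group2
print("pooled mean from formula:", pooled_mean)
print("mean of combined data:   ", mean(combined))
print("group medians:", median(group1), median(group2))
print("median of combined data:", median(combined))
```

Here the group medians are 2 and 30, but the median of the combined data is 15, a value that depends on the individual observations rather than on the two medians.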
Summary: Choosing the most appropriate measure
symmetric, unimodal: mean
if log transformation creates symmetric, unimodal: geometric mean
distribution free, uncertain data: median
outlier or skewed data: median
ordinal data: median
Measure of Dispersion
range,
interquartile range,
variance & standard deviation,
coefficient of variation
Percentile (quantile)
The percentile PX divides the ordered data so that X% of the observations lie below it and (100-X)% lie above it.
Quartiles:
Lower (first) quartile: 25%, (QL) P25
Second quartile: median
Upper (third) quartile: 75%, (QU) P75
Measures Of Dispersion
Group A   26  28  30  32  34
Group B   24  27  30  33  36
Group C   26  29  30  31  34
Range & Inter-quartile Range
R = xmax - xmin
QU - QL = P75 - P25
[Diagram: percentile scale from P0 (0%) to P100 (100%), marking QL = P25 = 8.51 (25%), M = P50 (50%) and QU = P75 = 19.45 (75%)]
Obviously, the range and the interquartile range are simple and easy to explain.
However, there are a few difficulties with the use of the range.
1. The value of the range is determined by only two of the original observations.
2. The interpretation of the range depends on the number of observations in a complicated way, which is an undesirable feature.
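As an illustration, the range and interquartile range can be computed in Python for the 11 cch values used earlier for the median. Reusing that data set here is an assumption made for the example, and the variable names are hypothetical. Quartile conventions differ between textbooks and software; `statistics.quantiles` with its default "exclusive" method places the quartiles at ranks (n+1)/4, (n+1)/2 and 3(n+1)/4.

```python
from statistics import quantiles

cch = [17, 19, 31, 39, 48, 56, 68, 73, 73, 75, 80]   # n = 11

data_range = max(cch) - min(cch)      # R = xmax - xmin

# Quartiles P25, P50, P75 with the default "exclusive" method.
q_l, m, q_u = quantiles(cch, n=4)
iqr = q_u - q_l                       # QU - QL

print("range:", data_range)
print("QL, M, QU:", q_l, m, q_u)
print("interquartile range:", iqr)
```

The range uses only the two extreme observations (17 and 80), while the interquartile range describes the spread of the middle half of the data.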
Variance $s^2$
An alternative approach is to make use of the deviations from the mean, $x - \bar{x}$; the greater the variation in the data set, the larger the magnitude of these deviations will tend to be.
From these deviations, the variance $s^2$ is computed by squaring each deviation, adding the squares, and dividing their sum by one less than n:
$s^2 = \frac{\sum (X - \bar{X})^2}{n-1}$
n-1: degrees of freedom, df
Variance
A population variance is denoted by $\sigma^2$,
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
A sample variance is denoted by $s^2$,
$s^2 = \frac{\sum (X - \bar{X})^2}{n-1}$
The following should be noted
It would be no use to take the mean of the deviations, because
$\sum (x - \bar{x}) = 0$
Taking the mean of the absolute values,
$\frac{\sum |x - \bar{x}|}{n}$
for example, is a possibility. However, this measure has the drawback of being difficult to handle mathematically.
Standard deviation, SD
The variance $s^2$ has units that are the square of the original units. For example, if x is the time in seconds, the variance is measured in seconds squared (sec^2). So it is convenient to have a measure of variation expressed in the same units as the original data, and this can be done by taking the square root of the variance. This quantity is the standard deviation,
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}}$
Formula for Calculation
In general, the calculation using the mean is likely to cause some trouble. If the mean is not a round number, say 10/3, it will need to be rounded off, and errors arise in the subtraction of this figure from each x. This difficulty can be overcome by using the following shortcut formula for the SD (its square gives the variance):
$s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n-1}}$
Solution to calculation of s
$\sum x_i = 746.1$, $\sum x_i^2 = 50689.33$, n = 11
$s = \sqrt{\frac{\sum x_i^2 - (\sum x_i)^2 / n}{n-1}} = \sqrt{\frac{50689.33 - (746.1)^2 / 11}{10}} = 2.89$
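The shortcut calculation can be reproduced in Python from the two sums alone. This is a sketch; `sd_from_sums` is a hypothetical helper name, and the second call simply checks the shortcut against the definitional formula on the values 26, 28, 30, 32, 34.

```python
import math
from statistics import stdev


def sd_from_sums(sum_x, sum_x2, n):
    """Sample SD from sum(x), sum(x^2) and n (shortcut formula)."""
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


# Sums quoted in the worked solution: n = 11 observations.
s = sd_from_sums(746.1, 50689.33, 11)
print("s =", round(s, 2))

# Cross-check on a small data set against the definitional formula.
values = [26, 28, 30, 32, 34]
shortcut = sd_from_sums(sum(values), sum(v * v for v in values), len(values))
print(round(shortcut, 4), round(stdev(values), 4))
```

The shortcut avoids subtracting a rounded mean from every observation, although with floating-point arithmetic it can lose precision when the mean is very large relative to the spread.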
Example:
Group A   26  28  30  32  34
Group B   24  27  30  33  36
Group C   26  29  30  31  34
           range   variance   sd     mean
Group A:   8       10.0       3.16   30
Group B:   12      22.5       4.74   30
Group C:   8       8.5        2.92   30
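The summary rows for the three groups can be recomputed with Python's `statistics` module. This is a sketch under one stated assumption: Group C is taken as 26, 29, 30, 31, 34, the values consistent with the tabulated range of 8, variance of 8.5 and mean of 30.

```python
from statistics import mean, stdev, variance

groups = {
    "A": [26, 28, 30, 32, 34],
    "B": [24, 27, 30, 33, 36],
    "C": [26, 29, 30, 31, 34],   # assumption: values matching the table
}

summary = {}
for name, values in groups.items():
    summary[name] = (
        max(values) - min(values),    # range
        variance(values),             # sample variance, divisor n-1
        round(stdev(values), 2),      # sample standard deviation
        mean(values),                 # arithmetic mean
    )
    print(f"Group {name}:", summary[name])
```

All three groups share the same mean, so only the dispersion measures distinguish them. Groups A and C also share the same range, and it is the variance and SD that show C to be less spread out.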
Coefficient Of Variation, CV
$CV = \frac{s}{\bar{X}} \times 100\%$
Requires a nonzero mean.
Used to compare the dispersion of different distributions:
for variables with different scales or units;
for variables with very different means.
Example: Comparing The Dispersion Of Two Variables
           mean          sd
Height:    166.06 (cm)   4.95 (cm)
Weight:    53.72 (kg)    4.96 (kg)
height: $CV = \frac{4.95}{166.06} \times 100\% = 2.98\%$
weight: $CV = \frac{4.96}{53.72} \times 100\% = 9.23\%$
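The two coefficients of variation can be verified with a short Python function. `cv_percent` is a hypothetical helper name; the means and standard deviations are those given in the example.

```python
def cv_percent(sd, mean):
    """Coefficient of variation as a percentage; the mean must be nonzero."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return sd / mean * 100


height_cv = cv_percent(4.95, 166.06)   # height in cm
weight_cv = cv_percent(4.96, 53.72)    # weight in kg

print(f"height CV = {height_cv:.2f}%")
print(f"weight CV = {weight_cv:.2f}%")
```

Although the two standard deviations are almost equal numerically, they are in different units, and the CV shows that weight varies about three times as much as height relative to its mean.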
What do the variance and SD tell us?
Large variance (or SD) means:
more variable, wider range,
lower degree of representativeness of the mean.
Small variance (or SD) means:
less variable, narrower range,
higher degree of representativeness of the mean.
Which measure should be used?
sd, variance: for unimodal, symmetric distributions.
CV: for different units; for very different means.
Range: for any distribution; wasteful of information.
Interquartile range: for any distribution; robust; wasteful of information.
The subjects should be homogeneous!
Summary of Average and dispersion
Mean±sd (min, max)
Median±interquartile range (min, max)
Use both an average and a measure of dispersion.
SUMMARY
Each variable has its own distribution;
Descriptive:
Using graphs
Using statistics
Average: Mean, G, M
Dispersion: sd, variance, Q, CV, R
Choosing an appropriate measure;
Using an average together with a measure of dispersion.
DATA SUMMARIZATION
Tabular and graphical methods: frequency table, histogram
Numerical methods (using statistics)
measures of location: arithmetic mean, median, geometric mean
measures of dispersion: range, interquartile range (IQR), standard deviation, variance, coefficient of variation