Chapter 2
Descriptive statistics for
quantitative data
定量资料的描述性统计分析
review
Types of data
• Numerical data:
--- continuous
--- discrete
• Categorical data:
--- nominal
--- ordinal
review
Statistics :
It is a branch of applied mathematics
that refers to the collection and
interpretation of data, and evaluation of
the reliability of the conclusions based
on the data.
Types of statistical analysis
• Descriptive analysis:
--- Data collection
--- Data interpretation
• Inferential analysis:
--- Evaluate the reliability of the conclusions
Contents
• Frequency distribution ★
• Central tendency ★
• Dispersion (measures of variability) ★
• Tables and graphs
New words
• Frequency
频数
• Proportion
比例
• Percentage
百分数
• Histogram
直方图
• Polygon
折线图
• Distribution
分布
• Frequency distribution
频数分布
• Cumulative frequency
累积频数
• Cumulative proportion
累积比例
• Central tendency
集中趋势
• Dispersion
离散程度
• Mean
均数
• Arithmetic mean
算术均数
• Geometric mean
几何均数
• Median
中位数
• Mode
众数
• Skewness
偏度
• Kurtosis
峰度
• Descriptive analysis
描述分析
• Inferential analysis
推断分析
1. Frequency distribution
Data:
Id  sex  age
1   m    6
2   m    8
3   f    13
4   m    16
5   f    16
6   f    15
7   f    23
8   m    19
9   f    25
10  f    21
11  m    13
12  f    19
13  f    9
14  f    10
15  f    14
Frequency (频数):
For a given variable, the number
of times a value occurs is called
its frequency.
Frequency table of sex
Sex  Label   Frequency
m    Male    5
f    Female  10
Proportion or percent (比例或百分数):
The ratio of a frequency to the total frequency; a percentage is this ratio multiplied by 100.
Frequency table of sex
Sex    Label   Frequency  Percentage (%)
-------------------------------------------------
m      Male    5          33.33
f      Female  10         66.67
-------------------------------------------------
Total  m+f     15         100.00
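The frequency table above can be reproduced with a few lines of Python. This is only a sketch: the list `sex` is typed in from the data table, and the helper name `frequency_table` is made up for illustration.

```python
from collections import Counter

# Sex of the 15 subjects, in Id order, copied from the data table.
sex = ["m", "m", "f", "m", "f", "f", "f", "m", "f", "f", "m", "f", "f", "f", "f"]

def frequency_table(values):
    """Return {value: (frequency, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: (c, round(100 * c / total, 2)) for v, c in counts.items()}

table = frequency_table(sex)
for value, (freq, pct) in sorted(table.items()):
    print(f"{value}  {freq:>3}  {pct:6.2f}")
```

The same helper works for any categorical variable, for example the grouped eyesight classes.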
Freq distribution of sex
Frequency distribution:
A table or a graph that lists all the distinct values in a variable together with the frequency and proportion with which each of these values occurs.
Sex  Frequency  Percentage
m    5          33.33
f    10         66.67
[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency]
Method of displaying frequency
distribution of categorical data
1. Nominal data
2. Ordinal data
Freq distribution of nominal data
Freq distribution of sex
Sex  Frequency  Percentage
m    5          33.33
f    10         66.67
[Figure: bar chart "Frequency distribution of sex"; x-axis: Sex (male, female); y-axis: Frequency]
Data:
Id  sex  eyesight  age
1   m    1         6
2   m    2         8
3   f    3         13
4   m    3         16
5   f    4         16
6   f    4         15
7   f    5         23
8   m    6         19
9   f    6         25
10  f    6         21
11  m    7         13
12  f    7         19
13  f    8         9
14  f    9         10
15  f    9         14
Freq distribution of ordinal data
Freq distribution of eyesight
Eyesight  Frequency  Percentage
1-3       4          26.67
4-6       6          40.00
7-9       5          33.33
[Figure: bar chart "Frequency distribution of eyesight"; x-axis: Eyesight (1-3, 4-6, 7-9); y-axis: Frequency]
Method of displaying frequency
distribution of numerical data
• first divide the whole range into several non-overlapping subintervals,
• count how many observations lie in each subinterval to make a frequency table,
• take the midpoint of each subinterval as the x-axis label and draw a histogram (直方图) or a polygon (折线图).
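The first two steps above can be sketched in Python as follows. The function name `grouped_frequencies` and its parameters are hypothetical, and the choice of three subintervals of width 10 is an assumption taken from the age example in these slides.

```python
# Ages of the 15 subjects, copied from the data table.
ages = [6, 8, 13, 16, 16, 15, 23, 19, 25, 21, 13, 19, 9, 10, 14]

def grouped_frequencies(values, lower, width, n_bins):
    """Count values in n_bins equal-width subintervals starting at `lower`.

    Each subinterval is closed on the left and open on the right, except
    the last one, which also includes its upper end point.
    Returns a list of (midpoint, frequency) pairs.
    """
    upper = lower + width * n_bins
    counts = [0] * n_bins
    for v in values:
        if not lower <= v <= upper:
            raise ValueError(f"{v} lies outside [{lower}, {upper}]")
        # min(...) puts a value equal to the upper end point into the last bin.
        index = min(int((v - lower) // width), n_bins - 1)
        counts[index] += 1
    return [(lower + width * (i + 0.5), c) for i, c in enumerate(counts)]

age_table = grouped_frequencies(ages, lower=0, width=10, n_bins=3)
for midpoint, freq in age_table:
    print(midpoint, freq)
```

The midpoints returned here are the x-axis labels for the histogram or polygon in the third step.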
Freq distribution of numerical data
Freq distribution of age
Subintervals: [0-10), [10-20), [20-30]
Age    midpoint  Frequency
0~     5         3
10~    15        9
20~30  25        3
[Figure: histogram "Frequency distribution of age"; x-axis: Age (midpoints 5, 15, 25); y-axis: Frequency]
Histogram and polygon
[Figure: two panels for age. Histogram: "Frequency distribution of age"; polygon: "Frequency polygon for age"; x-axis: Age; y-axis: Frequency]
[Figure: summary of displays by data type. Nominal data: bar chart of sex; ordinal data: bar chart of eyesight; numerical data: histogram and frequency polygon of age]
Cumulative frequency and
cumulative proportion
Cumulative frequency (累计频数): the sum of the frequencies from the lowest category up to and including a given category.
Cumulative proportion (累计比例): the sum of the proportions from the lowest category up to and including a given category.
Frequency table of age
Age    midpoint  Frequency  Proportion (%)  Cumulative frequency  Cumulative proportion (%)
0-10   5         3          20.0            3                     20.0
10-20  15        9          60.0            12                    80.0
20-30  25        3          20.0            15                    100.0
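As a small illustration, the cumulative columns of the age table can be computed with a running sum. The variable names are chosen here for the example only.

```python
from itertools import accumulate

frequencies = [3, 9, 3]  # age classes 0-10, 10-20, 20-30
total = sum(frequencies)

# Running sum of the frequencies, from the lowest class upwards.
cumulative_frequency = list(accumulate(frequencies))
# The same running sum expressed as a percentage of the total.
cumulative_percent = [round(100 * c / total, 1) for c in cumulative_frequency]

print(cumulative_frequency)
print(cumulative_percent)
```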
The plot of cumulative frequency
and cumulative proportion
The major measures of the characteristics
of observations for a numerical variable
• Central tendency (集中趋势)
• Dispersion (离散程度)
[Figure: histogram "Frequency distribution of red blood cells"; x-axis: Red blood cells (classes 420- to 640- in steps of 20); y-axis: Frequency (0 to 30)]
2. Central tendency
Central tendency(集中趋势):
The description of the concentration near
the middle of the range of all values in a
variable.
The major measures of central
tendency are: mean, median, mode.
The mean
The mean (均数):
It is a measure of the average level of all observations in a variable. It is defined as follows:
population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} X_i$
sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
--------- Arithmetic mean (算术均数)
Eg1a: Estimate the mean
The data listed below are haemoglobin contents (g/L) (血色素); estimate the mean.
Data:
id  1    2    3    4    5    6    7    8    9    10   11   12
x   121  118  130  120  122  118  116  124  127  129  125  132
Solution:
n = 12
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
= (121+118+…+125+132)/12
= 123.5
So, the estimated mean of the haemoglobin is 123.5 g/L.
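The calculation in Eg1a can be checked with a short Python sketch. The helper name `arithmetic_mean` is illustrative.

```python
# Haemoglobin contents (g/L) of the 12 subjects in Eg1a.
haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]

def arithmetic_mean(values):
    """Sum of the observations divided by their number."""
    return sum(values) / len(values)

mean_hb = arithmetic_mean(haemoglobin)
print(mean_hb)
```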
Another formula for mean
If x has k different values, and f_i is the frequency with which the i-th value x_i occurs in the sample, then the sample mean can be estimated as follows:
Data:
x     x1  x2  ……  xk
freq  f1  f2  ……  fk   (total n)
Formula:
$\bar{X} = \sum_{i=1}^{k} f_i x_i \Big/ \sum_{i=1}^{k} f_i = \sum_{i=1}^{k} f_i x_i \Big/ n$
Eg1b: Estimate the mean
The following data are serum cholesterol (血清胆固醇) measurements from 101 men aged 30-49. Estimate the mean.
Data:
Serum cholest.  Midpoint  Freq.
2.5 ~           3.0       9
3.5 ~           4.0       32
4.5 ~           5.0       42
5.5 ~           6.0       15
6.5 ~           7.0       3
Total                     101
Solution:
n=101,
$\bar{x} = \sum_{i=1}^{k} f_i x_i \Big/ \sum_{i=1}^{k} f_i$
=(3×9+4×32+5×42+6×15+7×3) / 101
= 4.71 (mmol/L)
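The frequency-weighted formula used in Eg1b can be sketched in Python as below. The function name `grouped_mean` is an illustrative choice; the class midpoints stand in for the unknown individual values, so the result is an approximation to the mean of the raw data.

```python
# Class midpoints (mmol/L) and frequencies from the serum cholesterol table.
midpoints = [3.0, 4.0, 5.0, 6.0, 7.0]
freqs = [9, 32, 42, 15, 3]

def grouped_mean(values, frequencies):
    """Mean of values weighted by how often each one occurs."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / total

mean_chol = grouped_mean(midpoints, freqs)
print(round(mean_chol, 2))
```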
The median
The median (中位数):
It is the middle value of the ordered observations in a variable. It is defined as below:
population median: $M = X_{(N+1)/2}$
sample median: $m = x_{(n+1)/2}$
in which $X_1, X_2, \ldots, X_N$ are the ordered values in the population and $x_1, x_2, \ldots, x_n$ are the ordered values in the sample.
The method of estimating the median:
1) Order all values of observations in a variable from smallest to largest;
2) If n is odd, find the middle observation; this value is the required median;
3) If n is even, find the two middle observations; the average of these two values is the required median.
eg, if n=9, then m=x((9+1)/2)=x(5)=x5
if n=10, then m=x((10+1)/2)=x(5.5)=(x5+x6)/2
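The three steps above translate directly into code. The following Python sketch uses a hypothetical helper `median`; the two sample lists are the haemoglobin data and the nine-value example used elsewhere in these slides.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)          # step 1: order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # step 2: n odd, take the single middle value
        return ordered[middle]
    # step 3: n even, average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

haemoglobin = [121, 118, 130, 120, 122, 118, 116, 124, 127, 129, 125, 132]
nine_values = [12, 14, 14, 15, 16, 16, 16, 17, 18]
print(median(haemoglobin), median(nine_values))
```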
Eg2a: Estimate the median
The data listed below are haemoglobin contents (g/L); estimate the median.
Data:
id  1    2    3    4    5    6    7    8    9    10   11   12
x   121  118  130  120  122  118  116  124  127  129  125  132
Solution:
The ordered values are:
116,118,118,120,121,122,
124,125,127,129,130,132.
n=12 is even, therefore
med= (122+124)/2=123
So, the median of the haemoglobin is 123 g/L.
Eg2b: Estimate the median
The following data are serum cholesterol (mmol/L) measurements from 101 men aged 30-49. Estimate the median.
Data:
Serum cholest.  Midpoint  Freq.
2.5 ~           3.0       9
3.5 ~           4.0       32
4.5 ~           5.0       42
5.5 ~           6.0       15
6.5 ~           7.0       3
Solution:
Since n=101 is an odd number, the median is the middle value, that is, the 51st value in order. From the data, the 51st value falls in the class 4.5 ~, whose midpoint is 5.0, ie, the median M=5.0.
A more accurate value of M is obtained by interpolating within that class: 41 observations lie below 4.5, so the 51st value is the 10th of the 42 values in the class, and
M = 4.5+(5.5-4.5) / 42×10=4.74
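The interpolation in Eg2b can be sketched in Python as follows. This mirrors the arithmetic on the slide, which places the median at rank (n+1)/2; other textbooks interpolate at n/2 and get a slightly different value. The function name and parameters are hypothetical.

```python
def grouped_median(lower_limits, width, freqs):
    """Interpolated median for equal-width classes, using rank (n + 1) / 2."""
    n = sum(freqs)
    rank = (n + 1) / 2      # position of the median, 51 when n = 101
    below = 0               # number of observations in the classes already passed
    for lower, f in zip(lower_limits, freqs):
        if below + f >= rank:
            # Assume the f observations are spread evenly across the class.
            return lower + width * (rank - below) / f
        below += f
    raise ValueError("frequency table is empty")

lower_limits = [2.5, 3.5, 4.5, 5.5, 6.5]
freqs = [9, 32, 42, 15, 3]
m_interp = grouped_median(lower_limits, 1.0, freqs)
print(round(m_interp, 2))
```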
Frequency distribution about
mean and median
[Figure: "Central tendency of serum cholesterol", Mean=4.71, Median=5.0; x-axis: Serum Cholesterol (3 to 7); y-axis: Frequency]
Skewed distribution
[Figure: two frequency curves with the median and mean marked on each; one panel is labelled "positive or right skewed", the other "negative or left skewed"]
Comparing mean and median
                    mean                             median
information         more (actual values)             less (ranks)
data available      not available for ordinal data   available for ordinal and numerical data
size in magnitude   symmetric: Mean = median
                    + skewed:  Mean > median
                    - skewed:  Mean < median
The definition of median
• The median is a value for which no more than half the data are smaller than it and no more than half the data are larger than it.
• eg, 12, 14, 14, 15, 16, 16, 16, 17, 18.
M=16, for which four values are < M and two values are > M.
The Geometric mean
When the distribution of a variable is not symmetric, or the data have no upper or lower bound, the geometric mean is often a better measure of central tendency than the arithmetic mean.
Eg3. The following data are 10 patients' white blood cell counts (×1000): 11, 9, 35, 5, 9, 8, 3, 10, 12, 8. Estimate the arithmetic mean and geometric mean.
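The slides do not show a formula for the geometric mean. The sketch below assumes the standard definition, the n-th root of the product of n positive values, computed here as the exponential of the mean of the logarithms. The helper name `geometric_mean` is illustrative.

```python
import math

# White blood cell counts (x1000) of the 10 patients in Eg3.
wbc = [11, 9, 35, 5, 9, 8, 3, 10, 12, 8]

def geometric_mean(values):
    """exp of the mean of the natural logs; every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

arithmetic = sum(wbc) / len(wbc)
geometric = geometric_mean(wbc)
print(arithmetic, round(geometric, 2))
```

For these data the arithmetic mean is 11.0 and the geometric mean is about 9.03; the single large count of 35 pulls the arithmetic mean up more than the geometric mean.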
The mode
The mode (众数):
It is defined as the most frequently occurring value or values in a set of data.
• It marks a point of relatively great concentration.
• If a data set consists of the values:
6,7,7,8,8,8,8,9,10,11,11,12,12,12,12,13
then the modes are 8 and 12.
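Because a data set can have more than one mode, a helper that returns every most frequent value is convenient. A minimal Python sketch, with an illustrative function name:

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

data = [6, 7, 7, 8, 8, 8, 8, 9, 10, 11, 11, 12, 12, 12, 12, 13]
print(modes(data))
```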
Summary
• Frequency distribution
• Histogram & polygon
• Measures of central tendency
• Measures of dispersion
Note: When the subintervals are not of equal width, or the data have no upper or lower bound, a polygon is more suitable than a histogram.
[Figure: "Frequency distribution of birthweight"; x-axis: weight in ounces (体重(盎司)), 0 to 180; y-axis: frequency (频数), 0 to 25]
New words
• Dispersion
离散程度
• Range
全距
• Deviation
离均差
• Variance
方差
• Standard deviation
标准差
• Coefficient of variation
变异系数
• Quartile
四分位数
• Percentile
百分位数
• Inter-quartile interval
四分位间距
3. Dispersion
Dispersion (离散程度):
The indication of the spread of measurements around the center of a variable's distribution.
The major measures of dispersion are:
range, variance, standard deviation, interquartile range, coefficient of variation, etc.
The range
The range (全距):
It measures the length of the interval over which the data are distributed.
Population range: Range = max - min
Sample range: Range = max - min
* It is a simple measure and has the same unit as the original data.
# It uses little information (only max & min);
# The sample range underestimates the population range (biased, inefficient);
# It conveys no information about the middle of the distribution.
The quartiles
The first quartile (第一四分位数) Q1:
It is a value for which no more than 25% of observed values are less than it, and no more than 75% of observed values are greater than it.
[Diagram: ordered values X1 to Xn with Q1 marked; ≤25% below, ≤75% above]
The second quartile (第二四分位数) Q2=M:
It is a value for which no more than 50% of observed values are less than it, and no more than 50% of observed values are greater than it.
[Diagram: ordered values X1 to Xn with M marked; ≤50% below, ≤50% above]
The third quartile (第三四分位数) Q3:
It is a value for which no more than 75% of observed values are less than it, and no more than 25% of observed values are greater than it.
[Diagram: ordered values X1 to Xn with Q3 marked; ≤75% below, ≤25% above]
Location of quartiles
[Diagram: ordered values X1 to Xn split by Q1, Q2 = M and Q3 into four parts of ≤25% each; ≤50% lie on either side of M]
The method of estimating the
quartiles
If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.
Eg1: Estimate the quartiles
Data:
A (n=9):   34  36  37  39  40  41  42  43  79
B (n=10):  34  36  37  39  40  41  42  43  44  45
The inter-quartile range (四分位数间距):
It is the difference between Q3 and Q1: Q3-Q1.
[Diagram: ordered values X1 to Xn with Q1, M and Q3 marked; the middle 50% of the data lie between Q1 and Q3]
Eg2: Estimate the interquartile range
Data:
A (n=9):   34  36  37  39  40  41  42  43  79
B (n=10):  34  36  37  39  40  41  42  43  44  45
Interquartile range of A=42.5-36.5=6.0
Interquartile range of B=43.5-37.0=6.5
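The quartile subscripts themselves did not survive in this transcript. The Python sketch below assumes subscripts (n+1)/4 for Q1 and 3(n+1)/4 for Q3, rounded up to the nearest integer or half-integer. This assumption is consistent with the median formula x((n+1)/2) and reproduces the Eg2 results for both data sets, but it is a reconstruction, not the formula from the slides. The function names are hypothetical.

```python
import math

def order_statistic(ordered, position):
    """Value at a 1-based position that is an integer or a half-integer."""
    lower = math.floor(position)
    if position == lower:
        return float(ordered[lower - 1])
    # Half-integer position: average the two neighbouring ordered values.
    return (ordered[lower - 1] + ordered[lower]) / 2

def quartiles(values):
    """Return (Q1, Q3); assumes at least 3 observations."""
    ordered = sorted(values)
    n = len(ordered)
    # ASSUMED subscripts (n+1)/4 and 3(n+1)/4, rounded up to the nearest 0.5.
    q1_pos = math.ceil(2 * (n + 1) / 4) / 2
    q3_pos = math.ceil(2 * 3 * (n + 1) / 4) / 2
    return order_statistic(ordered, q1_pos), order_statistic(ordered, q3_pos)

a = [34, 36, 37, 39, 40, 41, 42, 43, 79]
b = [34, 36, 37, 39, 40, 41, 42, 43, 44, 45]
for name, data in (("A", a), ("B", b)):
    q1, q3 = quartiles(data)
    print(name, q1, q3, q3 - q1)
```

Note that the extreme value 79 in data A changes the range a great deal but has no effect on its interquartile range.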
The percentiles
The αth percentile (α百分位数) Pα:
It is a value for which no more than α% of the data are less than it, and no more than (100-α)% are larger than it, where 0 ≤ α ≤ 100.
• P0=min, P100=max
• P25= Q1, P50= Q2=M, P75= Q3.
The method of estimating the
percentiles
If the subscript is not an integer or half-integer, then it is rounded up to the nearest integer or half-integer.
Eg3: Estimate the percentiles
Data:
A (n=9):   34  36  37  39  40  41  42  43  79
B (n=10):  34  36  37  39  40  41  42  43  44  45
• For data A:
P0=34, P10=34, P20=36, P30=37,
…, P90=79, P100=79.
• For data B:
P0=34, P10=34, P20=36, P30=37,
…, P90=44, P100=45.
• Note: there are many ways to estimate percentiles, so the results are not unique.
The variance
The variance (Var, 方差):
It measures the average squared dispersion of the data about the mean.
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$
Sample variance: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
note: the degrees of freedom are not the same: N and n-1.
* It conveys information about the middle of the distribution.
* S2 is an unbiased estimate of σ2; both are non-negative values;
# The unit is not the same as that of the original data (it is the squared unit).
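Simplified formulas of variance
Population variance: $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} X_i^2 - \mu^2$
Sample variance: $S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n} \right)$
Proof of the simplified formula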
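The proof itself is not preserved in this transcript. One standard way to obtain the shortcut for the sample variance, written with the same symbols as the mean formula ($X_i$, $\bar{X}$, $n$), is to expand the square and use $\sum X_i = n\bar{X}$:

```latex
\begin{aligned}
\sum_{i=1}^{n} (X_i - \bar{X})^2
  &= \sum_{i=1}^{n} X_i^2 - 2\bar{X} \sum_{i=1}^{n} X_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} X_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} X_i^2 - n\bar{X}^2
   = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}
\end{aligned}
```

Dividing both sides by $n-1$ gives the shortcut form of $S^2$; the population version follows in the same way with $N$ and $\mu$ in place of $n$ and $\bar{X}$.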
Eg4a: Estimate the variance
Data:
id  x   x*x
1   1   1
2   2   4
3   3   9
4   4   16
5   5   25
------------------
∑   15  55
Solution:
n = 5, $S^2 = \frac{55 - 15^2/5}{5-1} = \frac{10}{4} = 2.5$
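As a check on Eg4a, the sample variance can be computed both from the definition and from the sums of x and x*x in the table. Both function names below are illustrative.

```python
x = [1, 2, 3, 4, 5]

def sample_variance(values):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def sample_variance_shortcut(values):
    """Shortcut: (sum of squares - (sum)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(sample_variance(x), sample_variance_shortcut(x))
```

The two forms agree; the shortcut needs only the two column totals, which is why the table lists x*x.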
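Another formula for variance
Data:
x     x1  x2  ……  xk
freq  f1  f2  ……  fk   (total n)
Formula:
$S^2 = \frac{1}{n-1} \left( \sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{n} \right)$, where $n = \sum_{i=1}^{k} f_i$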
Eg4b: Estimate the variance
Data:
id  x   f   f*x  f*x*x
1   1   3   3    3
2   2   3   6    12
3   3   2   6    18
4   4   1   4    16
5   5   2   10   50
----------------------------
∑   15  11  29   99
Solution:
n = 11, $S^2 = \frac{99 - 29^2/11}{11-1} \approx 2.25$
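For frequency data such as Eg4b, the same shortcut can be applied with each value weighted by its frequency. The sketch below uses a hypothetical helper `grouped_sample_variance` and the column totals from the table (n = 11, sum of f*x = 29, sum of f*x*x = 99).

```python
def grouped_sample_variance(values, freqs):
    """Sample variance when value x_i occurs f_i times."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

x = [1, 2, 3, 4, 5]
f = [3, 3, 2, 1, 2]
var_grouped = grouped_sample_variance(x, f)
print(round(var_grouped, 2))
```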
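The standard deviation
The standard deviation (sd, SD, 标准差):
It is the square root of the variance and measures the typical dispersion of the data about the mean.
Population sd: $\sigma = \sqrt{\sigma^2}$
Sample sd: $s = \sqrt{S^2}$
* It conveys information about the spread around the mean of the distribution.
* s is the usual estimate of σ (S2 is unbiased for σ2, but s itself is slightly biased); both are non-negative values;
* The unit is the same as the original data.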
Eg5: Estimate the SD
Data:
id  x   x*x
1   1   1
2   2   4
3   3   9
4   4   16
5   5   25
------------------
∑   15  55
Solution:
$S^2 = \frac{55 - 15^2/5}{5-1} = 2.5$, so $s = \sqrt{2.5} \approx 1.58$
The coefficient of variation
The coefficient of variation (cv, CV, 变异系数): It measures the variation relative to the mean.
Population cv: $CV = \frac{\sigma}{\mu} \times 100\%$
Sample cv: $cv = \frac{s}{\bar{X}} \times 100\%$
* It measures relative variability or relative dispersion.
* Its value does not depend on the unit of the variable, unlike the variance or standard deviation, which carry units.
* It can be used to compare the variation of variables with different units.
Eg6: Estimate the CV
       Data: age  Data: weight  Data: weight
id     x          y             y
1      1          11            110
2      2          12            120
3      3          13            130
4      4          14            140
5      5          15            150
sum:   15         65            650
mean:  3          13            130
var:   2.5        2.5           250
sd:    1.58       1.58          15.8
cv:    52.70      12.16         12.16
Coding effects:
(1) +-: S is unchanged;
(2) ×÷: CV is unchanged.
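The two coding effects in Eg6 can be checked numerically. The sketch below uses the standard-library `statistics` module; the helper `cv_percent` and the way the two weight columns are generated from the age column (add 10, then multiply by 10) are illustrative choices that happen to reproduce the three data sets of the table.

```python
import statistics

def cv_percent(values):
    """Sample coefficient of variation in percent: 100 * s / mean."""
    return 100 * statistics.stdev(values) / statistics.mean(values)

age = [1, 2, 3, 4, 5]
weight = [v + 10 for v in age]           # coding by addition: 11..15
weight_x10 = [v * 10 for v in weight]    # coding by multiplication: 110..150

for name, data in (("age", age), ("weight", weight), ("weight x10", weight_x10)):
    print(name, round(statistics.stdev(data), 2), round(cv_percent(data), 2))
```

Adding a constant leaves s unchanged but changes the CV, because the mean shifts; multiplying by a positive constant scales s and the mean together, so the CV is unchanged.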
Summary
1. Measures of central tendency:
mean, median, mode.
2. Measures of dispersion:
variance, standard deviation,
range, inter-quartile range, CV.