Download - MediPIET

Measures of central tendency and dispersion
Tunis, 28th October 2014
Dr Ghada Abou Mrad
Ministry of Public Health, Lebanon
[email protected]
Learning objectives
• Define the different types of variables and
data within a population or a sample
• Describe data using the common measures of
central tendency (Mode, Median, arithmetic
Mean)
• Describe data in terms of their measures of
dispersion (range, standard
deviation/variance, standard error)
Variable
• A population is any complete group of units (such as
persons or businesses) with at least one characteristic in
common. It needs to be clearly identified at the
beginning of a study.
• A sample is a subset of the units in a population,
selected to represent all units in the population of
interest
• A variable is any characteristic, number, or quantity
that can be measured or counted. It is called a
variable because its value may vary in the population
and over time; it is represented by “X” in a
population and “x” in a sample
Data
• Data are the measurements or observations or
values that are collected for a specific variable in a
population or a sample; an observation can be
represented by “Xi” in a population and “xi” in a
sample
– A data unit (or unit record or record) is one entity (such as a person or
business) in the population being studied, for which data are collected.
– A data item (or variable) is a characteristic (or attribute) of a data unit
which is measured or counted, such as height.
Data item: one column (age, sex, height)
Data unit: one row (one observation, Obs)
Obs #   age   sex   height
1       20    M     175
2       16    F     163
3       23    F     170
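As a minimal sketch, the three records in the table could be held in Python as one dictionary per data unit, with one key per data item. The key names are illustrative choices, not part of the slides.

```python
# Each dictionary is one data unit (one person);
# each key is one data item (variable).
records = [
    {"obs": 1, "age": 20, "sex": "M", "height": 175},
    {"obs": 2, "age": 16, "sex": "F", "height": 163},
    {"obs": 3, "age": 23, "sex": "F", "height": 170},
]

# The observations x_i for a single variable form one column of the data.
ages = [record["age"] for record in records]
n = len(ages)  # number of observations in the sample

print(ages, n)
```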
Dataset
• A dataset is a complete collection of
all observations for a specific variable
in a population or a sample; it is called
a raw dataset if the data have not
been organized; the total number of
observations in a dataset can be
represented by “N” for a population
and “n” for a sample
• Example: Ages of students in a class
(years)
27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
30, 30, 27, 30, 28, 31, 32, 30, 29, 29
Types of variables
Variable
– Qualitative: nominal, ordinal
– Quantitative: discrete, continuous
Types of variables
• Qualitative variable: has values that describe a
'quality'; it is also called a categorical variable
– Nominal: Observations take values that cannot be
organized in a logical sequence, like sex or
eye color
– Ordinal: Observations take values that can be
logically ordered from lowest to highest, like clothing
size (e.g. small, medium, large)
• The data collected for a qualitative variable are
qualitative data
Types of variables
• Quantitative variable: has values that describe
a measurable quantity; it is also called a numeric
variable; its values can be ordered from lowest to highest
– Discrete: Observations take values based on a
count from a set of values. They cannot take a
fractional value between one value and the next closest
value. Ex: number of children in a family
– Continuous: Observations can take any value
within a certain range of real numbers. Ex: height
• The data collected for a quantitative variable are
quantitative data
Descriptive statistics
Statistics describe or summarize data
• Most data can be ordered from lowest to highest
• The frequency is the number of times an
observation occurs for a variable; the frequency
distribution can be shown in a table or in a graph
such as a histogram
• Quantitative data can be described using the
common measures of central tendency (Mode,
Median, Mean) and the measures of dispersion
(range, standard deviation/variance, standard
error)
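A minimal sketch of how a frequency distribution could be built from raw data, using the ages of the 20 students from the class example. The text histogram is only an illustration of the idea; the variable names are assumptions.

```python
from collections import Counter

# Raw dataset: ages of students in a class (years).
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

ordered = sorted(ages)        # order the data from lowest to highest
frequency = Counter(ordered)  # number of times each value occurs

# Print one row per age, including ages that never occur (frequency 0).
for age in range(min(ages), max(ages) + 1):
    count = frequency[age]
    print(f"{age:>3} {count:>2} {'#' * count}")

print("Total", sum(frequency.values()))
```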
Ages of the 20 students, in order:
Obs:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age: 27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Frequency table:
Age     Frequency
27      2
28      3
29      4
30      5
31      2
32      1
33      0
34      1
35      0
36      1
37      1
Total   20
Frequency distribution
Histogram
[Figure: histogram of the ages of the 20 students; horizontal axis: age from 27 to 37 years, vertical axis: frequency]
Histogram - Outliers
Outliers are extreme, or atypical data value(s) that are notably different from
the rest of the data.
[Figure: histogram of number of patients (0 to 6) by nights of stay (0 to 50)]
Epidemic curve
[Figure: number of people (0 to 20) by age group (0-9 to 90-99 years), annotated “Central Location ?” and “Spread ?”]
Measures of central tendency and spread
Central Location / Position / Tendency
A single value that is a good summary of an
entire distribution of data
Spread / Dispersion / Variability
How much the distribution is spread or
dispersed from its central location
Measure of Central Tendency
• Also known as measure of central position or
location
• It is a single value that summarizes an entire
distribution of data
• Common measures
– Mode
– Median
– Arithmetic mean
Mode
Mode is the value that occurs most frequently
Method for identification
1. Arrange data into frequency distribution or
histogram, showing the values of the variable and
the frequency with which each value occurs
2. Identify the value that occurs most often
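The two steps above can be sketched in Python as follows. This is a minimal illustration using the ages of the 20 students; the function name `modes` and the convention of returning an empty list when no value repeats are assumptions made for this example.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    if not values:
        return []
    frequency = Counter(values)          # step 1: frequency distribution
    highest = max(frequency.values())    # step 2: highest frequency
    if highest == 1:
        return []                        # assumed convention: no mode
    return sorted(v for v, c in frequency.items() if c == highest)

ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

print(modes(ages))             # one mode
print(modes([1, 1, 2, 2, 3]))  # more than one mode
print(modes([1, 2, 3]))        # no mode
```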
Mode = 30 years
[Figure: histogram of the ages of the 20 students (27 to 37 years); the tallest bar, at age 30, marks the mode]
Unimodal Distribution
[Figure: distribution of a population with a single peak]
Bimodal Distribution
[Figure: distribution of a population with two peaks]
Mode – Properties / Uses
• Easiest measure to understand, explain, identify
• Always equals an original value
• Does not use all the data
• Insensitive to extreme values (outliers)
• May be more than one mode
• May be no mode
Median
Median is the middle value; it splits the
distribution into two equal parts
– 50% of observations are below the median
– 50% of observations are above the median
Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value in the middle
Median: uneven number of values
Ages in order (n = 19): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36
Position of the median = $\frac{n+1}{2} = \frac{19+1}{2} = \frac{20}{2} = 10$, i.e. the 10th observation
Median age = 30 years
Median: even number of values
Ages in order (n = 20): 27, 27, 28, 28, 28, 29, 29, 29, 29, 30, 30, 30, 30, 30, 31, 31, 32, 34, 36, 37
Position of the median = $\frac{n+1}{2} = \frac{20+1}{2} = \frac{21}{2} = 10.5$
Median age = average value between 10th and 11th observation = $\frac{30+30}{2} = 30$ years
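A minimal sketch of the median method for both cases (uneven and even number of values), using the ordered ages of the students. The function name `median_by_position` is an assumption for this example; Python's standard library also offers `statistics.median`, which gives the same result.

```python
import statistics

def median_by_position(values):
    """Median found via the middle position (n + 1) / 2 of the ordered data."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)                 # 1. arrange observations in order
    n = len(ordered)
    position = (n + 1) / 2                   # 2. middle position (1-based)
    if n % 2 == 1:
        return ordered[int(position) - 1]    # 3. value in the middle
    lower = ordered[n // 2 - 1]              # observation just below the middle
    upper = ordered[n // 2]                  # observation just above the middle
    return (lower + upper) / 2               # average of the two middle values

ages_20 = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
           30, 30, 30, 30, 31, 31, 32, 34, 36, 37]
ages_19 = ages_20[:-1]  # the first 19 observations

print(median_by_position(ages_19))  # uneven number of values
print(median_by_position(ages_20))  # even number of values
```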
Median – Properties / Uses
• Does not use all the data available
• Insensitive to extreme values (outliers)
• Measure of choice for skewed data
Arithmetic Mean
Arithmetic mean = “average” value = μ
Method for identification
1. Sum up (Σ) all of the values (xi)
2. Divide the sum by the number of
observations (n)
$\mu = \frac{\sum x_i}{n}$
Example (ages of the 20 students): n = 20, $\sum x_i = 605$
$\mu = \frac{605}{20} = 30.25$ years
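The same calculation, as a minimal Python sketch on the ages of the 20 students (variable names are illustrative):

```python
ages = [27, 30, 28, 31, 28, 36, 29, 37, 29, 34,
        30, 30, 27, 30, 28, 31, 32, 30, 29, 29]

total = sum(ages)   # step 1: sum of all values
n = len(ages)       # number of observations
mean = total / n    # step 2: divide the sum by n

print(total, n, mean)
```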
Since the mean uses all data,
it is sensitive to outliers
[Figure: histogram of number of patients by nights of stay, axis 0 to 50 nights; Mean = 12.0]
[Figure: histogram of number of patients by nights of stay, axis 0 to 150 nights; Mean = 15.3]
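A minimal sketch of this sensitivity, using the 30 length-of-stay values listed later in this deck. It assumes the second histogram is the same data with the longest stay recorded as 149 nights instead of 49; that assumption is consistent with the stated mean of 15.3 but is not spelled out in the slides.

```python
import statistics

# Length of stay (nights) for 30 patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Assumption: same data, but the longest stay is 149 nights instead of 49.
stays_with_outlier = stays[:-1] + [149]

def arithmetic_mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)

for label, data in (("original", stays), ("with outlier", stays_with_outlier)):
    print(f"{label}: mean = {arithmetic_mean(data):.1f}, "
          f"median = {statistics.median(data)}")
```

The mean moves when a single extreme value changes, while the median stays where it was.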
When to use the arithmetic mean?
• Centered distribution
• Approximately symmetrical
• Few extreme values (outliers)
When to use the arithmetic mean? (ii)
[Figure: four example distributions numbered 1 to 4, one of which is marked “OK!”]
Arithmetic Mean – Properties / Uses
• Uses all of the data
• Affected by extreme values (outliers)
• Best for normally distributed data
• Not usually equal to one of the original values
How does the shape of a distribution influence
the Measures of Central Tendency?
Symmetrical:
Mode = Median = Mean
Skewed right:
Mode < Median < Mean
Skewed left:
Mean < Median < Mode
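As a hedged illustration of the right-skewed case, the sketch below computes the three measures for the 30 length-of-stay values used elsewhere in this deck, which have a long tail to the right. In this particular sample the mode and the median coincide, so the ordering holds only as mode ≤ median < mean.

```python
import statistics

# Length of stay (nights) for 30 patients: a right-skewed dataset.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

mode = statistics.mode(stays)      # most frequent value
median = statistics.median(stays)  # middle value
mean = statistics.mean(stays)      # arithmetic mean

print(f"mode = {mode}, median = {median}, mean = {mean}")
print("mean pulled towards the long right tail:", mean > median)
```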
Epidemic curve
[Figure: number of people (0 to 20) by age group (0-9 to 90-99 years), annotated “Central Location ?” and “Spread ?”]
Same center but … different dispersions
Measures of Spread
Measures that quantify the variation or dispersion
of a set of data from its central location
• Also known as “Measure of dispersion / variation”
• Common measures
– Range
– Variance / standard deviation
– Standard error
Range
Range = Difference between largest and smallest
values in a dataset
Properties / Uses:
– Greatly affected by outliers
– Usually used with median
Finding the Range of Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
12, 13, 14, 16, 18, 18, 19, 22, 27, 49
[Figure: histogram of number of patients by nights of stay, axis 0 to 50 nights]
Range – Sensitive to Outliers?
[Figure: histogram of number of patients by nights of stay, axis 0 to 50 nights; Range = 49 - 0 = 49]
[Figure: histogram of number of patients by nights of stay, axis 0 to 150 nights; Range = 149 - 0 = 149]
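A minimal sketch of the range calculation on the length-of-stay data. The second dataset assumes that only the longest stay differs (149 nights instead of 49), which matches the stated range of 149 but is an assumption about the underlying data.

```python
# Length of stay (nights) for 30 patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Assumption: same data, but the longest stay is 149 nights instead of 49.
stays_with_outlier = stays[:-1] + [149]

def value_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

print(value_range(stays))               # 49 - 0
print(value_range(stays_with_outlier))  # 149 - 0
```

A single extreme observation is enough to triple the range here, which is why the range is described as greatly affected by outliers.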
Variance and Standard Deviation
Measures of variation that quantify how closely
clustered the observed values are around the mean;
measures of the spread of the data around the mean
Variance
= average of squared deviations from the mean
= Sum (each value – mean)² / (n – 1)
Standard deviation
= square root of variance
Variance and Standard Deviation (ii)
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$\bar{x}$: mean
$x_i$: value
$n$: number of observations
$s^2$: variance
$s$: standard deviation
Steps to Calculate Variance and Standard Deviation
($\bar{x}$: mean; $x_i$: value; $n$: number of observations; $s^2$: variance; $s$: standard deviation)
1. Calculate the arithmetic mean: $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n – 1: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
6. Take the square root of the variance: $s = \sqrt{s^2}$
Length of Stay Data
(0 – 12)² = 144
(2 – 12)² = 100
(3 – 12)² = 81
(4 – 12)² = 64
(5 – 12)² = 49
(5 – 12)² = 49
(6 – 12)² = 36
(7 – 12)² = 25
(8 – 12)² = 16
(9 – 12)² = 9
(9 – 12)² = 9
(9 – 12)² = 9
(10 – 12)² = 4
(10 – 12)² = 4
(10 – 12)² = 4
(10 – 12)² = 4
(10 – 12)² = 4
(11 – 12)² = 1
(12 – 12)² = 0
(12 – 12)² = 0
(12 – 12)² = 0
(13 – 12)² = 1
(14 – 12)² = 4
(16 – 12)² = 16
(18 – 12)² = 36
(18 – 12)² = 36
(19 – 12)² = 49
(22 – 12)² = 100
(27 – 12)² = 225
(49 – 12)² = 1369
Sum = 2448; Var = 2448 / 29 = 84.4; SD = √84.4 = 9.2
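The six steps can be checked with a short Python sketch on the same length-of-stay data. Variable names are illustrative; the cross-check uses `statistics.variance` and `statistics.stdev`, which also divide by n – 1.

```python
import math
import statistics

# Length of stay (nights) for 30 patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stays)
mean = sum(stays) / n                    # 1. arithmetic mean
deviations = [x - mean for x in stays]   # 2. subtract the mean
squared = [d ** 2 for d in deviations]   # 3. square each difference
sum_squared = sum(squared)               # 4. sum the squared differences
variance = sum_squared / (n - 1)         # 5. divide by n - 1
sd = math.sqrt(variance)                 # 6. square root of the variance

print(f"sum = {sum_squared:.0f}, variance = {variance:.1f}, SD = {sd:.1f}")

# Cross-check against the standard library (sample variance and SD).
print(statistics.variance(stays), statistics.stdev(stays))
```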
Standard Deviation
Standard deviation is usually calculated only when data are
more or less normally distributed (bell-shaped curve)
For normally distributed data,
• 68.3% of the data fall within plus/minus 1 SD of the mean
• 95.4% of the data fall within plus/minus 2 SD of the mean
• 95.0% of the data fall within plus/minus 1.96 SD of the mean
• 99.7% of the data fall within plus/minus 3 SD of the mean
The standard deviation of a normal distribution enables the
calculation of confidence intervals
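These proportions can be reproduced from the normal distribution itself. The sketch below relies on the standard identity that the share of a normal distribution within k standard deviations of the mean equals erf(k / √2); the function name `coverage` is an assumption for this example.

```python
import math

def coverage(k):
    """Proportion of a normal distribution within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"within plus/minus {k} SD: {coverage(k):.2%}")
```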
Normal Distribution
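[Figure: normal (bell-shaped) curve centred on the mean, with the standard deviation marked; the central regions cover 68% and 95% of the area, with 2.5% in each tail]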
Properties of Measures of Central Location and Spread
• For quantitative / continuous variables
• Mode – simple, descriptive, not always useful
• Median – best for skewed data
• Arithmetic mean – best for normally distributed data
• Range – use with median
• Standard deviation – use with mean
• Standard error – used to construct confidence intervals
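The slides name the standard error without giving a formula. The sketch below assumes the usual standard error of the mean, s / √n, and a normal approximation for a 95% confidence interval (mean ± 1.96 × standard error). It is applied to the length-of-stay data purely as an illustration; with skewed data and n = 30 the interval is only approximate.

```python
import math
import statistics

# Length of stay (nights) for 30 patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stays)
mean = statistics.mean(stays)
sd = statistics.stdev(stays)         # sample SD (divides by n - 1)

# Assumed formula: standard error of the mean = s / sqrt(n).
standard_error = sd / math.sqrt(n)

# Approximate 95% confidence interval for the mean.
lower = mean - 1.96 * standard_error
upper = mean + 1.96 * standard_error

print(f"SE = {standard_error:.2f}, 95% CI = {lower:.1f} to {upper:.1f}")
```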
Name the appropriate measures of central Location and Spread
Distribution                    Central Location    Spread
Single peak, symmetrical        ?                   ?
Skewed or data with outliers    ?                   ?
Name the appropriate measures of central Location and Spread
Distribution                    Central Location    Spread
Single peak, symmetrical        Mean*               Standard deviation
Skewed or data with outliers    Median              Range or interquartile range
* Median and mode will be similar
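For the skewed case, the median and interquartile range can be sketched as below. Quartile conventions differ between textbooks and software; this example uses the default ("exclusive") method of Python's `statistics.quantiles`, which is a choice made here and not one stated in the slides. The second dataset assumes the longest stay is 149 nights instead of 49.

```python
import statistics

# Length of stay (nights) for 30 patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9,
         9, 9, 10, 10, 10, 10, 10, 11, 12, 12,
         12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Assumption: same data, but the longest stay is 149 nights instead of 49.
stays_with_outlier = stays[:-1] + [149]

def interquartile_range(values):
    """3rd quartile minus 1st quartile (default 'exclusive' method)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

q1, q2, q3 = statistics.quantiles(stays, n=4)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {q3 - q1}")
print("IQR with outlier:", interquartile_range(stays_with_outlier))
```

Unlike the range, the interquartile range is the same for both datasets, because the extreme value lies outside the middle half of the data.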
Any questions?
[Figure: distribution of a population by age, annotated with the mode, median, minimum, 1st quartile, 3rd quartile, maximum, interquartile interval and range]
Thank you!