Biostatistics I
Descriptive statistics and some things related
to the normal distribution
Outline
Descriptive statistics
  Frequency distributions (relative, cumulative), histograms
  Measures of central tendency (mean, median, mode)
  Deviations and measures of variation
Some thoughts on the normal distribution
  Z-score, CV, confidence interval
  Standard error of the mean
  How many samples?
Population and sample 1
Population: a finite number of separate objects defined in space and time. Examples:
  All boats operating in a country’s EEZ in the year 2009
  Boats of a particular type operating in a country’s waters in January 2009
  The number of queen conch in a country’s EEZ
Sample: a subset of a population.
  Usually a sample is an order of magnitude smaller than the population.
Population and sample 2
Use information from a sample to make inferences about the population.
The population is unknown; the sample is known.
We can only make inferences about the population from the sample if the sample is representative of the population.
Frequency distributions
Frequency distributions 1
The objective of frequency tabulation is to condense the raw data into a more useful form that allows visual interpretation of the data.
How can we make a quick summary of the data below?
Let's say that the data contain length measurements of 30 fish (n = 30).
We can quickly see that the smallest fish is 3.4 cm and that the largest is 15.3 cm.
Measurement number (i)   Length of fish i (cm)
 1                       13.1
 2                       11.7
 3                        9.0
 4                        7.0
 5                        9.9
 6                        5.1
 7                       11.6
 8                        6.4
 9                        8.0
10                        8.7
11                       13.0
12                       11.6
13                        8.7
14                       12.8
15                        7.5
16                       12.1
17                       10.8
18                       11.5
19                       10.3
20                        3.4
21                        8.1
22                        9.4
23                        5.6
24                       12.6
25                       12.4
26                        3.4
27                        4.1
28                       15.3
29                        7.3
30                       10.8
Frequency distributions 2
How it's done:
Decide on the number of classes to include in the frequency distribution.
  Here: 7 length classes
Find the class width: determine the range of the data, divide the range by the number of classes and round up to the next convenient number.
  Range: 15.3 cm – 3.4 cm = 11.9 cm
  11.9 cm / 7 = 1.7 cm, rounded up to 2 cm
Find the class limits: start with the lowest value (rounded down) as the lower limit of the first class, add the class width to this to obtain the lower limit of the second class, etc.
  Lowest class limit = 2 cm
  Next one: 2 + 2 = 4 cm, etc.
Count the number of fish in each length class, either with pencil and paper or with a computer program.
Lengths of the 30 fish sorted from smallest to largest (cm):
 3.4   3.4   4.1   5.1   5.6   6.4   7.0   7.3   7.5   8.0
 8.1   8.7   8.7   9.0   9.4   9.9  10.3  10.8  10.8  11.5
11.6  11.6  11.7  12.1  12.4  12.6  12.8  13.0  13.1  15.3

Length class (cm)   Number of fish
2-4                  2
4-6                  3
6-8                  5
8-10                 6
10-12                7
12-14                6
14-16                1
Sum                 30

In these counts a fish whose length falls exactly on a class limit is counted in the lower class (8.0 cm is counted in 6-8).
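The tabulation steps can be sketched in a few lines of Python. This is only an illustration: the variable names are made up here, and the boundary convention (lower limit excluded, upper limit included) is an assumption chosen because it reproduces the class counts in the example.

```python
import math

# Fish lengths (cm) from the example, in measurement order.
lengths = [13.1, 11.7, 9.0, 7.0, 9.9, 5.1, 11.6, 6.4, 8.0, 8.7,
           13.0, 11.6, 8.7, 12.8, 7.5, 12.1, 10.8, 11.5, 10.3, 3.4,
           8.1, 9.4, 5.6, 12.6, 12.4, 3.4, 4.1, 15.3, 7.3, 10.8]

n_classes = 7
data_range = max(lengths) - min(lengths)          # 15.3 - 3.4
class_width = math.ceil(data_range / n_classes)   # 1.7 rounded up to 2
lowest_limit = 2                                  # chosen "convenient" lower limit

# Lower limit of each class: 2, 4, ..., 14
lower_limits = [lowest_limit + k * class_width for k in range(n_classes)]

# Count fish with lower < length <= upper (assumed boundary convention).
counts = [sum(1 for x in lengths if lo < x <= lo + class_width)
          for lo in lower_limits]

for lo, c in zip(lower_limits, counts):
    print(f"{lo}-{lo + class_width} cm: {c}")
print("Sum:", sum(counts))
```

With the other common convention (lower limit included, upper excluded) the 8.0 cm fish would move to the 8-10 class, so it is worth stating which convention is used.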
Relative and cumulative frequency
Class (cm)   Frequency   Relative   Cumulative
2-4           2          0.067      0.067
4-6           3          0.100      0.167
6-8           5          0.167      0.333
8-10          6          0.200      0.533
10-12         7          0.233      0.767
12-14         6          0.200      0.967
14-16         1          0.033      1.000
Sum          30          1.000

Relative frequency is the proportion of the observations that fall within a class.
Cumulative (relative) frequency is the sum of the relative frequencies of all classes up to and including the class indicated.
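As a small sketch (the names are illustrative only), the relative and cumulative columns follow directly from the class counts:

```python
from itertools import accumulate

counts = [2, 3, 5, 6, 7, 6, 1]      # fish per length class, from the table
n = sum(counts)

relative = [c / n for c in counts]
# Running total of the counts divided by n gives the cumulative relative frequency.
cumulative = [c / n for c in accumulate(counts)]

classes = ["2-4", "4-6", "6-8", "8-10", "10-12", "12-14", "14-16"]
for label, c, r, cum in zip(classes, counts, relative, cumulative):
    print(f"{label:>6} {c:>3} {r:.3f} {cum:.3f}")
```

Accumulating the integer counts first and dividing afterwards avoids small rounding drift in the last class.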
Various ways for displaying frequency
[Figure: four displays of the same data. Histogram: observations (n) per length class. Relative frequency: proportion per length class. Cumulative frequency: cumulative n against length (cm). Relative cumulative frequency: cumulative proportion against length (cm).]
How would one verbally describe: 1) the general characteristics of the data?
2) the different forms of presentations of the same data?
Some mathematical bookkeeping
GENERAL
n: number of measurements
Lowest value: Xmin
Highest value: Xmax
Range: Xmax – Xmin
j: class number
Class boundaries: L1, L2, …, Lj+1
Class range: dl = Lj+1 – Lj
Class midpoint: (Lj+1 + Lj)/2
nj: number of fish in class j
Relative frequency: nj / n
Cumulative frequency: Hmm …, let's wait for that one

OUR EXAMPLE
n = 30 fish
Xmin = 3.4 cm
Xmax = 15.3 cm
Range = 15.3 – 3.4 = 11.9 cm
j = 1, 2, … 7
Class boundaries: 2, 4, … 16 cm
Class range: 4 – 2 = 2 cm
Class midpoint: (4 + 2)/2 = 3 cm
nj = 2, 3, 5, 6, 7, 6, 1
Relative frequency: 0.067, 0.100, 0.167, …, 0.033
Cumulative frequency: 0.067, 0.167, 0.333, …, 1.000
Number of classes?
Generally no fewer than 5 and no more than 15.
Depends in part on:
  The number of observations: the more observations, the greater the number of classes.
  The nature of the data: if the sample is a composite of many different elements we need a high number of classes. But that also means we need a lot of measurements.
Some general guidelines for the number of classes:
  Square root of n
  Sturges' rule: number of classes = 1 + 1.44 ln(n), which gives a class width of (Xmax – Xmin) / (1 + 1.44 ln(n))
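As a quick check of the two guidelines on the fish example (n = 30, range 11.9 cm), here is a minimal sketch. It uses log2(n), which is about 1.44 ln(n), for Sturges' rule; the variable names are illustrative.

```python
import math

n = 30
data_range = 15.3 - 3.4

k_sqrt = math.sqrt(n)            # square-root guideline
k_sturges = 1 + math.log2(n)     # Sturges' rule; log2(n) is about 1.44 * ln(n)

print(f"sqrt(n): {k_sqrt:.1f} classes")
print(f"Sturges: {k_sturges:.1f} classes, width {data_range / k_sturges:.2f} cm")
```

Both guidelines suggest roughly 5 to 6 classes, and the Sturges class width comes out close to the 2 cm used in the example.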
Measure of central tendency
A value that is supposed to describe the most typical
or central point of the measurements
Measure of central tendency
Arithmetic mean
Median
Mode
[Figure: frequency distribution (number of observations, 0 to 500, against value, 0 to 40) with the positions of the mode, median and mean marked.]
The arithmetic mean
In mathematical notation:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\left(x_1 + x_2 + x_3 + \dots + x_n\right)$
n: the total number of measurements
i: the ith measurement
x_i: the value of the ith measurement

Example
Measurement number (i)   Data set 1 value (x_i)   Data set 2 value (x_i)
1                         40                        40
2                         20                        20
3                         10                        10
4                         30                        30
5                         50                       100
n                          5                         5
Sum                      150                       200
Mean                      30                        40

Note the effect of “outliers” on the mean value.
How well does the mean describe the most typical value?
The median
Median position:
  Sort the measurements in order from the lowest to the highest (ranked).
  Find the median position: (n+1)/2 of the ordered data.
  The median value: the value of the observation in the median position.
Note: if n is an even number, the median is the average of the two central values.
  E.g. 10, 20, 30, 40, 50, 60: median = 35

Example
Measurement number (i)   Ordered position   Data set 1 value (x_i)   Data set 2 value (x_i)
3                        1                   10                        10
2                        2                   20                        20
4                        3                   30                        30
1                        4                   40                        40
5                        5                   50                       100
n                                             5                         5
Median pos.: (n+1)/2                          3                         3
Median value                                 30                        30

Note that the median is not affected by the “outlier”.
The mode
Mode = the value that occurs most often
Not sensitive to outliers
Problem: there may be no mode or many modes
E.g. 10, 20, 30, 30, 30, 40, 50, 60: mode = 30
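The three measures of central tendency can be checked on the example data with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

data_set_1 = [40, 20, 10, 30, 50]
data_set_2 = [40, 20, 10, 30, 100]   # same data, but with one "outlier"

for name, data in (("Data set 1", data_set_1), ("Data set 2", data_set_2)):
    print(name, "mean:", statistics.mean(data), "median:", statistics.median(data))

# Even number of observations: the median is the average of the two central values.
even = [10, 20, 30, 40, 50, 60]
print("median (even n):", statistics.median(even))

# multimode returns every most-frequent value, so it also copes with several modes.
repeated = [10, 20, 30, 30, 30, 40, 50, 60]
print("mode(s):", statistics.multimode(repeated))
```

The outlier moves the mean from 30 to 40 but leaves the median at 30.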


Shapes of distributions
[Figure: three frequency distributions, left skewed (tail to the left), symmetrical and right skewed (tail to the right), each with the mode, median and mean marked.]
Left skewed: Mean < Median < Mode
Symmetrical: Mode = Median = Mean
Right skewed: Mode < Median < Mean
Measure of variability
A value that is supposed to describe the distribution of
the measurements around the central value
Fractiles: General definitions
Range: difference between the maximum and minimum value
  Range = xmax – xmin
  Sensitive to outliers
Quartiles: Q1, Q2 and Q3 divide a data set into four equal parts
  Q1: 25th percentile
  Q2: 50th percentile = Median
  Q3: 75th percentile
  Interquartile range = Q3 – Q1
  Less sensitive to outliers
Percentiles: P1, P2, … P99 divide a data set into 100 equal parts
  Note relationship: Q1 = P25, Q2 = P50 = Median, Q3 = P75
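As an illustration, the range, quartiles and interquartile range of the 30 fish lengths can be computed with `statistics.quantiles`. Note the assumption: software packages interpolate quartiles in slightly different ways, and this sketch simply uses Python's default ("exclusive") method.

```python
import statistics

# Fish lengths (cm) from the frequency distribution example.
lengths = [13.1, 11.7, 9.0, 7.0, 9.9, 5.1, 11.6, 6.4, 8.0, 8.7,
           13.0, 11.6, 8.7, 12.8, 7.5, 12.1, 10.8, 11.5, 10.3, 3.4,
           8.1, 9.4, 5.6, 12.6, 12.4, 3.4, 4.1, 15.3, 7.3, 10.8]

data_range = max(lengths) - min(lengths)

# n=4 asks for the three cut points that split the data into four parts.
q1, q2, q3 = statistics.quantiles(lengths, n=4)
iqr = q3 - q1

print(f"Range: {data_range:.1f} cm")
print(f"Q1 = {q1:.2f}, Q2 (median) = {q2:.2f}, Q3 = {q3:.2f}")
print(f"Interquartile range: {iqr:.2f} cm")
```

Q2 agrees with the median of the data, as the relationship Q2 = P50 = Median requires.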
Fractiles: Box and whisker plots
[Figure: box and whisker plot of the measurement values, with minimum = 4, Q1 (25th percentile) = 100, Q2 (50th percentile, median) = 200, Q3 (75th percentile) = 400 and maximum = 600.]
Range = 600 – 4 = 596
Interquartile range = 400 – 100 = 300
Note
  25% of the observations are ≤ Q1, 50% ≤ Q2, 75% ≤ Q3
  50% of the observations lie between Q1 and Q3
Box and whiskers plots and distributions
[Figure: left skewed, symmetrical and right skewed frequency distributions (mode, median and mean marked), each shown with its box and whisker plot.]
Box and whisker plots give an indication of the central value (here the median), the spread of the data and the shape of the distribution.
Example of a quartile plot
Upper plot: the median catch rate (CPUE) as a function of time.
Lower plot: the median and the interquartile range of the catch rate as a function of time.
What additional information does the lower graph provide?
Example of a percentile plot
[Figure: proportion of observations less than a value (0.00 to 1.00) against measurement value (0 to 45), with P90 marked.]
P90: 90% of the observations have values less than 19
Deviations from the mean 1
In mathematical notation:
$\text{Deviation}_i = X_i - \bar{X}$
i: the ith measurement
n: the total number of measurements
X_i: the value of the ith measurement

Example
Measurement number (i)   Measurement value (X_i)   Deviation_i
1                         40                         10
2                         20                        -10
3                         10                        -20
4                         30                          0
5                         50                         20
n                          5                          5
Sum                      150                          0
Mean                      30                          0
Deviation from the mean 2
Deviation from the mean: $\text{Deviation}_i = X_i - \bar{X}$
[Figure: the five observations (values 10 to 50) plotted by measurement number, with the mean (30) and each observation's deviation from it marked.]
How can we characterize the average deviation?
The plain average of the deviations is always zero.
Variance & standard deviation 1
The variance
Whole population: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Sample from population: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
Standard deviation: square root of the variance
Whole population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Sample from population: $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
$X_i$: ith measurement of the variable X
$\bar{X}$: sample mean
$\mu$: population mean
$\sigma$: population std. deviation
$S$: sample std. deviation
N: population size
n: sample size
Variance and standard deviation 2
Measurement number (i)   Measurement value (X_i)   Deviation $X_i - \bar{X}$   Squared deviation $(X_i - \bar{X})^2$
1                         40                         10                         100
2                         20                        -10                         100
3                         10                        -20                         400
4                         30                          0                           0
5                         50                         20                         400

n                                                                     5
Mean                          $\bar{X}$                               30
Sum of squares                $SS = \sum_i (X_i - \bar{X})^2$         1000
Variance                      $S^2 = SS / (n-1)$                      250
Standard deviation            $S = \sqrt{S^2}$                        15.8
Relative standard deviation   $CV = S / \bar{X}$                      0.53

[Figure: the five observations plotted by measurement number, with the mean and each observation's deviation marked.]
Do you think that the value of 15.8 is a reasonable measure of the average deviation in the data?
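The worked example can be reproduced step by step. The sketch below follows the sample formulas (division by n - 1); the variable names are illustrative.

```python
import math

values = [40, 20, 10, 30, 50]
n = len(values)

mean = sum(values) / n
deviations = [x - mean for x in values]     # these always sum to zero
ss = sum(d ** 2 for d in deviations)        # sum of squares
variance = ss / (n - 1)                     # sample variance
sd = math.sqrt(variance)                    # sample standard deviation
cv = sd / mean                              # relative standard deviation

print("mean:", mean)
print("sum of deviations:", sum(deviations))
print("SS:", ss, "variance:", variance)
print(f"standard deviation: {sd:.1f}, CV: {cv:.2f}")
```

Squaring the deviations is what stops the positive and negative deviations from cancelling out.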
Variance and standard deviation 3
[Figure: the squared deviations $(X_i - \bar{X})^2$ of the five observations drawn as squares against value (X_i) and added up: 100 + 100 + 400 + 0 + 400 = 1000.]
$SS = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 1000$
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = 250$
$S = \sqrt{S^2} = 15.8$
Coefficient of variation
$CV_P = \frac{S}{\bar{X}}$ or $CV_\% = 100 \cdot \frac{S}{\bar{X}}$
Measures relative variation
  Always a percentage (%) or a proportion
  CV = “Relative standard deviation”
  Can be higher than 100%
  Can be used to compare two or more sets of data

        Data set 1   Data set 2   Data set 3
Xbar    50           50           50
s       5            10           20
CVP     0.10         0.20         0.40
CV%     10%          20%          40%

        Data set 4   Data set 5   Data set 6
Xbar    50           100          1000
s       10           20           200
CVP     0.20         0.20         0.20
CV%     20%          20%          20%
The normal distribution
The normal distributions are a very important class of
statistical distributions. All normal distributions are
symmetric and have bell-shaped density curves with a
single peak.
Common distribution of measurements 1
[Figure: frequency distribution of number of fish (0 to 450) against length (20 to 80 mm), n = 7073.]
Example: 7073 length measurements of Icelandic cod larvae taken in August 2002.
Since we have many fish we can use a length bin of 1 mm to generate a frequency distribution.
Most fish fall within a fairly narrow size range.
The number of fish of a given length decreases the further one goes from the centre of the distribution.
The distribution is close to symmetrical.
Common distribution of measurements 2
Let's make a rough eyeball drawing through the points.
[Figure: two panels of number of fish (0 to 450) against length (20 to 80 mm): the observed frequency distribution, and the same points with a smooth red line drawn through them.]
Can we describe this red line mathematically?
Normal distribution
$n_{L_i} = n \cdot dL_i \cdot \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$
$n_{L_i}$: number in length class $L_i$
$dL_i$: width of length interval
[Figure: numbers (0 to 500) against length (20 to 80 mm) with the fitted normal curve.]
The normal distribution
$\text{pdf} = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$
pdf: probability density function
X_i: the measured variable (here length of fish)
Xbar: the mean
s: the standard deviation
The model that describes the normal distribution is complex at first sight …
What matters?
What parameters are in the equation?
$\text{pdf} = \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$
Xbar is the sample mean
s is the standard deviation
The rest (2, π, e) are constants
The normal distribution is only “controlled” by Xbar and s, often written as: $\text{pdf} = f(\bar{X}, s)$
In words we say that the normal distribution is a function of Xbar and s.
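To make the formula concrete, here is a minimal sketch of the pdf and of the expected number of fish in a length class (n times the class width times the pdf). The function names are illustrative, and the values n = 7073 and a 1 mm class width are taken from the larvae example.

```python
import math

def normal_pdf(x, mean, s):
    """Normal probability density at x for a given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def expected_count(x, n, bin_width, mean, s):
    """Expected number of observations in a class of width bin_width centred on x."""
    return n * bin_width * normal_pdf(x, mean, s)

n, bin_width = 7073, 1.0
# Height of the curve at the mean for three different standard deviations.
for s in (5, 10, 20):
    print(f"s = {s:>2}: {expected_count(50, n, bin_width, 50, s):.1f} fish at 50 mm")
```

Only the mean and s change the result; a smaller s gives a taller, narrower curve because the same n fish are packed into fewer length classes.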
pdf = f(Xbar, s): keep the mean (Xbar) = 50, change s
$n_{L_i} = n \cdot dL_i \cdot \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$
[Figure: number of fish (0 to 600) against length (20 to 80) for normal curves with s = 5, s = 10 and s = 20.]
The central position (Xbar) remains the same. The higher the value of s, the greater the spread of the curve.
Q: Is the mean on its own a useful measure?
pdf = f(Xbar, s): keep s = 10, change Xbar
$n_{L_i} = n \cdot dL_i \cdot \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$
[Figure: number of fish (0 to 300) against length (20 to 80) for normal curves with Xbar = 40, Xbar = 50 and Xbar = 60.]
The shape of the curve remains the same. The mean (Xbar) describes the central location on the x-axis.
What line describes the data distribution best?
$n_{L_i} = n \cdot dL_i \cdot \frac{1}{s\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{X_i - \bar{X}}{s}\right)^2}$
Assume the distribution is normal: find the values of Xbar and s which best describe the data.
[Figure: observed number of fish (0 to 600) against length (20 to 80) with normal curves for Xbar = 50 and s = 5, s = 10 and s = 20.]
Answer: Xbar = 50, s = 10
In 2002, 7073 larvae were measured.
The mean was 50 mm and the standard deviation 10 mm.
[Figure: number of fish (0 to 500) against length (20 to 80 mm) with the fitted normal curve; the standard deviation s is marked on either side of the mean.]
Can we say anything about probabilities?
Probability = likelihood ≈ relative frequency
In presentations of data analysis we often have statements like:
  We expect that 95% of the population is within a certain specified range of the data distribution.
  E.g. given the sample that I have, I expect that 95% of the fish population is between 30 and 70 mm.
  This is sometimes written as: 50 ± 20 mm
How can we say this?
Why do we say this?
Although there are many normal curves, they all share an important property that allows us to treat them in a uniform fashion.
The 68-95-99.7% Rule
All normal density curves satisfy the following property, which is often referred to as the Empirical Rule.
68% of the observations fall within 1 standard deviation of the mean, that is, between $\bar{X} - s$ and $\bar{X} + s$.
95% of the observations fall within 2 standard deviations of the mean, that is, between $\bar{X} - 2s$ and $\bar{X} + 2s$.
99.7% of the observations fall within 3 standard deviations of the mean, that is, between $\bar{X} - 3s$ and $\bar{X} + 3s$.
Thus, for a normal distribution, almost all values lie within 3 standard deviations of the mean.
Note that these values are approximations.
For example, according to the normal probability density function, 95% of the data fall within 1.96 standard deviations of the mean.
Using 2 standard deviations is a convenient approximation.
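The percentages in the rule can be checked against the standard normal distribution, for instance with `statistics.NormalDist` from the Python standard library. This is a sketch; the helper name `within` is made up for the illustration.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

def within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2, 3):
    print(f"within ±{k} s: {100 * within(k):.2f}%")

# The multiplier that captures exactly 95%: 2.5% is left in each tail.
print(f"95% multiplier: {z.inv_cdf(0.975):.2f}")
```

This shows why 2 is only a convenient stand-in for 1.96: two standard deviations actually cover slightly more than 95%.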
What does 1.96 standard deviations mean?
In 2002, 7073 larvae were measured.
The mean was 50 mm and the standard deviation 10 mm.
95% of all the measurements (6719 larvae) fall within 1.96 standard deviations of the mean (about 30-70 mm), given that the data follow a normal distribution.
[Figure: number of fish (0 to 500) against length (20 to 80 mm) with the fitted normal curve and the range ±2s around the mean marked.]
But what does 1 standard deviation mean?
In 2002, 7073 larvae were measured.
The mean was 50 mm and the standard deviation 10 mm.
68% of all the measurements (4810 larvae) fall within 1 standard deviation of the mean (40-60 mm), given that the data follow a normal distribution.
[Figure: number of fish (0 to 500) against length (20 to 80 mm) with the fitted normal curve and the range ±1s around the mean marked.]
The Z score
In statistics the Z score is defined as:
$Z = \frac{X_i - \bar{X}}{s} = \frac{\text{value of measurement } i - \text{mean}}{\text{standard deviation}} = \frac{\text{deviation of measurement } i}{\text{standard deviation}}$
Hmm ..., have we seen this formula before?
The meaning of the Z-score
The Z-score standardizes the deviation of a measurement from the mean relative to the standard deviation.
The Z-score value is a multiplier, indicating how many standard deviations a particular measurement is from the mean.
$Z = \frac{X_i - \bar{X}}{s}$

Xbar = 50, s = 10
i   Value (Xi)   Z-score
1   20           -3.00
2   30           -2.00
3   40           -1.00
4   50            0.00
5   60            1.00
6   70            2.00
7   80            3.00

Xbar = 50, s = 20
i   Value (Xi)   Z-score
1   20           -1.50
2   30           -1.00
3   40           -0.50
4   50            0.00
5   60            0.50
6   70            1.00
7   80            1.50
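The Z-score tables can be reproduced with a one-line function. This is a sketch only; the function name is illustrative.

```python
def z_scores(values, mean, s):
    """Number of standard deviations each value lies from the mean."""
    return [(x - mean) / s for x in values]

values = [20, 30, 40, 50, 60, 70, 80]

print("s = 10:", z_scores(values, mean=50, s=10))
print("s = 20:", z_scores(values, mean=50, s=20))
```

The same measurements get smaller Z-scores when s is larger, because each one is then fewer standard deviations away from the mean.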
The Z scores of our data
[Figure: two panels of the frequency distribution of the larvae (n, 0 to 500): against length (20 to 80), with the ranges 1s and 2s marked, and against Z score (-3 to 3).]
Cumulative relative distribution of Z scores
[Figure: cumulative proportion of fish less than a given length (0.0 to 1.0) against Z score (-3 to 3); Z = -1 is marked at the 16th percentile and Z = +1 at the 84th percentile, with 68% between them.]
The graph shows that -1 s is the 16th percentile and +1 s the 84th percentile. Thus 84 - 16 = 68% of the data lie within ± 1 s of the mean.
Cumulative relative distribution of Z scores
[Figure: cumulative proportion less than a Z score (0.0 to 1.0) against Z score (-3 to 3).]
The shape of this graph and the values of Z and pdf are the same for any normally distributed data, irrespective of the number of measurements (n) and the value of the mean and standard deviation.
If we have a mean and a standard deviation from a sample and we assume that the data are normally distributed, we can say what the probability is that the next measurement we take is less than a certain Z value.
E.g. Xbar = 100 mm, s = 20 mm. How likely is it that the next measurement that is sampled is less than:
60: Z score = (60-100)/20 = -2, about 2.5% probability
120: Z score = (120-100)/20 = +1, about 84% probability
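A minimal sketch of this calculation, assuming normally distributed data with Xbar = 100 mm and s = 20 mm and using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

mean, s = 100, 20
dist = NormalDist(mean, s)

for x in (60, 120):
    z = (x - mean) / s
    p_below = dist.cdf(x)     # probability that a measurement is less than x
    print(f"x = {x} mm: Z = {z:+.1f}, P(X < x) = {100 * p_below:.1f}%")
```

With these numbers a measurement of 60 mm lies two standard deviations below the mean, and 120 mm lies one standard deviation above it.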
Standard error (standard deviations of the means)
Standard error (or standard deviation of the mean): an estimate of the standard deviation of the means.
$SE = S_{\bar{x}} = \frac{S}{\sqrt{n}}$
We are effectively using the present sample to estimate what the likely distribution of the means would be if we were to take repeated samples from the population.
The standard error is thus a value that can be used to estimate the confidence interval of the parametric (population) mean from the sample mean, given the distribution of the data.
We assume that the means are normally distributed.
Note: Standard deviation: an estimate of the dispersion of the individual observations around the mean of a sample.
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
Confidence intervals
The 95% confidence limits (CL) of a population mean, given the sample mean and standard error, can be calculated as follows:
n ≥ 30: $\bar{x} \pm CL = \bar{x} \pm Z_{95\%} \frac{S}{\sqrt{n}}$, with $Z_{95\%} = 1.96$
n < 30: $\bar{x} \pm CL = \bar{x} \pm t_{95\%,\,n-1} \frac{S}{\sqrt{n}}$
$t_{n-1}$: fractiles (here 95%) of the Student t-distribution with n-1 degrees of freedom
The t-distribution is similar to the normal distribution, but it is wider for small samples and varies with sample size when n is less than about 30.
When n > 30, t ≈ 1.96 (≈ 2) for the 95% confidence interval.
Example
From our measurement of 0-group larvae we have:
$\bar{x} = 49.8$ mm
$s = 10.1$ mm
$n = 7073$
To calculate the 95% confidence interval of the mean we have t = 1.96.
$49.8 - 1.96 \frac{10.1}{\sqrt{7073}} \le \mu \le 49.8 + 1.96 \frac{10.1}{\sqrt{7073}}$
$49.56 \le \mu \le 50.04$
There is thus a 95% probability that the interval contains the true population mean value.
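The arithmetic of this example can be sketched as follows, using the multiplier 1.96 because n is large. The variable names are illustrative.

```python
import math

x_bar, s, n = 49.8, 10.1, 7073   # sample mean (mm), standard deviation (mm), sample size
z = 1.96                         # 95% multiplier for large n

se = s / math.sqrt(n)            # standard error of the mean
lower = x_bar - z * se
upper = x_bar + z * se

print(f"SE = {se:.3f} mm")
print(f"95% CI: {lower:.2f} to {upper:.2f} mm")
```

The interval is narrow even though s is about 10 mm, because the standard error shrinks with the square root of the sample size.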
How many samples should be collected?
Suppose that we require that the estimated mean landings from samples should not deviate more than 7% (maximum relative error) from the true landings and that we want to be 95% certain of this.
The maximum relative error of the mean can be calculated from:
$\varepsilon_{\max} = t_{n-1,\,0.05} \frac{CV}{\sqrt{n}}$, where $CV = 100 \cdot \frac{s}{\bar{X}}$
Increasing the sample size (n) lowers the maximum relative error.
A higher CV (standard deviation relative to the mean) results in a higher relative error for a given sample size.
Graphically we have:
Sample size and max. relative error at 95% level
[Figure: maximum relative error (0% to 20%) against sample size n (0 to 60) for CV = 10% and CV = 20%.]
Question: How many samples are needed in order to be 95% sure that the estimated mean from the samples does not deviate more than 7% from the true mean?
Sample size and relative error at 95% level
[Figure: % deviation from the "true value" (0% to 20%) against sample size n (0 to 60) for CV = 10% and CV = 20%.]
Answer: It depends on your CV.
  If CV is 10%, about 10 samples are needed to achieve the required precision.
  If CV is 20%, about 35 samples are needed to achieve the required precision.
Note:
  Increasing the number of samples (for any given CV) does not proportionally increase the precision of the estimate; the cost gets disproportionately higher the closer one gets to the “true value”.
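The search for the smallest adequate sample size can also be done numerically instead of reading it off the graph. The sketch below is one possible approach, not the method used to draw the figure. It assumes the maximum relative error is t times CV divided by the square root of n, with a two-sided 5% t value on n - 1 degrees of freedom, and it gets the t value by numerically integrating the t density so that only the standard library is needed. All function names are made up for the illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=200):
    """P(T <= t) for t >= 0: 0.5 plus Simpson's rule applied on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

def t_critical(df, level=0.95):
    """Two-sided critical t value, found by bisection on the CDF."""
    target = 1 - (1 - level) / 2      # 0.975 for the 95% level
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def max_relative_error(n, cv_percent, level=0.95):
    """Maximum relative error (%) of the mean for sample size n."""
    return t_critical(n - 1, level) * cv_percent / math.sqrt(n)

def samples_needed(cv_percent, target_percent, level=0.95, n_max=500):
    """Smallest n whose maximum relative error is at or below the target."""
    for n in range(2, n_max + 1):
        if max_relative_error(n, cv_percent, level) <= target_percent:
            return n
    return None

for cv in (10, 20):
    print(f"CV = {cv}%: n = {samples_needed(cv, 7)}")
```

Because this searches whole numbers of samples exactly, its answers can differ by a sample or two from values read off the graph. Doubling the CV roughly triples the number of samples needed here, which illustrates the note about disproportionate cost.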
Final remark
The introduction to statistical analysis given here is only a very brief overview, covering frequency distributions and measures of dispersion, mostly for normally distributed data.
A simple frequency plot is in essence a probability plot.
Graphical analysis and display of data and models can increase the understanding of the concepts behind them.
Further suggested readings:


Haddon 2001, Chapter 3
Larson and Farber, Elementary Statistics