Distributions and Confidence Intervals

Random variables
Any characteristic that can be measured or categorised is called a
variable
If a variable can assume a number of different values, such that
any value is determined by chance, it is called a random
variable
Examples of random variables:
The number of boys in a 2-child family
The age of a randomly chosen patient at a clinic
The smoking status of a randomly chosen member of this class
Probability Distribution
The probability distribution is the breakdown of the total
probability i.e. a description of
all the possibilities and their probabilities
Where the number of possibilities is not too many, we can
easily find the probability distribution.
Example: The number of boys in a 2-child family
The possibilities are 0, 1 or 2
Probability (0 boys) = .25
Probability (1 boy) = .50
Probability (2 boys) = .25
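The table above can be reproduced by listing every possible family and counting boys. This is a minimal sketch that assumes each child is equally likely to be a boy or a girl, independently of the other child; the slides use that assumption implicitly but do not state it.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Assumption (not stated in the slides): each child is a boy ("B") or a
# girl ("G") with probability 1/2, independently of the other child.
outcomes = list(product("BG", repeat=2))  # (B,B), (B,G), (G,B), (G,G)

# Count how many of the equally likely outcomes give 0, 1 or 2 boys.
counts = Counter(birth_order.count("B") for birth_order in outcomes)

# Probability of each possibility = favourable outcomes / all outcomes.
distribution = {
    boys: Fraction(count, len(outcomes)) for boys, count in sorted(counts.items())
}

for boys, prob in distribution.items():
    print(f"Probability ({boys} boys) = {float(prob):.2f}")

print("Total probability =", sum(distribution.values()))
```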
When there is a whole range of possibilities (e.g. weight, height,
cholesterol) we cannot itemise all of them. How can
we describe the probability distribution?
Remember:
the probability of an event is the proportion of the time this event
would occur in the long run.
Total probability =1 (100%)
This is the probability distribution
Probability distribution of a continuous random variable
Example: If we select a person at random from a population, the probability that his/her height is between 170 and 180cm = the proportion of the population with height between 170 and 180cm
So if we can find a way to describe the proportion of the population in any specified range, we will have described the probability distribution

Relative frequency histogram (histogram of proportions)
[Figure: histogram of ALT (log scale) in a group of Irish Hepatitis C patients; x-axis: logalt, y-axis: Fraction]
• each bar area = proportion of the sample with a value in that range
• the total shaded area = 1.
• The sample histogram is an estimate of the population distribution, which is represented by a smooth curve.

Probability distribution of a continuous random variable…
[Figure: distribution of logalt; x-axis: logalt, y-axis: Fraction]
The probability distribution is a curve where:
• The total area = 1.
• The area between any two values = proportion of the population with a value in that range
So, the curve shows how the total probability (=1) is divided.

When the sample histogram is approximately normal, we infer that the population distribution is exactly (or almost!) normal.
For any normal distribution:
(a) The centre is at the population mean, denoted by μ
(b) The variation of the population is given by the standard deviation (SD), denoted by σ
(c) The total enclosed area = 1 (100%)
(d) The area between any two values equals the proportion of the population in that range.
(e) The area between μ ± 1.96σ is 95%
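Properties (d) and (e) can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the height population (mean 175 cm, SD 7 cm) is a hypothetical choice made only for illustration and does not come from the slides.

```python
from statistics import NormalDist

# Hypothetical population of heights (cm); mu and sigma are illustrative only.
mu, sigma = 175.0, 7.0
heights = NormalDist(mu=mu, sigma=sigma)


def area_between(dist, low, high):
    """Proportion of the population with a value between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# (d) Area between two values = proportion of the population in that range.
p_170_180 = area_between(heights, 170, 180)

# (e) Area between mu - 1.96*sigma and mu + 1.96*sigma.
p_within_196 = area_between(heights, mu - 1.96 * sigma, mu + 1.96 * sigma)

print(f"P(170 < height < 180) = {p_170_180:.3f}")
print(f"Area within mu ± 1.96 sigma = {p_within_196:.4f}")
```

Changing `mu` and `sigma` alters the first result but not the second, which is the point of property (e).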
Exercise 1: Use Normal Tables to find……
1. The proportion of a normal population that lies within
1 SD either side of the mean
2. How many SDs each side of the mean one must go
to include exactly 90% of the population
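Answers read from Normal Tables can be cross-checked in software. This is one possible check using the standard normal distribution from Python's `statistics` module; it is offered as a sketch and is not part of the original exercise.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# 1. Proportion within 1 SD either side of the mean.
within_one_sd = z.cdf(1) - z.cdf(-1)

# 2. Number of SDs each side of the mean that enclose 90%:
#    5% is left in each tail, so look up the 95th percentile.
sds_for_90 = z.inv_cdf(0.95)

print(f"Within 1 SD of the mean: {within_one_sd:.4f}")
print(f"SDs each side for 90%:   {sds_for_90:.3f}")
```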
Reference Ranges (“Normal Ranges”)
• Diagnostic test results (especially clinical chemistry) are usually classified as normal (disease free) or abnormal (diseased) based on a cut-off value.
• Tests often have a range of values (i.e. an upper and lower limit) specified which should include the majority of the normal (i.e. healthy) population. This is called the
  • Reference interval
  • Normal Range
  • Reference Range
Values outside this interval are considered abnormal
How to Construct Reference Intervals
If we know that the biochemical marker being measured is normally distributed in the healthy population, then we can say that
95% of the healthy population have values between μ ± 1.96σ
Or for higher specificity:
99% of the healthy population have values between μ ± 2.58σ
Of course, we never know μ and σ, as we never have a value for everyone in the healthy population
In practice, we estimate the reference range by using the mean and SD from a sample instead of the true population values μ and σ

Notation
We always use
μ to denote the mean value of a continuous measurement in the population,
σ to denote the standard deviation.
These quantities are called parameters
In practice, we rarely know μ and σ, but estimate them by the statistics
sample mean:
$m = \frac{1}{n} \sum_{i=1}^{n} x_i$
sample standard deviation:
$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - m)^2}$
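The two sample statistics can be computed directly from their definitions. In this sketch the data values are hypothetical, chosen only to keep the arithmetic easy to follow; the standard-library functions `statistics.mean` and `statistics.stdev` are used as a cross-check.

```python
import math
import statistics

# Hypothetical measurements, for illustration only.
x = [4.1, 3.8, 5.0, 4.4, 4.7]
n = len(x)

# Sample mean: m = (1/n) * sum of x_i
m = sum(x) / n

# Sample standard deviation: divide the sum of squared deviations
# from m by n - 1 (not n), then take the square root.
s = math.sqrt(sum((xi - m) ** 2 for xi in x) / (n - 1))

# The standard library gives the same values.
assert math.isclose(m, statistics.mean(x))
assert math.isclose(s, statistics.stdev(x))

print(f"m = {m:.2f}, s = {s:.4f}")
```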
To construct a reference range:
we need…
• A representative sample of a reasonable size from the healthy population
• To check that the variable is normally distributed (use histogram): for many serum constituents, the log transform is normally distributed
• If we are satisfied that we have a normal distribution, then we proceed to calculate:
  sample mean (m)
  sample standard deviation (s)
Our approximate 95% reference range is then:
m ± 1.96*s

Example of constructing a reference range
30 healthy male hospital staff have level of AAP (alanine aminopeptidase) measured, giving a mean of 1.05 and standard deviation = .32.
Assuming a normal distribution for AAP (should check!) we would expect approx. 95% of healthy males to have AAP between
1.05 ± 1.96 (.32)
= .42 to 1.68
⇒ a value higher than 1.68 may suggest diabetes.
Criticisms:
• 30 is not such a big sample
• Hospital staff may not be representative of the healthy population
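The AAP calculation is a direct evaluation of m ± 1.96 s. The sketch below wraps it in a small helper; the function name `reference_range` is hypothetical, and the multiplier 2.58 for a 99% range is the one quoted earlier in these notes.

```python
def reference_range(m, s, z=1.96):
    """Approximate reference range m ± z*s, assuming a normal distribution."""
    return m - z * s, m + z * s


# AAP example from the text: sample mean 1.05, sample SD 0.32.
low, high = reference_range(1.05, 0.32)
print(f"Approx. 95% reference range: {low:.2f} to {high:.2f}")

# Wider range for higher specificity (99%).
low99, high99 = reference_range(1.05, 0.32, z=2.58)
print(f"Approx. 99% reference range: {low99:.2f} to {high99:.2f}")
```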
Reference Intervals: some cautionary remarks
• Must consider if sample is representative of our population (e.g. kits manufactured in different country??)
• Have age, sex and other differences been considered?
• Sometimes a one-sided cut-off is of interest, and sometimes an interval (i.e. we use one tail vs two tails of the normal distribution)

Estimating a population mean
• Suppose we are interested in estimating the average height of Swedish citizens.
• We select a number of Swedish citizens and measure their height.
• What is our best guess of the average height? The mean of our sample: m
• But you know that this might have been different in another sample, so how can you quantify the uncertainty this creates? Create a confidence interval!
Confidence Intervals
• Statistical theory shows that 95% of the time the sample mean will fall within $\pm 1.96 \frac{\sigma}{\sqrt{n}}$ of the true population mean μ
• So, if you take your sample mean m and $\pm 1.96 \frac{\sigma}{\sqrt{n}}$, you have a 95% chance of “capturing” μ in this interval (called a 95% Confidence Interval)
• But we rarely know σ, so how can we find a 95% CI?
• Use s to estimate σ (approx 95% CI, OK if sample is large, > 30)
$\left( m - 1.96 \frac{s}{\sqrt{n}},\ m + 1.96 \frac{s}{\sqrt{n}} \right)$

Estimating a population proportion
• Suppose now we are interested in estimating the proportion of Swedish citizens that are taller than 180 cm.
• Our best guess is:
p = number of individuals > 180 cm / number of individuals in sample
• How do we quantify the uncertainty in this estimate?
• 95% confidence interval is given by:
$\left( p - 1.96 \sqrt{\frac{p(1-p)}{n}},\ p + 1.96 \sqrt{\frac{p(1-p)}{n}} \right)$
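Both intervals have the same shape: estimate ± 1.96 × a measure of sampling variability. The sketch below evaluates them for a hypothetical sample of 100 people (mean height 178 cm, SD 7 cm, 40 of them taller than 180 cm); these numbers and the function names are invented for illustration.

```python
import math


def mean_ci(m, s, n, z=1.96):
    """Approximate 95% CI for a population mean, using s in place of sigma."""
    half_width = z * s / math.sqrt(n)
    return m - half_width, m + half_width


def proportion_ci(p, n, z=1.96):
    """Approximate 95% CI for a population proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Hypothetical sample, for illustration only.
n = 100
mean_low, mean_high = mean_ci(m=178.0, s=7.0, n=n)
prop_low, prop_high = proportion_ci(p=40 / n, n=n)

print(f"95% CI for mean height: {mean_low:.1f} to {mean_high:.1f} cm")
print(f"95% CI for proportion > 180 cm: {prop_low:.3f} to {prop_high:.3f}")
```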
Standard Error (SE)
$\frac{s}{\sqrt{n}}$ is called the “standard error” of the sample mean
$\sqrt{\frac{p(1-p)}{n}}$ is called the “standard error” of the sample proportion
95% Confidence Interval ≅ point estimate ± 2 standard errors
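The "95% chance of capturing μ" claim can be explored by simulation: draw many samples from a population whose mean is known, build an interval from each, and count how often the interval contains the true mean. The population values, sample size and seed below are arbitrary assumptions made for this sketch.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA = 175.0, 7.0  # hypothetical "true" population values
N, REPEATS = 50, 2000   # sample size and number of simulated samples

captured = 0
for _ in range(REPEATS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)  # standard error of the mean
    # Does the interval m ± 1.96 SE contain the true mean?
    if m - 1.96 * se <= MU <= m + 1.96 * se:
        captured += 1

coverage = captured / REPEATS
print(f"Intervals capturing mu: {coverage:.1%}")
```

The proportion should land close to 95%. It tends to fall slightly below because s is used in place of σ, which is why the notes call the interval approximate.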
Interpreting confidence intervals:
(see course pack, page )
• We are 95% confident that the overall incidence rate in M13-M21 is between 1.59% and 3.05%
• If someone claims overall rate is “more than 1%”, we would accept
• If they claim that overall rate is 3%, we would accept
• If they claim that overall rate is 5%, we would reject
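The three decisions follow one rule: a claim is accepted when it is compatible with the interval. The sketch below spells out that reading of the slide; the function names are hypothetical and the rule is an interpretation of the bullets, not a formal test stated in the notes.

```python
# 95% CI for the overall incidence rate (%), as quoted in the text.
low, high = 1.59, 3.05


def accept_exact_claim(claimed, low, high):
    """Accept a claimed rate if it lies inside the confidence interval."""
    return low <= claimed <= high


def accept_more_than_claim(threshold, low):
    """Accept a 'more than threshold' claim if the whole interval lies above it."""
    return low > threshold


print("More than 1%:", accept_more_than_claim(1.0, low))
print("Exactly 3%:  ", accept_exact_claim(3.0, low, high))
print("Exactly 5%:  ", accept_exact_claim(5.0, low, high))
```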