3. Probability theory and distributions: Normal, binomial, Poisson
Section III
Gaussian distribution
Probability distributions
(Binomial, Poisson)
Notation
Statistic            Sample     Population
mean                 Y          μ
Std deviation        S or SD    σ
proportion           P          π
mean difference      d          δ
Correlation coeff    r          ρ
rate (regression)    b          β
Num of obs           n          N
Densities – Percentiles
BMI=22 is the 88th percentile
[Figure: density of x=BMI from 14 to 27 (vertical axis 0% to 25%); 88% of the area lies below BMI=22]
Standard Z scores
Definition:
Z = (Y – mean)/ SD
Y = mean + Z SD
Z is how many SD units Y is above or below
mean.
Mean & SD might be sample (Y, S) or
population (μ,σ) values if population
values are known.
Survival data, mean=17.54, SD=11.68
Y     Y - mean    Z = (Y - mean)/SD
4     -13.54      -1.16
6     -11.54      -0.99
8     -9.54       -0.82
8     -9.54       -0.82
12    -5.54       -0.47
14    -3.54       -0.30
15    -2.54       -0.22
17    -0.54       -0.05
19    1.46        0.13
22    4.46        0.38
24    6.46        0.55
34    16.46       1.41
45    27.46       2.35
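The Z score table can be reproduced with a short Python sketch. It assumes the 13 survival values listed in the table and the sample SD (n-1 denominator); the variable names are illustrative only.

```python
from statistics import mean, stdev

# Survival values taken from the table.
survival = [4, 6, 8, 8, 12, 14, 15, 17, 19, 22, 24, 34, 45]

y_bar = mean(survival)   # sample mean
s = stdev(survival)      # sample SD (n-1 denominator)

# Z = (Y - mean) / SD for each observation
z_scores = [(y - y_bar) / s for y in survival]

# Print Y, Y - mean, and Z in three columns.
for y, z in zip(survival, z_scores):
    print(f"{y:3d} {y - y_bar:8.2f} {z:7.2f}")
```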
Standard Gaussian (Normal)
a distribution model
[Figure: standard Gaussian density, μ=0, σ=1, Z from -3.5 to 3.5. About 34% of the area lies between Z=-1 and 0, 34% between 0 and 1, and 16% in each tail beyond Z=±1.]
Selected Gaussian percentiles
Z         lower area (P < Z)
-2.00     2.28%
-1.96     2.50%
-1.50     6.68%
-1.00     15.87%
0.00      50.00%
1.00      84.13%
1.50      93.32%
1.96      97.50%
2.00      97.72%
Gaussian percentiles
[Figure: standard Gaussian density, μ=0, σ=1, Z from -3.5 to 3.5; the area below Z=1.5 is 0.933=93.3%]
EXCEL function =NORMSDIST(Z) gives percentile from Z.
EXCEL function =NORMSINV(p) gives Z from the percentile.
Example- SAT Verbal
Mean=μ=500, SD=σ=100
What is your percentile if Y=700?
Z= (700-500)/100=2.0, area=0.977=97.7%
What score is the 80th percentile?
Z0.80=0.842
Y = 500 + 0.842 (100) = 584
What percent are between 450 and 500?
For Y=450, Z=(450-500)/100=-.5, area=0.3085
For Y=500, Z=0, area=0.5000,
so area between is 0.500-0.3085=0.1915=19%
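As a cross-check on the SAT arithmetic, here is a minimal Python sketch using the standard library's NormalDist in place of the EXCEL functions. It assumes SAT Verbal scores are exactly Gaussian with μ=500 and σ=100; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model: SAT Verbal ~ Gaussian(mean=500, SD=100)
sat = NormalDist(mu=500, sigma=100)

pct_700 = sat.cdf(700)                 # percentile of Y=700 (like NORMSDIST on Z)
score_80 = sat.inv_cdf(0.80)           # score at the 80th percentile (like NORMSINV)
between = sat.cdf(500) - sat.cdf(450)  # fraction scoring between 450 and 500

print(f"percentile of 700: {pct_700:.3f}")
print(f"80th percentile score: {score_80:.0f}")
print(f"fraction between 450 and 500: {between:.4f}")
```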
Example- Anesthesia
Effective dose, μ=50 mg/kg, σ=10 mg/kg
Lethal dose, μ=110 mg/kg, σ=20 mg/kg
Q1: What dose will put 90% to sleep?
Z0.90=1.28, Y=50+1.28 (10) = 62.8 mg/kg
Q2: What is the risk of death from this dose?
Z=(62.8-110)/20= -2.36, area < 1%
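The two anesthesia questions can be sketched in Python as follows. This assumes both the effective dose and the lethal dose are Gaussian with the stated means and SDs; the variable names are illustrative.

```python
from statistics import NormalDist

effective = NormalDist(mu=50, sigma=10)   # effective dose, mg/kg
lethal = NormalDist(mu=110, sigma=20)     # lethal dose, mg/kg

# Q1: dose at the 90th percentile of the effective-dose distribution
dose_90 = effective.inv_cdf(0.90)

# Q2: fraction of patients whose lethal dose is below that dose
risk = lethal.cdf(dose_90)

print(f"dose for 90% asleep: {dose_90:.1f} mg/kg")
print(f"risk of death at that dose: {risk:.2%}")
```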
Prediction intervals (not CI)
If μ and σ are known and the data are
known to have a Gaussian distribution,
the interval formed by
(μ-Zσ, μ+Zσ)
is the (2k-100)% prediction interval, where
Z (Z>0) is the kth percentile of the standard Gaussian.
Z=2 is about the 97.7th percentile, so (μ-2σ, μ+2σ) is (approximately) the
95% prediction interval.
Implies SD ≈ range/4 (extremes excluded)
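A small Python sketch of the relation between Z and the coverage of the prediction interval, assuming Gaussian data with known μ and σ. The helper names coverage and prediction_interval are hypothetical, chosen for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard Gaussian, mu=0, sigma=1

def coverage(z):
    """Fraction of a Gaussian population inside (mu - z*sigma, mu + z*sigma)."""
    # If z is the kth percentile (as a fraction p), coverage is 2p - 1.
    return 2 * std_normal.cdf(z) - 1

def prediction_interval(mu, sigma, z):
    """Return the interval (mu - z*sigma, mu + z*sigma)."""
    return (mu - z * sigma, mu + z * sigma)

print(f"Z=2.00 covers {coverage(2.0):.4f}")
print(f"Z=1.96 covers {coverage(1.96):.4f}")
print(prediction_interval(500, 100, 2))
```

Z=2 gives slightly more than 95% coverage, which is why the 95% figure is approximate.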
Normal dist-differences & sums
If Y1 and Y2 are independent and each has a normal
distribution with mean and SD as below
variable    mean    SD
Y1          µ1      σ1
Y2          µ2      σ2
Then the difference & sum have normal dists.
                mean      SD
diff=Y1-Y2      µ1-µ2     sqrt(σ1^2 + σ2^2)
sum=Y1+Y2       µ1+µ2     sqrt(σ1^2 + σ2^2)
Q: If σ1=σ2, what is the mean diff with 100% overlap?
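To make the rule concrete, here is a hedged Python sketch. Reusing the anesthesia means and SDs (110, 20 and 50, 10) as the two variables is an illustrative assumption, not part of this slide, and the two are treated as independent. Python's NormalDist supports + and - between two independent normal variables.

```python
from math import isclose, sqrt
from statistics import NormalDist

# Illustrative values only (borrowed from the anesthesia example).
y1 = NormalDist(mu=110, sigma=20)
y2 = NormalDist(mu=50, sigma=10)

diff = y1 - y2    # mean µ1-µ2, SD sqrt(σ1^2 + σ2^2)
total = y1 + y2   # mean µ1+µ2, same SD as the difference

print(f"diff: mean={diff.mean:.1f}, SD={diff.stdev:.2f}")
print(f"sum:  mean={total.mean:.1f}, SD={total.stdev:.2f}")
```

Note that the SDs do not subtract for the difference: variances add in both cases.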
Difference of two normals
Specificity & Sensitivity
For serum Creatinine in normal adults
μ = 1.1 mg/dl, σ = 0.2 mg/dl
In one type of renal disease
μ = 1.7 mg/dl, σ = 0.4 mg/dl
If a cutoff value of 1.6 mg/dl is used
Prob false pos= prob Y > 1.6 given normal
Prob false neg = prob Y < 1.6 given disease
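The slide leaves the two probabilities to be worked out. A minimal Python sketch, assuming serum creatinine is Gaussian in both groups, taking 1.1 and 0.2 mg/dl as the mean and SD in normal adults and 1.7 and 0.4 mg/dl in the renal disease group; the variable names are illustrative.

```python
from statistics import NormalDist

normal_adults = NormalDist(mu=1.1, sigma=0.2)   # mg/dl
renal_disease = NormalDist(mu=1.7, sigma=0.4)   # mg/dl
cutoff = 1.6                                    # mg/dl

false_pos = 1 - normal_adults.cdf(cutoff)   # P(Y > 1.6 | normal)
false_neg = renal_disease.cdf(cutoff)       # P(Y < 1.6 | disease)

specificity = 1 - false_pos
sensitivity = 1 - false_neg

print(f"false positive: {false_pos:.4f}, specificity: {specificity:.4f}")
print(f"false negative: {false_neg:.4f}, sensitivity: {sensitivity:.4f}")
```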
Data transformations & logs
Some continuous variables follow the
Gaussian on a transformed scale, not the
original scale. Statland implies that perhaps
80% of continuous lab test variables follow
a Gaussian on either the original (50%) or a
transformed scale, usually the log scale.
(Clinical Decision Levels for lab Tests, 2nd ed,
1987, Med Econ)
Example-Bilirubin
[Figure: two histograms of bilirubin, n=216: original scale (0 to 350 umol/L) and log scale (0.25 to 3.25 log10 umol/L)]
          Bilirubin, umol/L    Log Bilirubin, log10 umol/L
Mean      64.3                 1.55
Median    34.7                 1.54
SD        104.3                0.456
n         216                  216
95% prediction intervals
                  Mean    SD       2 SD     Lower     Upper
Original scale    64.3    104.3    208.6    -144.3    272.9
log 10 scale      1.55    0.456    0.912    0.64      2.46
Geometric mean = 10^1.55 = 35.5 umol/L
Prediction interval (10^0.64, 10^2.46) or (4.3, 290) umol/L
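The back-transformation from the log10 scale can be sketched in Python. It assumes the log10 bilirubin values are Gaussian with the mean and SD reported for the log scale, and uses Z=2 for the approximate 95% interval.

```python
log_mean = 1.55    # mean of log10 bilirubin
log_sd = 0.456     # SD of log10 bilirubin
z = 2              # approximate 95% prediction interval

# Interval on the log10 scale
lower_log = log_mean - z * log_sd
upper_log = log_mean + z * log_sd

# Back-transform by raising 10 to each value
geometric_mean = 10 ** log_mean
lower = 10 ** lower_log
upper = 10 ** upper_log

print(f"geometric mean: {geometric_mean:.1f}")
print(f"interval: ({lower:.1f}, {upper:.0f})")
```

Unlike the original-scale interval, the back-transformed interval cannot have a negative lower limit.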
Normal probability plot
Bilirubin – original scale
[Figure: normal plot of Bilirubin; vertical axis: Z assuming Gaussian (-3 to 3); horizontal axis: observed Z = (Y-mean)/SD (-1 to 5)]
Data is Gaussian if plot is a straight line. The plot on the original scale is not straight, so the data are not Gaussian.
Normal probability plot
Bilirubin- log scale
[Figure: normal plot of log Bilirubin; vertical axis: Z assuming Gaussian (-3 to 3); horizontal axis: observed Z = (Y-mean)/SD (-3 to 3)]
Data is Gaussian if plot is a straight line, as it is on the log scale.
Log transformation (cont).
The distribution of ratios is much closer to
Gaussian on the log scale
The “inverse” of 3/1 is 1/3. This is
symmetric only on the log scale
Original:   100/1   10/1   1/1   1/10   1/100
Log10:      2       1      0     -1     -2
True for OR, RR and HR
Measures of growth & proliferation have
distribution closer to the Gaussian on the
log scale
Data distributions that tend to be
Gaussian on the log scale
Growth measures - bacterial CFU
Ab or Ag titers (IgA, IgG, …)
pH
Neurological stimuli (dB, Snellen units)
Steroids, hormones (Estrogen, Testosterone)
Cytokines (IL-1, MCP-1, …)
Liver function (Bilirubin, Creatinine)
Hospital Length of stay (can be Poisson)
Quick Probability Theory
Mutually exclusive events: levels of one
variable
Blood type    probability
A             30%
B             12%
AB            8%
O             50%
Probability A or O = 30% + 50%=80%.
Mutually exclusive probabilities add. All
(exhaustive) categories sum to 100%
Probability-Independent events
The probabilities of two independent events
multiply. (two or more variables)
If 5% of pregnant women have gestational
diabetes
If 8% of pregnant women have pre-eclampsia
Probability of gest. diabetes and preeclampsia = 5% x 8% = 0.4% if
independent.
Conditional probability
Probability of an event changes if made
conditional on another event. Probability
(prevalence) of TB is 0.1% in general
population.
In Vietnamese immigrants, TB probability is
4%.
Conditional on being a Vietnamese
immigrant, probability is 4%.
Conditional Probability & Bayes
[Figure: Venn diagram. Total population n=1,000,000; A=Vietnamese, n=5000; B=TB+, n=1000; overlap A∩B, n=200]
Want prob TB|Vietnamese but can’t check all Vietnamese for TB
Conditional Prob & Bayes Rule
What is TB prevalence in Orange Co
Vietnamese population?
Too hard to take census of all Vietnamese.
Assume we know:
P(A)=prop in Orange Co who are Viet=0.5%
P(B)=prop in Orange Co who have TB = 0.1%
P(A|B)=prop of those with TB who are Viet=20%
Want P(B|A) = P(A|B) P(B)/ P(A) =
(0.2 x 0.001)/(0.005) = 0.04=4%
Bayes rule for conditional probability (formula)
Probability of B given A = P(B|A)
= Joint probability of A and B / Probability of A
= P(A ∩ B)/P(A)
= (Probability of A given B x Probability of B) / Probability of A
Bayes rule: P(B|A)=[ P(A|B)P(B)] / P(A)
If A and B are independent, P(B|A)=P(B)
Also P(B) = ∑ P(B|Ai) P(Ai)
(sum over all Ai, where the Ai are mutually exclusive and exhaustive)
Example: Bayes rule
A=Vietnamese, B=TB+
In pop of 1,000,000,
5000 (0.5%=0.005) are Vietnamese=P(A),
1000 (0.1%=0.001) have TB+ =P(B).
Of 1000 with TB+, 200 (20%=0.20) are
Vietnamese=P(A|B)
Want prob. of TB given Vietnamese? =P(B|A).
P(B|A)= 0.20 (0.001)/0.005 = 0.04=4%.
=200/5000
Can’t test all Viet for TB+, can check all TB+ for Viet
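The same calculation as a minimal Python sketch, using the counts from the example. The function name bayes is hypothetical, chosen for this illustration.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes rule: P(B|A) = P(A|B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

population = 1_000_000
n_viet = 5000          # A = Vietnamese
n_tb = 1000            # B = TB+
n_viet_and_tb = 200    # A and B

p_a = n_viet / population            # P(A) = 0.005
p_b = n_tb / population              # P(B) = 0.001
p_a_given_b = n_viet_and_tb / n_tb   # P(A|B) = 0.20

p_b_given_a = bayes(p_a_given_b, p_b, p_a)
print(f"P(TB+ | Vietnamese) = {p_b_given_a:.2%}")
```

The result equals the direct ratio 200/5000, which is the quantity that could not be measured directly.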
Bayes rule (graph)
[Figure: Venn diagram of the 1,000,000 pop. A = 5000 Viet, B = 1000 TB+, A∩B = 200 Viet + TB+]
Conditional probability of TB+ given Vietnamese, B|A = 200/5000 = 4%
Check all TB+ for Viet rather than check all Viet for TB
Bayesian vs Frequentist
Bayesian computes Prob(hypothesis|data) =
[Prob(data|hypothesis) x Prob(hypothesis)] / Prob(data)
= (data likelihood x prior probability) / Prob(data)
If data (evidence) refutes a hypothesis,
Prob(data | hypothesis)=0 so
Prob(hypothesis | data)=0
Frequentist computes
Prob(data*|hypothesis)= p value
* p value is prob of observed data or more extreme data
Binomial distribution
Population: Positive= π = 0.30, negative = 1- π = 0.70
Y= number of positive responses out of n trials
n=1
Y    probability
0    0.700
1    0.300
[Figure: bar chart of the n=1 probabilities]
n=2
Y    probability
0    0.49 = 0.7 x 0.7
1    0.42 = 0.7 x 0.3 x 2
2    0.09 = 0.3 x 0.3
[Figure: bar chart of the n=2 probabilities]
Binomial (cont.)
n=3
Y    probability
0    0.343
1    0.441
2    0.189
3    0.027
[Figure: bar chart of the n=3 probabilities]
n=4
Y    probability
0    0.2401
1    0.4116
2    0.2646
3    0.0756
4    0.0081
[Figure: bar chart of the n=4 probabilities]
General binomial formula
Probability of y positive out of n,
where π is prob of a single positive
= n!/[y!(n-y)!] π^y (1-π)^(n-y)
Mean=nπ, SD=√[nπ(1-π)]
Ex: Prob of y=5 herpes cases out of n=50
teens if herpes incidence=π=4%=0.04
Prob=50!/(5! 45!) (0.04)^5 (0.96)^45 = 0.0346 ≈ 3.5%
Can compute using “=BINOMDIST(y,n,π,0)” in EXCEL
For example, =BINOMDIST(5,50,0.04,0) is 0.0346
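For readers without EXCEL, the general formula can be sketched in Python with the standard library. The function names binomial_pmf and binomial_mean_sd are hypothetical; the sketch evaluates the herpes example and the n=3, π=0.30 table from the earlier slide.

```python
from math import comb, sqrt

def binomial_pmf(y, n, pi):
    """P(y positives out of n) = n!/[y!(n-y)!] * pi^y * (1-pi)^(n-y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

def binomial_mean_sd(n, pi):
    """Mean n*pi and SD sqrt(n*pi*(1-pi)) of the binomial count."""
    return n * pi, sqrt(n * pi * (1 - pi))

# Herpes example: y=5 cases among n=50 teens, pi=0.04
p_herpes = binomial_pmf(5, 50, 0.04)
print(f"P(y=5) = {p_herpes:.4f}")

# n=3, pi=0.30 table
table_n3 = [binomial_pmf(y, 3, 0.30) for y in range(4)]
print([round(p, 3) for p in table_n3])
```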
Binomial-fair coin example
For π=0.5, it is easy to compute y=number of “heads” (success) out of n:
prob y out of n = n!/[y!(n-y)!] / 2^n
Ex: n=3, flip 3 fair coins, 2^3=8 possibilities
0+0+0=0=y
0+0+1=1=y
0+1+0=1=y
1+0+0=1=y
0+1+1=2=y
1+0+1=2=y
1+1+0=2=y
1+1+1=3=y
y        freq    prob
0        1       1/8
1        3       3/8
2        3       3/8
3        1       1/8
total    8       8/8
Pascal’s triangle
n    y: 0 to n “success”    2^n
0    1                      1
1    1 1                    2
2    1 2 1                  4
3    1 3 3 1                8
4    1 4 6 4 1              16
5    1 5 10 10 5 1          32
For n=5, prob(y=2) is 10/32
prob(y≤2) is (1+5+10)/32=16/32
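The rows of the triangle are the binomial coefficients n!/[y!(n-y)!], so they can be generated directly. A short Python sketch, with pascal_row as a hypothetical helper name:

```python
from math import comb

def pascal_row(n):
    """Row n of Pascal's triangle: the counts of y = 0..n successes."""
    return [comb(n, y) for y in range(n + 1)]

# Each row sums to 2^n, the number of equally likely outcomes for a fair coin.
for n in range(6):
    row = pascal_row(n)
    print(n, row, sum(row))

row5 = pascal_row(5)
prob_y2 = row5[2] / 2**5            # prob(y=2) for n=5
prob_y_le_2 = sum(row5[:3]) / 2**5  # prob(y<=2) for n=5
```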
Headache remedy success
The “old” headache remedy was successful
π=50% of the time, a true “population”
value well established after years of study.
A “new” remedy is tried in 10 persons and
is successful in 7 of the 10 (70%).
Is this enough evidence to “prove” that the
new remedy is better?
Hypothesis testing-Binomial
How likely is y=7 success out of n=10 if
π=0.5?
prob = 10!/(7!3!) / 2^10 = 120/1024=0.1172
How likely y=7 or more (p value)?
y        probability
7        120/1024 = 0.1172
8        45/1024 = 0.0439
9        10/1024 = 0.0098
10       1/1024 = 0.0010
total    176/1024 = 0.1719 <- p value
How likely is observing y=70 success out of n=100
if π=0.5 for each trial?
Prob(y=70)=[100!/(70! 30!)] / 2^100 = 2.32 x 10^-5
How likely is it to observe 70 or more successes
out of 100?
pr(y=70) + pr(y=71) + …+pr(y=100) = 3.93 x 10^-5
This is a simple example of hypothesis
testing. The probability of observing y=70
or more successes out of n=100 under the
“null hypothesis” that the true population
π=0.5 is called a one sided p value.
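The one sided p values can be checked with a short Python sketch that sums the binomial probabilities from the observed count up to n. The function name binomial_upper_tail is hypothetical, and the default π=0.5 is the null hypothesis from the headache example.

```python
from math import comb

def binomial_upper_tail(y, n, pi=0.5):
    """One sided p value: P(Y >= y) for Y ~ Binomial(n, pi)."""
    return sum(
        comb(n, k) * pi**k * (1 - pi) ** (n - k)
        for k in range(y, n + 1)
    )

p_10 = binomial_upper_tail(7, 10)     # 7 or more successes out of 10
p_100 = binomial_upper_tail(70, 100)  # 70 or more successes out of 100

print(f"n=10:  p = {p_10:.4f}")
print(f"n=100: p = {p_100:.2e}")
```

The same 70% success rate gives very different p values at n=10 and n=100, which is the point of the comparison.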
[Figure: bar chart of rel freq (0.00 to 0.30) against num of success = y (0 to 10), for num success out of n=10, π=0.5]
Gaussian approximation to Binomial
ok for large n,
π not near 0 or 1
π =0.15, n=50, mean=0.15(50)=7.5, SD=√[50(0.15)(0.85)]=2.52
[Figure: bar chart of the Binomial dist for y = 0 to 18, probabilities up to 0.18]
Actual 2.5th percentile is between 2 & 3, Gaussian=7.5-2(2.5)=2.5
Actual 97.5th percentile is between 12 and 13, Gaussian=7.5+2(2.5)=12.5
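A Python sketch comparing the exact binomial percentiles with the Gaussian approximation for π=0.15, n=50. The helper name binomial_cdf is hypothetical; the Gaussian limits are taken as mean ± 2 SD, as on the slide.

```python
from math import comb, sqrt

n, pi = 50, 0.15
mean = n * pi
sd = sqrt(n * pi * (1 - pi))

def binomial_cdf(y):
    """Exact P(Y <= y) for Y ~ Binomial(n, pi)."""
    return sum(comb(n, k) * pi**k * (1 - pi) ** (n - k) for k in range(y + 1))

# Gaussian approximation to the 2.5th and 97.5th percentiles
gauss_lower = mean - 2 * sd
gauss_upper = mean + 2 * sd
print(f"mean={mean:.1f}, SD={sd:.2f}, Gaussian limits=({gauss_lower:.2f}, {gauss_upper:.2f})")

# Exact cumulative probabilities around the limits
for y in (2, 3, 12, 13):
    print(f"P(Y <= {y}) = {binomial_cdf(y):.4f}")
```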
Poisson distribution for count data
For a patient, y is a non-negative integer:
0,1,2,3,…
Probability of “y” responses (or events)
given mean μ
= (μ^y e^-μ)/ (y!)
(Note: μ^0=1 and 0!=1 by definition)
For Poisson, if mean=μ then SD=√μ
Examples: Number of colds in a season,
num neurons fired in 30 sec (firing rate)
Poisson example
Q: If average num colds in a single winter is
μ=1.9, what is the probability that a given
patient will have 4 colds in one winter?
A: (1.9)^4 e^-1.9/(4x3x2x1) = 0.0812 ≈ 8%.
What is the probability of 4 or more? Find the probability for
0-3 and subtract from 1: prob = 1 - 0.8747 = 0.1253 ≈ 12.5%.
Can compute in EXCEL with “=POISSON(y,mean,0)”.
=POISSON(4, 1.9, 0) gives 0.0812.
=POISSON(4, 1.9, 1) gives cumulative probability of 4 or
fewer (4,3,2,1,0) which is 0.9559.
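The colds example can be reproduced without EXCEL. A minimal Python sketch, with poisson_pmf and poisson_cdf as hypothetical helper names that play the roles of =POISSON(y,mean,0) and =POISSON(y,mean,1):

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """P(exactly y events) = mu^y * e^-mu / y!"""
    return mu**y * exp(-mu) / factorial(y)

def poisson_cdf(y, mu):
    """P(y or fewer events): sum of the pmf over 0..y."""
    return sum(poisson_pmf(k, mu) for k in range(y + 1))

mu = 1.9  # average number of colds in one winter

p_4 = poisson_pmf(4, mu)          # exactly 4 colds
p_le_4 = poisson_cdf(4, mu)       # 4 or fewer colds
p_ge_4 = 1 - poisson_cdf(3, mu)   # 4 or more colds

print(f"{p_4:.4f} {p_le_4:.4f} {p_ge_4:.4f}")
```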
Poisson distribution
[Figure: bar chart of probability (0.00 to 0.30) against num colds (0 to 8); Poisson distribution, mean=1.9, SD=1.38]
Poisson process
Mean rate of events is h events per unit
(hazard rate). In T units, we expect
μ=hT events on average. Can
substitute this average (μ) into
(μ^y e^-μ)/ (y!)
to get probability of “y” events in T
units.
Poisson process example
Example: Cancer clusters
Q: Given a cancer rate of h=3/1000 person-years, what is
the expected number of cases in 2 years in a population
of 1500?
A: Rate over 2 years is 2 x (3/1000) = 6/1000 per person. Expected is
μ=hT= 6/1000 x 1500 = 9 cases (T = 2 x 1500 = 3000 person-years).
Q: What is the probability of observing exactly 15 cases?
A: μ=9, Probability =(9^15 e^-9)/15! = 0.019431≈ 2%.
Q: What is the probability of observing 15 or more cases in
1500 persons?
A: Plug in 0,1,2, …14 and add to get Q= probability of 14
or fewer. Probability is 1-Q = 1-0.958534 = 0.041466 ≈
4%.
Can compute with “=POISSON(y,μ,0)” in EXCEL for
probability of y events with mean μ. =POISSON(y,μ,1) gives
cumulative probability of y or fewer.
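A Python sketch of the cancer cluster calculation, assuming cases follow a Poisson process with a constant rate. Here the exposure T is written out as person-years (2 years x 1500 persons); the variable names are illustrative.

```python
from math import exp, factorial

def poisson_pmf(y, mu):
    """P(exactly y events) = mu^y * e^-mu / y!"""
    return mu**y * exp(-mu) / factorial(y)

rate = 3 / 1000            # h: cases per person-year
person_years = 2 * 1500    # T: 2 years of follow-up in 1500 persons
mu = rate * person_years   # expected number of cases, mu = hT

p_15 = poisson_pmf(15, mu)                                     # exactly 15 cases
p_15_or_more = 1 - sum(poisson_pmf(k, mu) for k in range(15))  # 1 - P(14 or fewer)

print(f"expected cases: {mu:.1f}")
print(f"P(exactly 15) = {p_15:.6f}")
print(f"P(15 or more) = {p_15_or_more:.6f}")
```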
Summary: Descriptive stats for
Normal, Binomial & Poisson
n = sample size
SD = √variance, SE = SD/√n
Distribution    mean    variance    SD            SE
Normal          µ       σ^2         σ             σ/√n
Binomial        π       π(1-π)      √[π(1-π)]     √[π(1-π)/n]
Poisson         µ       µ           √µ            √(µ/n)