Section III
Gaussian distribution
Probability distributions (Binomial, Poisson)

Notation

Statistic             Sample     Population
mean                  Ȳ          μ
Std deviation         S or SD    σ
proportion            P          π
mean difference       d          δ
Correlation coeff     r          ρ
rate (regression)     b          β
Num of obs            n          N

Densities: Percentiles

[Figure: relative frequency (0% to 25%) of x = BMI from 14 to 27. 88% of the area lies below BMI = 22, so BMI = 22 is the 88th percentile.]

Standard Z scores

Definition: Z = (Y - mean)/SD, so Y = mean + Z × SD.
Z is how many SD units Y is above or below the mean.
The mean and SD may be sample values (Ȳ, S), or population values (μ, σ) if the population values are known.

Survival data, mean = 17.54, SD = 11.68

Y      Y - mean    Z = (Y - mean)/SD
4      -13.54      -1.16
6      -11.54      -0.99
8      -9.54       -0.82
8      -9.54       -0.82
12     -5.54       -0.47
14     -3.54       -0.30
15     -2.54       -0.22
17     -0.54       -0.05
19     1.46        0.13
22     4.46        0.38
24     6.46        0.55
34     16.46       1.41
45     27.46       2.35

Standard Gaussian (Normal): a distribution model

[Figure: standard Gaussian density, μ = 0, σ = 1, plotted against Z from -3.5 to 3.5. 16% of the area lies below Z = -1, 34% between -1 and 0, 34% between 0 and 1, and 16% above Z = 1.]

Selected Gaussian percentiles

Z        lower area (P < Z)
-2.00    2.28%
-1.96    2.50%
-1.50    6.68%
-1.00    15.87%
0.00     50.00%
1.00     84.13%
1.50     93.32%
1.96     97.50%
2.00     97.72%

Gaussian percentiles

[Figure: standard Gaussian density, μ = 0, σ = 1. The area below Z = 1.5 is 0.933 = 93.3%.]

EXCEL function =NORMSDIST(Z) gives the percentile from Z.
EXCEL function =NORMSINV(p) gives Z from the percentile.

Example: SAT Verbal
Mean = μ = 500, SD = σ = 100
What is your percentile if Y = 700?
Z = (700 - 500)/100 = 2.0, area = 0.977 = 97.7%
What score is the 80th percentile? Z0.80 = 0.842
Y = 500 + 0.842 (100) = 584
What percent are between 450 and 500?
For Y = 450, Z = (450 - 500)/100 = -0.5, area = 0.3085
For Y = 500, Z = 0, area = 0.5000,
so the area between is 0.5000 - 0.3085 = 0.1915 ≈ 19%

Example: Anesthesia
Effective dose, μ = 50 mg/kg, σ = 10 mg/kg
Lethal dose, μ = 110 mg/kg, σ = 20 mg/kg
Q1: What dose will put 90% to sleep?
Q2: What is the risk of death from this dose?

Answers:
Q1: Z0.90 = 1.28, Y = 50 + 1.28 (10) = 62.8 mg/kg
Q2: Z = (62.8 - 110)/20 = -2.36, area < 1%

Prediction intervals (not CI)

If μ and σ are known and the data are known to have a Gaussian distribution, the interval (μ - Zσ, μ + Zσ) is the (2k - 100)% prediction interval, where Z (Z > 0) is the kth percentile of the standard Gaussian. For example, Z = 1.96 is the 97.5th percentile, so the interval covers 2(97.5) - 100 = 95%.
For Z = 2, (μ - 2σ, μ + 2σ) is (approximately) the 95% prediction interval.
This implies SD ≈ range/4 (extremes excluded).

Normal dist: differences & sums

If Y1 and Y2 have independent normal distributions with means and SDs as below

variable    mean    SD
Y1          μ1      σ1
Y2          μ2      σ2

then the difference and sum also have normal distributions.

                  mean       SD
diff = Y1 - Y2    μ1 - μ2    sqrt(σ1^2 + σ2^2)
sum = Y1 + Y2     μ1 + μ2    sqrt(σ1^2 + σ2^2)

Q: If σ1 = σ2, what is the mean difference with 100% overlap?

Difference of two normals
Specificity & Sensitivity

For serum creatinine in normal adults, μ = 1.1 mg/dl, σ = 0.2 mg/dl.
In one type of renal disease, μ = 1.7 mg/dl, σ = 0.4 mg/dl.
If a cutoff value of 1.6 mg/dl is used:
Prob false pos = prob Y > 1.6 given normal
Prob false neg = prob Y < 1.6 given disease

Data transformations & logs

Some continuous variables follow the Gaussian on a transformed scale, not the original scale. Statland implies that perhaps 80% of continuous lab test variables follow a Gaussian on either the original scale (50%) or a transformed scale, usually the log scale.
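The Gaussian percentile calculations in this section (SAT Verbal, anesthesia dose, creatinine cutoff) can be checked without Excel. The following is a minimal sketch using NormalDist from Python's standard library, where cdf plays the role of NORMSDIST and inv_cdf the role of NORMSINV. The variable names are illustrative only, and the creatinine means and SDs are taken as the μ and σ values on the slide.

```python
from statistics import NormalDist

# SAT Verbal: mean 500, SD 100 (values from the slide)
sat = NormalDist(mu=500, sigma=100)
sat_percentile_700 = sat.cdf(700)           # fraction scoring below 700
sat_score_80th = sat.inv_cdf(0.80)          # score at the 80th percentile
sat_between = sat.cdf(500) - sat.cdf(450)   # fraction between 450 and 500

# Anesthesia: effective and lethal dose distributions (mg/kg)
effective = NormalDist(mu=50, sigma=10)
lethal = NormalDist(mu=110, sigma=20)
dose_90 = effective.inv_cdf(0.90)           # dose that puts 90% to sleep
death_risk = lethal.cdf(dose_90)            # fraction for whom that dose is lethal

# Serum creatinine with a cutoff of 1.6 mg/dl
normal_adults = NormalDist(mu=1.1, sigma=0.2)
renal_disease = NormalDist(mu=1.7, sigma=0.4)
cutoff = 1.6
false_pos = 1 - normal_adults.cdf(cutoff)   # P(Y > 1.6 | normal)
false_neg = renal_disease.cdf(cutoff)       # P(Y < 1.6 | disease)

print(f"SAT 700 percentile:      {sat_percentile_700:.3f}")
print(f"SAT 80th percentile:     {sat_score_80th:.0f}")
print(f"SAT between 450 and 500: {sat_between:.4f}")
print(f"Dose for 90% asleep:     {dose_90:.1f} mg/kg")
print(f"Risk of death:           {death_risk:.4f}")
print(f"False positive rate:     {false_pos:.4f}")
print(f"False negative rate:     {false_neg:.4f}")
```

With the slide's numbers, the false positive rate comes out below 1% (Z = 2.5) while the false negative rate is about 40% (Z = -0.25), which shows how much the choice of cutoff trades specificity against sensitivity.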
(Source for the Statland estimate: Clinical Decision Levels for Lab Tests, 2nd ed, 1987, Med Econ)

Example: Bilirubin

[Figure: two histograms, n = 216 each. Left: bilirubin, µmol/L (0 to 350). Right: log bilirubin, log10 µmol/L (0.25 to 3.25).]

          Bilirubin (µmol/L)    Log bilirubin (log10 µmol/L)
Mean      64.3                  1.55
Median    34.7                  1.54
SD        104.3                 0.456
n         216                   216

95% prediction intervals

         Original scale    log10 scale
Mean     64.3              1.55
SD       104.3             0.456
2 SD     208.6             0.912
Lower    -144.3            0.64
Upper    272.9             2.46

The original-scale lower limit is negative, which is impossible for a concentration. Back-transforming the log-scale results gives:
Geometric mean = 10^1.55 = 35.5 µmol/L
Prediction interval (10^0.64, 10^2.46) or (4.3, 290)

Normal probability plot: Bilirubin, original scale

[Figure: normal plot of bilirubin. Axes: observed Z = (Y - mean)/SD versus Z assuming Gaussian.]
Data are Gaussian if the plot is a straight line. The plot above is not a straight line, so the data are not Gaussian on this scale.

Normal probability plot: Bilirubin, log scale

[Figure: normal plot of log bilirubin. Axes: observed Z = (Y - mean)/SD versus Z assuming Gaussian.]
Data are Gaussian if the plot is a straight line, as above.

Log transformation (cont.)

The distribution of ratios is much closer to Gaussian on the log scale.
The "inverse" of 3/1 is 1/3. This is symmetric only on the log scale.
Original: 100/1, 10/1, 1/1, 1/10, 1/100
Log10: 2, 1, 0, -1, -2
This is true for OR, RR and HR.
Measures of growth & proliferation have distributions closer to the Gaussian on the log scale.

Data distributions that tend to be Gaussian on the log scale

Growth measures: bacterial CFU
Ab or Ag titers (IgA, IgG, …)
pH
Neurological stimuli (dB, Snellen units)
Steroids, hormones (Estrogen, Testosterone)
Cytokines (IL-1, MCP-1, …)
Liver function (Bilirubin, Creatinine)
Hospital length of stay (can be Poisson)

Quick Probability Theory

Mutually exclusive events: levels of one variable

Blood type    probability
A             30%
B             12%
AB            8%
O             50%

Probability of A or O = 30% + 50% = 80%.
Mutually exclusive probabilities add.
All (exhaustive) categories sum to 100%.

Probability: Independent events

The probabilities of two independent events multiply (two or more variables).
If 5% of pregnant women have gestational diabetes
and 8% of pregnant women have pre-eclampsia,
the probability of gest. diabetes and pre-eclampsia = 5% × 8% = 0.4% if independent.

Conditional probability

The probability of an event changes if made conditional on another event.
The probability (prevalence) of TB is 0.1% in the general population.
In Vietnamese immigrants, the TB probability is 4%.
Conditional on being a Vietnamese immigrant, the probability is 4%.

Conditional Probability & Bayes

[Figure: Venn diagram. Population n = 1,000,000; A = Vietnamese, n = 5000; B = TB+, n = 1000; A∩B, n = 200.]
We want prob TB | Vietnamese but can't check all Vietnamese for TB.

Conditional Prob & Bayes Rule

What is the TB prevalence in the Orange Co Vietnamese population?
It is too hard to take a census of all Vietnamese.
Assume we know:
P(A) = prop in Orange Co who are Viet = 0.5%
P(B) = prop in Orange Co who have TB = 0.1%
P(A|B) = prop of those with TB who are Viet = 20%
Want P(B|A) = P(A|B) P(B)/P(A) = (0.2 × 0.001)/(0.005) = 0.04 = 4%

Bayes rule for conditional probability (formula)

Probability of B given A = P(B|A)
= Joint probability of A and B / Probability of A = P(A ∩ B)/P(A)
= (Probability of A given B × Probability of B) / Probability of A
Bayes rule: P(B|A) = [P(A|B) P(B)] / P(A)
If A and B are independent, P(B|A) = P(B).
Also P(B) = ∑ P(B|Ai) P(Ai) (sum over all mutually exclusive, exhaustive Ai)

Example: Bayes rule

A = Vietnamese, B = TB+
In a pop of 1,000,000, 5000 (0.5% = 0.005) are Vietnamese = P(A), and 1000 (0.1% = 0.001) have TB+ = P(B).
Of the 1000 with TB+, 200 (20% = 0.20) are Vietnamese = P(A|B).
Want the prob. of TB given Vietnamese = P(B|A).
P(B|A) = 0.20 (0.001)/0.005 = 0.04 = 4%.
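As a quick check of the Bayes rule example, here is a minimal Python sketch that computes P(B|A) from P(A|B), P(B) and P(A), and compares it with the direct count of Vietnamese TB+ cases. The function name bayes and the variable names are illustrative assumptions, not part of the original slides.

```python
def bayes(p_a_given_b: float, p_b: float, p_a: float) -> float:
    """Return P(B|A) = P(A|B) * P(B) / P(A)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_given_b * p_b / p_a

# Counts from the slide example
population = 1_000_000
n_viet = 5_000      # A = Vietnamese
n_tb = 1_000        # B = TB+
n_both = 200        # A and B

p_a = n_viet / population        # P(A) = 0.005
p_b = n_tb / population          # P(B) = 0.001
p_a_given_b = n_both / n_tb      # P(A|B) = 0.20

p_b_given_a = bayes(p_a_given_b, p_b, p_a)
direct = n_both / n_viet         # same quantity by direct count

print(f"P(TB | Vietnamese) via Bayes rule: {p_b_given_a:.3f}")
print(f"P(TB | Vietnamese) by direct count: {direct:.3f}")
```

Both routes give the same answer because P(A|B) P(B) is just the joint probability P(A ∩ B), here 200 out of 1,000,000.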
By direct count, P(B|A) = 200/5000 = 4%.
We can't test all Viet for TB+, but we can check all TB+ for Viet.

Bayes rule (graph)

[Figure: Venn diagram of the 1,000,000 pop. A = 5000 Viet; B = 1000 TB+; A∩B = 200 Viet + TB+. Conditional probability of TB+ given Vietnamese, B|A = 200/5000 = 4%.]
Check all TB+ for Viet rather than check all Viet for TB.

Bayesian vs Frequentist

Bayesian computes
Prob(hypothesis | data) = Prob(data | hypothesis) × Prob(hypothesis) / Prob(data)
∝ data likelihood × prior probability
If the data (evidence) refute a hypothesis, Prob(data | hypothesis) = 0, so Prob(hypothesis | data) = 0.
Frequentist computes Prob(data* | hypothesis) = p value
* The p value is the probability of the observed data or more extreme data.

Binomial distribution

Population: positive = π = 0.30, negative = 1 - π = 0.70
Y = number of positive responses out of n trials

n = 1
Y    probability
0    0.700
1    0.300

n = 2
Y    probability
0    0.49 = 0.7 × 0.7
1    0.42 = 0.7 × 0.3 × 2
2    0.09 = 0.3 × 0.3

Binomial (cont.)

n = 3
Y    probability
0    0.343
1    0.441
2    0.189
3    0.027

n = 4
Y    probability
0    0.2401
1    0.4116
2    0.2646
3    0.0756
4    0.0081

General binomial formula

Probability of y positive out of n, where π is the prob of a single positive,
= n!/[y!(n-y)!] × π^y × (1-π)^(n-y)
Mean = πn, SD = √(nπ(1-π))
Ex: Prob of y = 5 herpes cases out of n = 50 teens if herpes incidence = π = 4% = 0.04
Prob = 50!/(5! 45!) × (0.04)^5 × (0.96)^45 ≈ 0.0346, about 3.5%
Can compute using "=BINOMDIST(y,n,π,0)" in EXCEL.
For example, =BINOMDIST(5,50,0.04,0) is 0.0346.

Binomial: fair coin example

For π = 0.5, the probability is easy to compute.
y = number of "heads" (success) out of n
prob of y out of n = n!/[y!(n-y)!] / 2^n
Ex: n = 3, flip 3 fair coins, 2^3 = 8 possibilities

0+0+0 = 0 = y
0+0+1 = 1 = y
0+1+0 = 1 = y
1+0+0 = 1 = y
0+1+1 = 2 = y
1+0+1 = 2 = y
1+1+0 = 2 = y
1+1+1 = 3 = y

y        freq    prob
0        1       1/8
1        3       3/8
2        3       3/8
3        1       1/8
total    8       8/8

Pascal's triangle

n    y: 0 to n "success"    2^n
1    1 1                    2
2    1 2 1                  4
3    1 3 3 1                8
4    1 4 6 4 1              16
5    1 5 10 10 5 1          32

For n = 5, prob(y = 2) is 10/32
and prob(y ≤ 2) is (1+5+10)/32 = 16/32.

Headache remedy success

The "old" headache remedy was successful π = 50% of the time, a true "population" value well established after years of study.
A "new" remedy is tried in 10 persons and is successful in 7 of the 10 (70%). Is this enough evidence to "prove" that the new remedy is better?

Hypothesis testing: Binomial

How likely is y = 7 successes out of n = 10 if π = 0.5?
prob = 10!/(7! 3!) / 2^10 = 120/1024 = 0.1172
How likely is y = 7 or more (p value)?

y        probability
7        120/1024 = 0.1172
8        45/1024 = 0.0439
9        10/1024 = 0.0098
10       1/1024 = 0.0010
total    176/1024 = 0.1719  <- p value

How likely is observing y = 70 successes out of n = 100 if π = 0.5 for each trial?
Prob(y = 70) = [100!/(70! 30!)] / 2^100 = 2.32 × 10^-5
How likely is it to observe 70 or more successes out of 100?
pr(y = 70) + pr(y = 71) + … + pr(y = 100) = 3.93 × 10^-5
This is a simple example of hypothesis testing. The probability of observing y = 70 or more successes out of n = 100 under the "null hypothesis" that the true population π = 0.5 is called a one-sided p value.
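The one-sided binomial p values for the headache remedy can be reproduced by summing the binomial formula over the upper tail. Below is a minimal Python sketch using math.comb from the standard library. The helper names binom_pmf and upper_tail are illustrative assumptions, not functions from the slides or from Excel.

```python
from math import comb

def binom_pmf(y: int, n: int, pi: float) -> float:
    """Probability of exactly y positives out of n trials: C(n, y) pi^y (1-pi)^(n-y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

def upper_tail(y: int, n: int, pi: float) -> float:
    """One-sided p value: probability of y or more positives out of n."""
    return sum(binom_pmf(k, n, pi) for k in range(y, n + 1))

# Headache remedy: 7 of 10 successes under the null hypothesis pi = 0.5
p_exact_7_of_10 = binom_pmf(7, 10, 0.5)
p_value_10 = upper_tail(7, 10, 0.5)

# Same 70% success rate, but with n = 100
p_exact_70_of_100 = binom_pmf(70, 100, 0.5)
p_value_100 = upper_tail(70, 100, 0.5)

print(f"n=10:  P(y=7)  = {p_exact_7_of_10:.4f},  P(y>=7)  = {p_value_10:.4f}")
print(f"n=100: P(y=70) = {p_exact_70_of_100:.3g}, P(y>=70) = {p_value_100:.3g}")
```

The same observed proportion (70%) gives a p value of about 0.17 with n = 10 but about 4 in 100,000 with n = 100, which is why sample size matters when judging the evidence.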
[Figure: relative frequency (0 to 0.30) of the number of successes y = 0 to 10 out of n = 10, π = 0.5.]

Gaussian approximation to Binomial

The approximation is ok for large n, with π not near 0 or 1.
π = 0.15, n = 50, mean = 0.15(50) = 7.5, SD = √(50(0.15)(0.85)) = 2.52

[Figure: binomial distribution for π = 0.15, n = 50; probability (0 to 0.18) versus y = 0 to 18.]

Actual 2.5th percentile is between 2 & 3; Gaussian = 7.5 - 2(2.5) = 2.5
Actual 97.5th percentile is between 12 and 13; Gaussian = 7.5 + 2(2.5) = 12.5

Poisson distribution for count data

For a patient, y is a non-negative integer: 0, 1, 2, 3, …
Probability of "y" responses (or events) given mean μ
= (μ^y e^(-μ))/y!
(Note: μ^0 = 1 and 0! = 1 by definition)
For Poisson, if mean = μ then SD = √μ
Examples: number of colds in a season, num neurons fired in 30 sec (firing rate)

Poisson example

Q: If the average num of colds in a single winter is μ = 1.9, what is the probability that a given patient will have 4 colds in one winter?
A: (1.9)^4 e^(-1.9)/(4×3×2×1) = 0.0812 ≈ 8%.
What is the probability of 4 or more? Find the probabilities for 0 to 3 and subtract their sum from 1: prob = 0.125, about 12.5%.
Can compute in EXCEL with "=POISSON(y,mean,0)".
=POISSON(4, 1.9, 0) gives 0.0812.
=POISSON(4, 1.9, 1) gives the cumulative probability of 4 or less (4, 3, 2, 1, 0), which is 0.9559.

[Figure: Poisson distribution, mean = 1.9, SD = 1.38; probability (0 to 0.30) versus num colds = 0 to 8.]

Poisson process

The mean rate of events is h events per unit (the hazard rate).
In T units, we expect μ = hT events on average.
Substitute this average (μ) into (μ^y e^(-μ))/y! to get the probability of "y" events in T units.

Poisson process example

Example: Cancer clusters
Q: Given a cancer rate of h = 3/1000 person-years, what is the expected number of cases in 2 years in a population of 1500?
A: The rate per person over 2 years is 2 × (3/1000) = 6/1000. Expected is μ = 6/1000 × 1500 = 9 cases (equivalently, h = 3/1000 per person-year times T = 3000 person-years).
Q: What is the probability of observing exactly 15 cases?
A: μ = 9, Probability = (9^15 e^(-9))/15! = 0.019431 ≈ 2%.
Q: What is the probability of observing 15 or more cases in 1500 persons?
A: Plug y = 0, 1, 2, …, 14 into the formula and add to get Q = probability of 14 or less. Probability is 1 - Q = 1 - 0.958534 = 0.041466 ≈ 4%.
Can compute with "=POISSON(y,μ,0)" in EXCEL for the probability of y events with mean μ.
=POISSON(y,μ,1) gives the cumulative probability of y or less.

Summary: Descriptive stats for Normal, Binomial & Poisson

n = sample size
SD = √variance, SE = SD/√n

Distribution    Normal    Binomial         Poisson
mean            μ         π                μ
variance        σ^2       π(1-π)           μ
SD              σ         √(π(1-π))        √μ
SE              σ/√n      √(π(1-π)/n)      √(μ/n)
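The Poisson examples in this section (colds per winter and the cancer cluster) can also be checked outside Excel. The sketch below is a minimal Python version using only the standard library. The helpers poisson_pmf and poisson_cdf are illustrative names that mirror =POISSON(y,μ,0) and =POISSON(y,μ,1), and the exposure T is taken as 2 years × 1500 persons = 3000 person-years.

```python
from math import exp, factorial

def poisson_pmf(y: int, mu: float) -> float:
    """Probability of exactly y events given mean mu: mu^y e^(-mu) / y!."""
    return mu**y * exp(-mu) / factorial(y)

def poisson_cdf(y: int, mu: float) -> float:
    """Cumulative probability of y or fewer events given mean mu."""
    return sum(poisson_pmf(k, mu) for k in range(y + 1))

# Colds example: mean of 1.9 colds per winter
p_4_colds = poisson_pmf(4, 1.9)
p_4_or_more = 1 - poisson_cdf(3, 1.9)

# Cancer cluster example: h = 3 per 1000 person-years, 1500 persons for 2 years
hazard = 3 / 1000            # events per person-year
exposure = 2 * 1500          # person-years (T)
mu_cases = hazard * exposure # expected cases, mu = hT
p_exactly_15 = poisson_pmf(15, mu_cases)
p_15_or_more = 1 - poisson_cdf(14, mu_cases)

print(f"P(4 colds)          = {p_4_colds:.4f}")
print(f"P(4 or more colds)  = {p_4_or_more:.4f}")
print(f"Expected cases      = {mu_cases:.1f}")
print(f"P(exactly 15 cases) = {p_exactly_15:.6f}")
print(f"P(15 or more cases) = {p_15_or_more:.6f}")
```

Computing the upper tail as 1 minus the cumulative probability avoids summing an infinite series, which is the same trick the slides use ("find for 0-3, subtract from 1").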