BIOL 582 Lecture Set 3
Probability Distributions

Review
• We learned that we could empirically generate distributions for things like test statistics.
• Recall from R examples:
[Figure: histogram of frequency (y axis) against mean 1 - mean 2 (x axis). Two samples (n1 = 20, n2 = 30), mean 1 = mean 2, sd 1 = sd 2, 1000 random permutations.]
• Recall from R examples, but with a slight change:
[Figure: the same histogram with density (y axis) in place of frequency.]
By changing frequency to density, the height of the distribution at any point measures the relative probability (the probability density) of the value found on the x axis.
[Figure: the same density histogram with the standard normal curve overlaid in red.]
The red line is the probability density function for the standard normal distribution. It is “theoretical”. If the data fit this line well, it can be used as a proxy for estimating the probability of any event described on the x axis.

Probability distributions
• May be empirical (generated; becoming more common) or theoretical (based on probability theory; historically pervasive).
• There are several different theoretical probability distributions, each having utility under certain conditions.
• Theoretical probability distributions are often called parametric distributions, as the attributes of such distributions are influenced by the “behavior” of various parameters. (E.g., the shape of the normal distribution is influenced by the behavior of the mean and variance.)
• Inferential statistical methods that rely on parametric probability distributions are called parametric tests.
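The review above can be sketched in code. The following is a minimal illustration in Python (not the R examples from the course), assuming two normally distributed samples with n1 = 20 and n2 = 30, equal means, and equal standard deviations. The function names, the seeds, the 12 bins, and the step that standardizes the permuted differences before comparing them with the standard normal curve are all assumptions made for this sketch.

```python
import math
import random
import statistics


def permutation_differences(sample1, sample2, n_perm=1000, seed=1):
    """Return mean(group 1) - mean(group 2) for n_perm random relabelings."""
    rng = random.Random(seed)
    pooled = list(sample1) + list(sample2)
    n1 = len(sample1)
    diffs = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        diffs.append(statistics.fmean(pooled[:n1]) - statistics.fmean(pooled[n1:]))
    return diffs


def density_histogram(values, lo=-3.0, hi=3.0, n_bins=12):
    """Bin values and rescale counts so that the bar areas sum to at most 1."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    densities = [c / (len(values) * width) for c in counts]
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, densities


def standard_normal_pdf(x):
    """Theoretical density of the standard normal distribution."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


# Simulated data (an assumption for illustration): same mean, same sd.
rng = random.Random(582)
sample1 = [rng.gauss(0.0, 1.0) for _ in range(20)]
sample2 = [rng.gauss(0.0, 1.0) for _ in range(30)]

diffs = permutation_differences(sample1, sample2)

# Standardize so the empirical distribution is on the standard normal scale.
center = statistics.fmean(diffs)
spread = statistics.stdev(diffs)
z = [(d - center) / spread for d in diffs]

centers, densities = density_histogram(z)
for c, d in zip(centers, densities):
    print(f"{c:6.2f}  empirical={d:.3f}  normal={standard_normal_pdf(c):.3f}")
```

The two printed columns play the roles of the histogram bars and the red line: where they agree closely, the theoretical curve is a reasonable proxy for the empirical distribution.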
Probability distributions: terms and definitions
PMF: Probability mass function. The function used to form a discrete probability distribution (e.g., binomial, Poisson).
PDF: Probability density function. The function used to form a continuous probability distribution (e.g., normal, log-normal, t, Chi-square, F).
CMF, CDF: Cumulative mass function or cumulative distribution function. Determines the cumulative probability of a range of events, but is otherwise related to the PMF and PDF. We will not look at any, but they do exist and can be found easily in provided sources.
Integration: A concept of calculus, used to measure the area under the curve (AUC) of a PDF (for a PMF, the analogous operation is a sum). Usually written as
$$\int_a^b f(x)\,dx = F(b) - F(a) = \Pr(a \le X \le b)$$
The AUC of a probability function between limits a and b is the cumulative probability associated with those limits.
mode: Most frequently occurring value, or the greatest height of a PMF or PDF.
tail: Region of low probability for any PMF or PDF.
E(X): Expected value of a PMF or PDF; X is the variable of interest.
var(X): Variance of the distribution.
symmetric: Tails are similar, mode in center of distribution.
skewed: Tails are dissimilar, mode and mean not the same.
kurtosis: “Peakedness” of distribution: platykurtic vs. leptokurtic.

Common Distributions: Binomial Distribution
Type: Discrete
PMF:
$$\Pr(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
Parameters: n trials; p event probability (k events)
E(X) = np; var(X) = np(1 - p)
Use: Categorical data; logistic regression; any case where Bernoulli trials (success or failure outcome) are appropriate.
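As a quick check on the binomial PMF and its moments, here is a minimal Python sketch. The values n = 20 and p = 0.5 are chosen only for illustration, and `binomial_pmf` is a hypothetical helper name, not something from the lecture.

```python
import math


def binomial_pmf(k, n, p):
    """Pr(X = k) for n Bernoulli trials with event probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


n, p = 20, 0.5  # illustrative values only
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)  # a PMF sums to 1 over all possible k
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"sum of PMF = {total:.6f}")
print(f"E(X)   from PMF = {mean:.4f},  np      = {n * p:.4f}")
print(f"var(X) from PMF = {variance:.4f},  np(1-p) = {n * p * (1 - p):.4f}")
```

Summing k times the PMF over all k recovers np, and the corresponding sum of squared deviations recovers np(1 - p), which is one way to see where the E(X) and var(X) entries come from.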
Binomial examples: disease research, nesting success in birds, environmental sex determination in turtles.
[Figure (taken from Wikipedia): binomial distributions for n = 20 with p = 0.1 (blue), p = 0.5 (green), and p = 0.8 (red); the x axis is k (number of events = “success”).]

Common Distributions: Poisson Distribution
Type: Discrete
PMF:
$$\Pr(X = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$
Parameters: λ expected value (k events)
E(X) = var(X) = λ
Use: Count or ordinal data; Poisson (log-linear) regression; when one is interested in knowing the likelihood of countable random events.
E.g., modeling disease outbreaks, behavior studies, genetic mutation research
[Figure: comparison of Poisson distributions for lambda = 20, 12, 8, and 4; density (y axis) against X = k events (x axis).]

Common Distributions: Normal Distribution
Type: Continuous
PDF:
$$f(k) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(k-\mu)^{2}}{2\sigma^{2}}}$$
Standard normal means mean = 0 and standard deviation = 1.
Parameters: μ expected value; σ² variance (k event value)
E(X) = μ; var(X) = σ²
Use: Most commonly used distribution in statistical analyses. Many parametric tests assume normally distributed errors or model parameters. The central limit theorem (CLT) indicates that test statistics based on sums or means should be approximately normally distributed for large samples, even when derived from non-normal samples. Binomial distributions with large numbers of Bernoulli trials, and Poisson distributions with large λ, can be approximated by the normal distribution. This list can get really long!
[Figure: comparison of normal distributions for sd = 1, 2, 4, and 8; density (y axis) against X = k (x axis).]

Common Distributions: Lognormal Distribution
Type: Continuous
PDF:
$$f(k) = \frac{1}{k\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(\ln k-\mu)^{2}}{2\sigma^{2}}}, \quad k > 0$$
Parameters: μ expected value of ln X; σ² variance of ln X (k event value)
$$E(X) = e^{\mu + \sigma^{2}/2}, \qquad \mathrm{var}(X) = \left(e^{\sigma^{2}} - 1\right) e^{2\mu + \sigma^{2}}$$
Use: The log transformation is often used to produce a normally distributed variable (because of the link between the two distributions). Often the distribution of a variable that scales multiplicatively with another positive random variable (e.g., weight and length).
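The link between the lognormal and normal distributions can be checked by simulation. The sketch below is a minimal Python illustration, assuming μ = 1 and σ = 0.4 on the log scale (values chosen only for this example) and using the standard library's `random.lognormvariate`; the sample size and seed are likewise arbitrary.

```python
import math
import random
import statistics

mu, sigma = 1.0, 0.4  # mean and sd of ln X (illustrative values)
rng = random.Random(582)

# Draw lognormal values, then log-transform them back to the normal scale.
x = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]
log_x = [math.log(v) for v in x]

# Theoretical moments of the lognormal distribution.
expected_mean = math.exp(mu + sigma**2 / 2)
expected_var = (math.exp(sigma**2) - 1) * math.exp(2 * mu + sigma**2)

print(f"mean of X:    simulated={statistics.fmean(x):.4f}  theory={expected_mean:.4f}")
print(f"var of X:     simulated={statistics.variance(x):.4f}  theory={expected_var:.4f}")
print(f"mean of ln X: simulated={statistics.fmean(log_x):.4f}  mu={mu:.4f}")
print(f"sd of ln X:   simulated={statistics.stdev(log_x):.4f}  sigma={sigma:.4f}")
```

The simulated mean and variance of X should sit close to the theoretical values, while the log-transformed values should have mean and standard deviation close to μ and σ, which is why the log transformation is a common route to a normally distributed variable.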
Other lognormal examples: survival analysis, morphology, abundance studies.
[Figure: comparison of lognormal distributions (mu = 1) for log sd = 1, 0.7, 0.4, and 0.2; density (y axis) against X = k (x axis).]

Common Distributions: Student t Distribution
Type: Continuous
PDF:
$$f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad t = \frac{k - \mu}{s}$$
Parameters: ν degrees of freedom; n subjects; Γ gamma function (k event value)
$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\,dt$$
E(X) = 0; var(X) = ν / (ν - 2), for ν > 2
[Photo: William Sealy Gosset]
Use: Often used for t-test statistics of two-sample tests, paired tests, or comparisons of regression parameter estimates to a theoretical value (usually 0). Can be used for many different parameters. Also has a link to the normal and F distributions.
E.g., paired designs for before/after experimental treatments (dose/response), linear regression, correlation analysis.
[Figure: comparison of t distributions for df = 1, 3, 8, and 30, with the normal distribution; density (y axis) against t value (x axis).]
Notice that as n increases (meaning the df increases), the t-distribution converges on the normal distribution. One way to think of the t-distribution is that it is a standard normal distribution, corrected for small sample sizes.

Common Distributions: F Distribution
Type: Continuous
PDF:
$$f(k) = \frac{1}{k\,B\!\left(\frac{\nu_1}{2}, \frac{\nu_2}{2}\right)} \sqrt{\frac{(\nu_1 k)^{\nu_1}\, \nu_2^{\nu_2}}{(\nu_1 k + \nu_2)^{\nu_1 + \nu_2}}}, \qquad B(x, y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(x + y)}$$
Parameters: ν degrees of freedom (two parts, ν1 and ν2); Γ gamma function; B beta function (k event value)
$$E(X) = \frac{\nu_2}{\nu_2 - 2} \;\; (\nu_2 > 2), \qquad \mathrm{var}(X) = \frac{2\nu_2^{2}\,(\nu_1 + \nu_2 - 2)}{\nu_1\,(\nu_2 - 2)^{2}\,(\nu_2 - 4)} \;\; (\nu_2 > 4)$$
Use: The primary distribution for F statistics used in analysis of variance (ANOVA). Also used in population genetics. Rather universal for any research that involves evaluating components of linear models.
[Figure: comparison of F distributions for df = (4, 10), (4, 100), (8, 10), (8, 100), (12, 10), (12, 100), (20, 10), and (20, 100); density (y axis) against F value (x axis).]

Common Distributions: Χ2 Distribution
Type: Continuous
PDF:
$$f(k) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, k^{\nu/2 - 1} e^{-k/2}, \quad k > 0$$
Parameters: ν degrees of freedom; Γ gamma function (k event value)
E(X) = ν; var(X) = 2ν
Use: The primary distribution for Χ2 statistics used in likelihood ratio tests, contingency tables, and categorical analysis. Also used to “fit” other distributions.
E.g., allele frequencies, stepwise regression, model comparisons
[Figure: comparison of Chi-square distributions for df = 2, 4, 8, and 16; density (y axis) against Chi-square value (x axis).]

Final thoughts
• There are MANY more distributions. This is just a sample.
• These distributions are “simulations” for the distributions of variables, parameters, or test statistics. There are other ways to simulate distributions.
• These are all parametric distributions.
• Often, one asks if the data “fit” a distribution.
  • Using a PMF or PDF, one can estimate the expected values of a theoretical distribution.
  • One can then compare that to observed densities (frequencies).
$$\chi^{2} = \sum \frac{(O - E)^{2}}{E}$$
  • Which has degrees of freedom determined by the number of “bins” used for comparison (the number of bins minus 1, less one more for each parameter estimated from the data)
  • Thus, a theoretical distribution can be used to see if data fit some other distribution

This lecture should be referenced anytime we use a parametric test with a specific distributional form for estimating the probability of a type I error.
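The fit comparison described above can be sketched as follows. This is a minimal Python illustration with made-up counts, not data from the course: the observed frequencies, the hypothesized Poisson λ = 1.5, and the helper names are all assumptions. Because λ is specified in advance rather than estimated, the usual convention of df = number of bins minus 1 is used, and the upper-tail helper relies on a closed form that holds only for even degrees of freedom.

```python
import math


def poisson_pmf(k, lam):
    """Pr(X = k) for a Poisson distribution with expected value lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_upper_tail_even_df(x, df):
    """Pr(X >= x) for a Chi-square variable; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))


# Hypothetical observed counts for k = 0, 1, 2, 3, and 4 or more events.
observed = [20, 36, 24, 13, 7]
lam = 1.5  # hypothesized in advance, not estimated from the counts
n_total = sum(observed)

# Expected counts from the Poisson PMF; the last bin collects k >= 4.
probs = [poisson_pmf(k, lam) for k in range(4)]
probs.append(1.0 - sum(probs))
expected = [n_total * p for p in probs]

stat = chi_square_statistic(observed, expected)
df = len(observed) - 1  # bins minus 1, since no parameter was estimated
p_value = chi_square_upper_tail_even_df(stat, df)

print(f"Chi-square = {stat:.4f}, df = {df}, upper-tail probability = {p_value:.4f}")
```

A large upper-tail probability suggests the observed counts are consistent with the hypothesized distribution; a small one suggests a poor fit. For odd degrees of freedom, or when parameters are estimated from the data, a statistics library's Chi-square distribution function would be the more practical choice.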