BI230 Biostatistics
Chapter 5
Continuous distributions: uniform, normal, exponential

Probability density function
• p : R → [0, infinity) such that
  – for all x, p(x) ≥ 0
  – the integral of p(x) over all x equals 1
• P({x: a ≤ x ≤ b}) is the definite integral of the density function between a and b

Expectation, stdev of continuous r.v. X
• E[X] = µ = integral of x p(x) over all x
• V[X] = σ^2 = integral of (x-µ)^2 p(x) over all x
• stdev(X) = σ = sqrt(V[X])

Uniform density function
• p(x) = 1/(B-A) for A ≤ x ≤ B, and p(x) = 0 otherwise
• If A = -1 and B = +1, the density is the constant 1/2 on [-1, 1]
[Figure: uniform density curve for A = -1, B = +1]

Normal density function
• symmetric about mean µ
• Z-score = (x-µ)/σ
• p(x) equals a constant times e^(-(x-µ)^2/(2σ^2)), where the constant is 1/(σ sqrt(2π))

Standard normal density
• In standard normal
  – µ = 0
  – σ = 1

Standard normal density function
[Figure: standard normal density curve]
• By symmetry about the y-axis (the line x = 0), the mean of the standard normal distribution equals 0
• Integration by parts shows that the variance of the standard normal equals 1

Cumulative distribution for standard normal
• Φ(x) = (1/sqrt(2π)) times the integral of e^(-u^2/2) over -infinity < u ≤ x
• This expression equals the probability that X ≤ x, i.e. P({u: -infinity < u ≤ x})
• The integral cannot be solved analytically. Use Table A-2 at the back of the book, or use NORMDIST(x,0,1,TRUE) in Excel

Applications of normal distribution

Continuity correction
• The continuity correction is used when the continuous normal distribution is used to approximate the discrete binomial distribution: each integer k is treated as the interval from k-0.5 to k+0.5.

Use Φ to approximate binomial probability in following examples
Suppose p = 0.1, n = 100, so µ = np = 10 and σ = sqrt(npq) = 3
• P(less than 30 successes) = P(k < 30) ≈ Φ((29.5-10)/3)
• P(more than 30 successes) = P(k > 30) ≈ 1 - Φ((30.5-10)/3)
• P(exactly 30 successes) = P(k = 30) ≈ Φ((30.5-10)/3) - Φ((29.5-10)/3)
• P(at least 30 successes) = P(k ≥ 30) ≈ 1 - Φ((29.5-10)/3)
• P(at most 30 successes) = P(k ≤ 30) ≈ Φ((30.5-10)/3)

[Figure: Comparing cumulative binomial with normal distributions]
[Figure: Comparing binomial with normal distributions (noncumulative)]
[Figure: Superimposition of binomial and normal density graphs]

Sampling distribution of the mean
• Sampling distribution of the mean is the distribution of sample means for all samples of size n.
• Sampling distribution of the mean is approximately normal for large sample sizes (CENTRAL LIMIT THEOREM)
• The mean of all sample means for samples of size n equals the population mean.
  – for n=1, this is by definition of the population mean
  – inductively, one can establish the truth for any fixed sample size

Die roll example: sampling WITH replacement, sample size of 2, compute average of sample variances

Die roll example: sampling WITHOUT replacement, sample size of 2, compute average of sample variances
Note: 5.333 = (2+6+8)/3. This illustrates that the mean of the sample means equals the population mean.

Sampling distribution of a statistic
• More generally, the sampling distribution of any statistic is the distribution of that statistic over all samples of size n.
• For instance, the sampling distribution of the sample standard deviation (resp. variance) is the distribution of all sample standard deviations (resp. variances) over all samples of size n.

Solution (first for sampling without replacement)

Solution (second, for sampling with replacement)
• For more examples of this form, see my demo python program demoSampleMeansVar.py, in class web pages. See also
• outputDemoSampleMeansVarWithReplacement.txt and
• outputDemoSampleMeansVarWithoutReplacement.txt

[Figure: Bear age histogram. Mean: 43.52, Var: 1116.03, StDev: 33.41, Max: 177, Min: 8]

[Figure: Histogram of the sample means (X1+X2+X3)/3 for ALL size 3 samples of bear ages. Mean: 43.52, Var: 372.01, StDev: 19.29, Max: 177, Min: 8]
NOTE that the bear age stdev (previous slide) divided by sqrt(3), where 3 is the sample size, equals 19.29.

[Figure: Histogram of sample variances of ALL size 3 samples of bear ages. Mean: 1116.03, StDev: 1567.25, Max: 9520.33, Min: 0]
Note that the mean of all sample variances equals the population variance.
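The claims above about the mean of the sample means and the mean of the sample variances can be checked by brute-force enumeration. The following is a minimal sketch, not the course's demoSampleMeansVar.py: the population [2, 6, 8] is an assumption suggested by the note 5.333 = (2+6+8)/3, and the helper name sampling_summary is hypothetical.

```python
from itertools import permutations, product
from statistics import mean, pvariance, variance

# Assumed population, chosen to match the note 5.333 = (2+6+8)/3.
population = [2, 6, 8]
n = 2  # sample size


def sampling_summary(samples):
    """Return (mean of the sample means, mean of the sample variances)."""
    samples = list(samples)
    mean_of_means = mean(mean(s) for s in samples)
    # statistics.variance uses the n-1 denominator (sample variance).
    mean_of_vars = mean(variance(s) for s in samples)
    return mean_of_means, mean_of_vars


# All ordered samples of size n, with and without replacement.
with_repl = sampling_summary(product(population, repeat=n))
without_repl = sampling_summary(permutations(population, n))

print("population mean:    ", mean(population))
print("population variance:", pvariance(population))  # N denominator
print("with replacement    (mean of means, mean of variances):", with_repl)
print("without replacement (mean of means, mean of variances):", without_repl)
```

In both schemes the mean of the sample means matches the population mean. The mean of the sample variances matches the population variance only when sampling with replacement; without replacement it comes out larger by the factor N/(N-1).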
[Figure: Histogram of 10,000 repetitions of the following experiment: count the number of 2's when rolling a die 100 times]

Sampling distribution of the proportion
• Sampling distribution of the proportion is the distribution of all sample proportions for all samples of (equal and fixed) size n
• Proportion is number of "successes" over number of "trials" (number of people who plan to vote for candidate X over number of voters, number of red balls over number of balls in urn).

Properties of distribution of sample proportions
• sample proportions tend to target the value of the population proportion
  – the probability that a sample of size n contains r red balls, i.e. that the sample proportion equals r/n, is P(r/n) = h(n,r;N,R)
• for sufficiently large sample sizes, the distribution of sample proportions approximates a normal distribution (recall that the density curve of the hypergeometric is approximately normal for sufficiently large sample size).

Unbiased statistic
• Unbiased statistic is a statistic that targets the population parameter
• sample mean, sample variance, sample proportion are unbiased statistics that target population mean, variance and proportion
• median, range, stdev are statistics that do not target corresponding population parameters.

Central Limit Theorem: asymptotically, Z-scores of sample means converge to the standard normal distribution

Consequences of central limit theorem
• Sampling distribution of the mean (which by definition is the distribution of sample means for all samples of equal and fixed size n) is approximately normal for large sample sizes. The expected value of the sampling distribution of the mean is the population mean.
• Sampling distribution of the proportion is approximately normal for large sample sizes. The expected value of the sampling distribution of the proportion is the population proportion.

However ...
• Sampling distribution of the variance is NOT necessarily approximately normal. In fact, we saw such an example with bear age samples of size 3!
• Nevertheless, the average of the sample variances does equal the population variance, provided that the sampling is done with replacement.

Mean and stdev of sample means

Mean, variance of sample distribution
• Sampling with replacement (binomial):
  – Let X be r.v. counting number of successes in sample of size n, where probability of success is p
  – E[X] = np, V[X] = npq
  – Let Y be r.v. giving relative frequency of success in sample of size n
  – E[Y] = E[X/n] = np/n = p
  – V[Y] = V[X/n] = npq/n^2 = pq/n = σ^2/n, where σ^2 = pq is the variance of a single trial
  – stdev(Y) = sqrt(pq/n) = σ/sqrt(n)

Mean, variance of sample distribution
• Sampling without replacement (hypergeometric):
  – Let X be r.v. counting number of successes in sample of size n drawn from a population of size N, where probability of success is p = R/N
  – E[X] = np, V[X] = npq(N-n)/(N-1)
  – Let Y be r.v. giving relative frequency of success in sample of size n
  – E[Y] = E[X/n] = np/n = p
  – V[Y] = V[X/n] = V[X]/n^2 = (npq/n^2)(N-n)/(N-1) = (pq/n)(N-n)/(N-1)
  – stdev(Y) = stdev(X)/n = sqrt[(pq/n)(N-n)/(N-1)]

Proportion of successes in n trials
• Let X be r.v. that counts the number of successes in n trials, where probability of success is p (absolute frequency)
• Let Y be r.v. that returns the proportion of successes in n trials (relative frequency)
• E[Y] = E[X/n] = E[X]/n = np/n = p
• V[Y] = V[X/n] = V[X]/n^2 = npq/n^2 = pq/n

NOTE: We compute the Z-score (x-µ)/(σ/sqrt(n)) since we are working with the sampling distribution of the mean. Now compute Φ(Z-score).

Normality test: quantile plot
• sort data in increasing order x1,...,xn
• for each i compute z-score zi = NORMINV(i/n,0,1), or rather, following the book, zi = NORMINV((2i-1)/(2n),0,1)
• determine if scatter plot (xi,zi) is linear

[Figure: Quantile plot for temperatures. Vertical axis: Z-scores, from -3 to 3. Fitted trend line: y = 0.0304x - 1.6244]

Interarrival time, or distance between successive genomic motifs
What probability distribution is this?
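The quantile-plot steps listed under "Normality test" can be sketched in Python, with statistics.NormalDist().inv_cdf playing the role of NORMINV(p,0,1). This is an illustrative sketch only: the data are simulated (they are NOT the temperature data from the slide), the helper names are hypothetical, and using the correlation coefficient as a numeric stand-in for "looks linear" is an assumption rather than the book's procedure.

```python
import random
from statistics import NormalDist, mean


def normal_quantile_points(data):
    """Pair each sorted value x_i with z_i = NORMINV((2i-1)/(2n), 0, 1)."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist(0, 1)
    zs = [std_normal.inv_cdf((2 * i - 1) / (2 * n)) for i in range(1, n + 1)]
    return xs, zs


def pearson_r(xs, ys):
    """Correlation coefficient, used here as a rough measure of linearity."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: 100 simulated values with an assumed mean and stdev.
rng = random.Random(0)
data = [rng.gauss(98.2, 0.6) for _ in range(100)]

xs, zs = normal_quantile_points(data)
r = pearson_r(xs, zs)
print("correlation of the points (x_i, z_i):", round(r, 3))
```

A correlation close to 1 means the points (xi, zi) lie near a straight line, which is what the quantile plot checks visually.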
Exponential distribution
• continuous analogue of geometric distribution
[Figure: exponential distribution for different values of λ (called α in previous slide)]

Excel statistical functions
• average
• mode
• stdev
• stdevp
• var
• varp
• max, min, median
• QUARTILE
  – Returns the quartile of a data set (quartile = 0,1,2,3,4)
  – Syntax: QUARTILE(array,quart)
• PERCENTILE
  – Returns the k-th percentile of values in a range. For example, you can decide to examine candidates who score above the 90th percentile.
  – Syntax: PERCENTILE(array,k). Here k is in the interval [0,1], e.g. 0.9.
• STANDARDIZE
  – Returns Z-score corresponding to x for a distribution with given mean and stdev.
  – Syntax: STANDARDIZE(x,mean,stdev)
• BINOMDIST
  – Returns the individual term binomial distribution probability.
  – Syntax: BINOMDIST(k,n,p,cumulative). Used for sampling with replacement
  – Probability function when cumulative = false
  – Cumulative distribution function when cumulative = true
• HYPGEOMDIST
  – Returns the probability of a given number of sample successes, given the sample size, population successes, and population size.
  – Syntax: HYPGEOMDIST(r,n,R,N). Used for sampling without replacement
• PERMUT
  – Returns the number of ORDERED sequences of length k drawn from a set of size n. Permutations are different from combinations, for which order is not significant.
  – Syntax: PERMUT(n,k)
• FACT(k)
  – Returns the factorial k(k-1)(k-2)...1
  – use PERMUT and FACT to compute combinations: the number of combinations is PERMUT(n,k)/FACT(k)
• POISSON
  – Returns the Poisson distribution. A common application of the Poisson distribution is predicting the number of events over a specific time, such as the number of cars arriving at a toll plaza in 1 minute.
  – Syntax: POISSON(x,mean,cumulative)
• NORMDIST
  – Returns the normal distribution for the specified mean and standard deviation.
  – Syntax: NORMDIST(x,mean,stdev,cumulative)
• NORMSINV
  – Returns the inverse of the standard normal cumulative distribution. The distribution has a mean of zero and a standard deviation of one.
  – Syntax: NORMSINV(probability)

Tables from Book
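As an alternative to the table lookups and the Excel calls listed above, the same quantities can be computed with Python's standard library. The sketch below pairs BINOMDIST, NORMDIST and NORMSINV with rough Python counterparts and reuses the chapter's binomial setting p = 0.1, n = 100 to compare the exact cumulative binomial with its normal approximation. The helper names binom_pmf and binom_cdf and the cutoff k = 12 are assumptions made for illustration, not part of the slides.

```python
from math import comb
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Counterpart of BINOMDIST(k, n, p, FALSE): P(X = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """Counterpart of BINOMDIST(k, n, p, TRUE): P(X <= k)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


n, p = 100, 0.1
mu = n * p                        # np = 10
sigma = (n * p * (1 - p)) ** 0.5  # sqrt(npq) = 3
approx = NormalDist(mu, sigma)

k = 12  # hypothetical cutoff, chosen only for illustration
exact = binom_cdf(k, n, p)
with_cc = approx.cdf(k + 0.5)  # like NORMDIST(k+0.5, mu, sigma, TRUE)
without_cc = approx.cdf(k)     # like NORMDIST(k, mu, sigma, TRUE)

print(f"exact binomial P(X <= {k}):           {exact:.4f}")
print(f"normal approx, continuity correction: {with_cc:.4f}")
print(f"normal approx, no correction:         {without_cc:.4f}")

# Counterpart of NORMSINV(0.975): inverse of the standard normal CDF.
z = NormalDist().inv_cdf(0.975)
print(f"NORMSINV(0.975) counterpart: {z:.4f}")
```

For this cutoff the continuity-corrected value is the closer of the two approximations to the exact binomial probability.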