Lecture 5: Probability models: the normal distribution and the binomial distribution

Contents
Normal distribution for continuous data
Binomial distribution for binary categorical data

The Normal Distribution
The most important distribution in statistics.

Normal distribution
Introduction to the normal distribution
  History
  Parameters and shape
  Standard normal distribution and Z score
  Area under the curve
Application
  Estimating a frequency distribution
  Reference interval (range) in health-related fields

History of the Normal Distribution
Johann Carl Friedrich Gauss (1777~1855), Germany
One of the greatest mathematicians
Applied in physics and astronomy
Also known as the Gaussian distribution
Figure: Mark and stamp in memory of Gauss.

The Most Important Distribution
Many real-life distributions are approximately normal, such as height, FEV1, weight, IQ, and so on.
Many other distributions can be made approximately normal by an appropriate data transformation (e.g. taking the log). When log X has a normal distribution, X is said to have a lognormal distribution.

Figure: Frequency distributions of heights of adult men, panels (a) to (d).

Sample & Population
Sample: histogram. The area of the bars gives the cumulative relative frequency, that is, the proportion of boys aged 12 in the sample who are shorter than a specified height.
Population: normal distribution curve. The area under the curve gives the cumulative probability, that is, the chance that a normally growing boy aged 12 is shorter than a specified height.

Definition of the Normal Distribution
$X \sim N(\mu, \sigma^2)$: X follows a normal distribution with mean $\mu$ and variance $\sigma^2$.
The probability density function (PDF) f(X) for a normal distribution is given by

$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}, \qquad -\infty < X < +\infty$$

where e = 2.7182818285 is the base of the natural logarithm and $\pi$ = 3.1415926536 is the ratio of the circumference of a circle to its diameter.
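The density formula defined above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `normal_pdf` and its default arguments are illustrative choices, not part of the lecture.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x, term by term as in the PDF formula."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Height of the standard normal curve at its mean (about 0.3989).
peak = normal_pdf(0.0)

# The curve is symmetric about the mean: points equally far from mu on
# either side have the same density.
left = normal_pdf(-1.5, mu=0.0, sigma=1.0)
right = normal_pdf(1.5, mu=0.0, sigma=1.0)
```

Changing `mu` shifts the curve along the x axis without changing its shape, while increasing `sigma` lowers and widens it.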
The shape of a normal distribution
Figure: bell-shaped curve of $f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$ plotted against x.

Figure: Normal distributions with equal variance but different means.

Figure: Normal distributions with the same mean but different variances.

Properties of the Normal Distribution
$\mu$ and $\sigma$ completely determine the normal distribution.
The mean, median and mode are equal.
The curve is symmetric about the mean.
The relationship between $\sigma$ and the area under the normal curve provides another main characteristic of the normal distribution.

Areas under the Standard Normal Curve
A variable that has a normal distribution with mean 0 and variance 1, N(0,1), is called the standard normal variate and is commonly designated by the letter Z.
As with any continuous variable, probability calculations here are always concerned with finding the probability that the variable assumes any value in an interval between two specific points a and b.

Cumulative distribution function (the area under the curve)
The area from $-\infty$ to x is the cumulative probability; $S(-\infty, +\infty) = 1$.
Example: What is the probability of obtaining a z value of 0.5 or less? By symmetry, and using the tabulated area 0.3085 to the left of -0.5, we have $P(Z \le 0.5) = 1 - P(Z \le -0.5) = 1 - 0.3085 = 0.6915$.

Area under the standard normal distribution (Z)
Each entry is the area to the left of Z, where Z is the column value plus the row offset.

Offset    Z=-3.0   -2.5    -2.0    -1.9    -1.6    -1.0    -0.5     0
 0.00     0.0013  0.0062  0.0228  0.0287  0.0548  0.1587  0.3085  0.5000
-0.02     0.0013  0.0059  0.0217  0.0274  0.0526  0.1539  0.3015  0.4920
-0.04     0.0012  0.0055  0.0207  0.0262  0.0505  0.1492  0.2946  0.4840
-0.06     0.0011  0.0052  0.0197  0.0250  0.0485  0.1446  0.2877  0.4761
-0.08     0.0010  0.0049  0.0188  0.0239  0.0465  0.1401  0.2810  0.4681

Z is the standard score, that is, the value expressed in units of standard deviation.

Figure: Standard normal curve and some important divisions.
P(-1 < z < 1) = 0.6826
P(-2 < z < 2) = 0.9545
P(-3 < z < 3) = 0.9974

Find probability in Excel
Using a spreadsheet, find the area under the standard normal density to the left of 2.824.
We use the Excel 2007 function NORMSDIST evaluated at 2.824, that is NORMSDIST(2.824), which returns approximately 0.9976.

EXAMPLE
What is the probability of obtaining a z value between 1.0 and 1.58? We have $P(1.0 < Z < 1.58) = P(Z < 1.58) - P(Z < 1.0) = 0.9429 - 0.8413 = 0.1016$.

CUMULATIVE PROBABILITY FOR $X \sim N(\mu, \sigma^2)$
$Z = (X - \mu)/\sigma$, $X = \mu + Z\sigma$
Figure: x axis marked at $\mu-3\sigma$, $\mu-2\sigma$, $\mu-\sigma$, $\mu$, $\mu+\sigma$, $\mu+2\sigma$, $\mu+3\sigma$.

Areas under the Normal Curve
$S(-\infty, \mu-3\sigma) = 0.0013$
$S(-\infty, \mu-2\sigma) = 0.0228$
$S(-\infty, \mu-1\sigma) = 0.1587$
$S(-\infty, \mu) = 0.5$
$S(-\infty, \mu+1\sigma) = 0.8413$
$S(-\infty, \mu+2\sigma) = 0.9772$
$S(-\infty, \mu+3\sigma) = 0.9987$
$S(-\infty, +\infty) = 1$

Area Under Normal Curve
Cumulative areas                       Areas between adjacent divisions
$S(-\infty, \mu-3\sigma) = 0.0013$     $S(\mu-3\sigma, \mu-2\sigma) = 0.0215$
$S(-\infty, \mu-2\sigma) = 0.0228$     $S(\mu-2\sigma, \mu-1\sigma) = 0.1359$
$S(-\infty, \mu-1\sigma) = 0.1587$     $S(\mu-1\sigma, \mu) = 0.3413$
$S(-\infty, \mu) = 0.5$
Figure: x axis from $\mu-3\sigma$ to $\mu+3\sigma$ aligned with the Z axis from -3 to 3.

Area Under Normal Curve
95% of the area lies between Z = -1.96 and Z = +1.96, with 2.5% in each tail.
90% of the area lies between Z = -1.64 and Z = +1.64, with 5% in each tail.
99% of the area lies between Z = -2.58 and Z = +2.58, with 0.5% in each tail.
For example, 95% of the heights of females will fall in the range between mean - 1.96 SD and mean + 1.96 SD.

Z score, Standard Score
Transform $N(\mu, \sigma^2)$ to $N(0, 1)$: $z = (x - \mu)/\sigma$; z is referred to as the standard normal score.
How many SDs is the observation from the mean?
The transformation re-expresses a normal distribution in units of SD (z score, standard score).
In units of SD, we can compare observations from different populations, for example a female with height 172 cm and a male with height 172 cm.

Values of variable & area under curve

Observation distributed as normal (x)    Standard normal score (Z)    AUC (probability)
μ-1σ ~ μ+1σ                              -1 ~ +1                      68.27%
μ-1.96σ ~ μ+1.96σ                        -1.96 ~ +1.96                95.00%
μ-2.58σ ~ μ+2.58σ                        -2.58 ~ +2.58                99.00%

The area that falls in an interval under a nonstandard normal curve is the same as that under the standard normal curve within the corresponding Z boundaries.

The Most Important Distribution
In practice: Many real-life distributions are approximately normal, such as height, weight, IQ, GB and so on.
In theory: Many other distributions can be made approximately normal by an appropriate data transformation (e.g.
taking the log).

Summarizing
The fundamental probability distribution of statistics.
A very important distribution both in theory and in practice.
The normal distribution is a family of infinitely many curves, each defined by its mean and SD. N(0,1) is unique.
The areas under any normal curve are equal when measured in units of standard deviation.

Applications of the Normal Distribution
Estimate a frequency distribution
Estimate a reference range

Estimate frequency distribution
Example: Suppose birth weights follow a normal distribution with mean 3150 g and standard deviation 350 g. What proportion of infants have a birth weight below 2500 g?

Solution to the example
The standard normal deviate for x = 2500: $Z = (x - 3150)/350 = -1.86$
The probability that Z < -1.86 under the standard normal distribution: $\Phi(-1.86) = P(Z < -1.86) = 0.0314$
Result: about 3.14% of infants have a birth weight below 2500 g.

Estimate Frequency Distribution
Figure: normal curve with mean 3150; the area 0.0314 lies to the left of 2500.

Using the Normal Distribution
For any normally distributed variable, 95% of individuals have values between $\mu - 1.96\sigma$ and $\mu + 1.96\sigma$; 99% have values between $\mu - 2.58\sigma$ and $\mu + 2.58\sigma$; and so on.

Reference Interval (Range)
In health-related fields, a reference range or reference interval usually describes the variation of a measurement or value in healthy individuals.
It is a basis for a physician or other health professional to interpret a set of results for a particular patient.
The standard definition of a reference range (usually the one meant if not otherwise specified) originates in what is most prevalent in a reference group taken from the population.
However, there are also optimal health ranges, which are those that appear to have the optimal health impact on people.

Reference Interval (Range)
What is it? A range of values within which the majority of measurements from "normal" subjects will lie.
Majority: 90%, 95%, 99%, etc.
Usage: used as the basis for assessing the result of diagnostic tests in the clinic (normal?
abnormal?).
Definition of "normal subject": normal, healthy; the subject may suffer from other diseases, as long as these do not influence the variable under study.

How to estimate a reference interval?
Homogeneity of normal subjects. 100
Measurement errors are controlled.
One side? Two sides?
Majority? 90%, 95%?
Is it necessary to estimate the RI in subgroups? (Consider partitioning by age, sex, etc.)
Determine the suspect range if necessary.

Two-sided or One-sided
Determined by the medical professional.
Two-sided: WBC, BP, serum total cholesterol, ...
One-sided, upper limit: urine Ld, hair Hg, ... (normal as long as the value is lower than the upper limit)
One-sided, lower limit: vital capacity, IQ, FEV1 (forced expiratory volume in one second) (normal as long as the value is greater than the lower limit)

Figure: Overlapping distributions of observations for normal and abnormal subjects (one-sided), showing the cutoff value, the false-negative rate and the false-positive rate.

Figure: Overlapping distributions of observations for normal and abnormal subjects (two-sided), with abnormal subjects on both sides of the normal subjects, showing the false-negative rate and the false-positive rate.

Normal approximation method
For normally distributed data, a 95% reference interval is:
Two-sided: $\bar{X} \pm 1.96 s$
One-sided, upper limit: $\bar{X} + 1.64 s$
One-sided, lower limit: $\bar{X} - 1.64 s$

Percentile Method
For non-normally distributed data, a 95% reference interval is:
Two-sided: P2.5 ~ P97.5
One-sided, upper limit: < P95
One-sided, lower limit: > P5

Example
Hb (hemoglobin) for 360 normal males.
The mean is 13.45 g/100ml; the standard deviation is 0.71 g/100ml; Hb is normally distributed.
Estimate the 95% reference range and the 90% reference range.

Example (cont.)
Two-sided: $\bar{X} \pm 1.96 s$
$\bar{X} - 1.96 s = 13.45 - 1.96 \times 0.71 = 12.06$ (g/100ml)
$\bar{X} + 1.96 s = 13.45 + 1.96 \times 0.71 = 14.84$ (g/100ml)
The 95% reference range is 12.06~14.84 (g/100ml).

Example (cont.)
Two-sided: $\bar{X} \pm 1.64 s$
$\bar{X} - 1.64 s = 13.45 - 1.64 \times 0.71 = 12.29$ (g/100ml)
$\bar{X} + 1.64 s = 13.45 + 1.64 \times 0.71 = 14.61$ (g/100ml)
The 90% reference range is 12.29~14.61 (g/100ml).
The 95% reference range is 12.06~14.84 (g/100ml).

Two methods for reference intervals

Method        Two-sided                         One-sided, lower              One-sided, upper
Normal        $\bar{X} \pm u_{\alpha/2}\, s$    $\bar{X} - u_{\alpha}\, s$    $\bar{X} + u_{\alpha}\, s$
Percentile    P2.5 ~ P97.5                      > P5                          < P95

Central Limit Theorem
As the sample size increases, the distribution of the means of samples drawn from a population of any distribution approaches the normal distribution. This is known as the central limit theorem (CLT).
Topics: sampling distributions; probability and the central limit theorem.

Sampling distribution
A sampling distribution is the probability distribution of a sample statistic that is formed when samples of size n are repeatedly taken from a population.
Example: the sampling distribution of sample means.

Binomial Distribution
Probability model for discrete data

Review
Binary qualitative data
Rate (incidence) / proportion (prevalence)

Tossing a coin
What is the probability that you flip exactly 3 heads in 5 coin tosses?

$P(\text{3 heads and 2 tails}) = \binom{5}{3}\, P(\text{heads})^3\, P(\text{tails})^2 = 10 \times (0.5)^3 (0.5)^2 = 31.25\%$

$\binom{5}{3}$ is the number of ways to arrange 3 heads in 5 trials: $\binom{5}{3} = \frac{5!}{3!\,2!} = 10$.

Outcome    Probability
THHHT      $(1/2)^3 \times (1/2)^2$
HHHTT      $(1/2)^3 \times (1/2)^2$
TTHHH      $(1/2)^3 \times (1/2)^2$
HTTHH      $(1/2)^3 \times (1/2)^2$
HHTTH      $(1/2)^3 \times (1/2)^2$
HTHHT      $(1/2)^3 \times (1/2)^2$
THTHH      $(1/2)^3 \times (1/2)^2$
HTHTH      $(1/2)^3 \times (1/2)^2$
HHTHT      $(1/2)^3 \times (1/2)^2$
THHTH      $(1/2)^3 \times (1/2)^2$

There are 10 arrangements, and each unique outcome has the same probability, $(1/2)^3 \times (1/2)^2$.
Factorial review: n! = n(n-1)(n-2)...

Binomial distribution function
X = the number of heads tossed in 5 coin tosses.
Figure: bar chart of p(x) against x, the number of heads, for x = 0, 1, 2, 3, 4, 5.

Example: side effect of a drug
If a certain drug is known to cause a side effect 10% of the time and five patients are given this drug, what is the probability that four or more experience the side effect?
Let S denote a side-effect outcome and N an outcome without side effects.
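The coin-toss calculation and the side-effect question above both follow the same pattern: count the arrangements, then multiply by the probability of one arrangement. The sketch below uses only the Python standard library; the function name `binomial_probability` is an illustrative choice, not something defined in the lecture.

```python
from math import comb


def binomial_probability(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials,
    each with success probability p: C(n, x) * p^x * (1 - p)^(n - x)."""
    if not 0 <= x <= n:
        raise ValueError("x must be between 0 and n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


# Exactly 3 heads in 5 tosses of a fair coin: 10 * (0.5)^3 * (0.5)^2.
three_heads = binomial_probability(3, 5, 0.5)

# Four or more of five patients with a side effect, when each patient
# has a 10% chance: add the probabilities for x = 4 and x = 5.
four_or_more = sum(binomial_probability(x, 5, 0.1) for x in (4, 5))

# Sanity check: the probabilities over x = 0..5 add up to 1.
total = sum(binomial_probability(x, 5, 0.5) for x in range(6))
```

Here `three_heads` reproduces the 31.25% obtained for the coin example, and `four_or_more` comes out at about 0.00046.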
Solution to example
Each outcome with four S's and one N has probability $(0.1)^4 (0.9)$, and there are 5 such outcomes, so the probability of obtaining four S's and one N is $5 (0.1)^4 (0.9) = 0.00045$.
The probability of obtaining all five S's is $(0.1)^5 = 0.00001$.
The probability of the compound event that "four or more have side effects" is $0.00045 + 0.00001 = 0.00046$.

Probability density function (PDF)
The model is concerned with the total number of successes in n trials as a random variable, denoted by X. Its probability density function is given by

$$P(X = x) = \binom{n}{x}\, \pi^x (1 - \pi)^{n - x}, \qquad x = 0, 1, \ldots, n$$

where $\binom{n}{x} = \frac{n!}{x!\,(n - x)!}$ is the number of combinations of x objects selected from a set of n, and $\pi$ is the probability of a positive outcome from a single trial.

Assumptions for the Binomial Distribution
The experiment consists of n repeated trials satisfying these assumptions:
1. The n trials are all independent.
2. The probability $\pi$ of a positive outcome (one of the 2 possible outcomes) is the same for each trial.

The mean and variance of the binomial distribution
$\mu_x = n\pi$
$\sigma_x^2 = n\pi(1 - \pi)$, so $\sigma_x = \sqrt{n\pi(1 - \pi)}$
When the number of trials n is moderate to large (n > 25, say), we approximate the binomial distribution by a normal distribution and answer probability questions by first converting to a standard normal score:

$$z = \frac{x - n\pi}{\sqrt{n\pi(1 - \pi)}}$$

where $\pi$ is the probability of having a positive outcome from a single trial.

Solution to Example
For $\pi$ = 0.1 and n = 30, we have $\mu_x = 30 \times 0.1 = 3$ and $\sigma_x^2 = 30 \times 0.1 \times 0.9 = 2.7$.

PDF
Figure: binomial probability functions P(X) against X for (n = 20, π = 0.5), (n = 5, π = 0.3), (n = 10, π = 0.3) and (n = 30, π = 0.3).

Review: experiment & survey
Two types of research: experimental and observational.
Clinical trial (4 phases)
Statistical considerations in clinical trials: control / randomization / blinding / replication (appropriate sample size).
Probabilistic sampling techniques

Review on the idea of probability

Idea of probability
Definitions of probability
Classical probability: if a random experiment can result in n possible mutually exclusive and equally likely outcomes, and $n_A$ of these outcomes have an attribute A, then the probability of A is $P(A) = n_A/n$.
Statistical probability: if an experiment is performed n times and $n_A$ of these result in the outcome A, then the probability of A occurring is defined as the limiting ratio $P(A) = \lim_{n \to \infty} n_A/n$.
Subjective probability: probability represents one's belief regarding the likelihood of an outcome A occurring.
Probability of an event = p, with 0 <= p <= 1.

Rules for Computing
If A and B have no outcomes in common, they cannot occur simultaneously; they are mutually exclusive events, and P(A or B) = P(A) + P(B).
If events A and B are independent, then P(A & B) = P(A) * P(B).

Conditional Probability
Concerns the probability of one event occurring, given that another event has occurred.
P(A|B) = probability of A, given B.
If A and B are independent, then P(B|A) = P(A) * P(B) / P(A), that is, P(B|A) = P(B).

Percentile calculation

Quartiles
Quartiles divide data into four equal parts.
First quartile, Q1: 25% of observations are below Q1 and 75% above Q1. Also called the lower quartile.
Second quartile, Q2: 50% of observations are below Q2 and 50% above Q2. This is also the median.
Third quartile, Q3: 75% of observations are below Q3 and 25% above Q3. Also called the upper quartile.

Calculating percentiles
Example: The sorted observations are 2, 5, 9, 12, 14, 15, 18, 24, 60. Find the median and P20.
Solution: The number of observations is n = 9.
General rule: $P_X = X_{(i)}$ with $i = (n + 1) \times X\%$, where $X_{(i)}$ is the i-th sorted observation.
$P_{50} = X_{((n+1) \times 50\%)} = X_{(5)} = 14$
$P_{20} = X_{((n+1) \times 20\%)} = X_{(2)} = 5$

Calculating percentiles
The sorted observations are 4, 9, 10, 12, 14, 20, 24, 61. Find the median and P20.
$(n + 1) \times 20\% = 1.8$
When $i = (n + 1) \times X\%$ is not an integer, interpolate between the neighbouring observations, with j the integer part of i:
$P_X = X_{(j)} + (X_{(j+1)} - X_{(j)}) (i - j)$
$P_{20} = X_{(1)} + (X_{(2)} - X_{(1)}) (1.8 - 1) = 4 + (9 - 4) \times 0.8 = 8$

Calculation of percentile from a grouped frequency table
Example: The frequency distribution for the systolic blood pressure readings (in mm of mercury) of 200 randomly selected college students is shown here.

Boundaries    Frequency (f)    Cumulative frequency (C)    Cumulative percent (%)
 89.5-             24                 24                          12
104.5-             62                 86                          43
119.5-             72                158                          79
134.5-             26                184                          92
149.5-             12                196                          98
164.5-              4                200                         100

The class interval that contains the relevant quartile is called the quartile class.

Calculation of quartiles from a grouped frequency table

$$Q_1 = L + \frac{\frac{n}{4} - C}{f} \times i, \qquad Q_3 = L + \frac{\frac{3n}{4} - C}{f} \times i$$

where:
L = the real lower limit of the quartile class (containing Q1 or Q3)
n = Σf = the total number of observations in the entire data set
C = the cumulative frequency in the class immediately before the quartile class
f = the frequency of the relevant quartile class
i = the length of the real class interval of the relevant quartile class

For the blood pressure data, n/4 = 50 falls in the class 104.5-, so L = 104.5, C = 24, f = 62 and i = 15:

$$Q_1 = P_{25} = 104.5 + \frac{50 - 24}{62} \times 15 \approx 110.79$$

Calculation of percentile from a grouped frequency table
The same formula gives any percentile, with the percentile class taking the place of the quartile class:

$$P_X = L + \frac{n \times X\% - C}{f} \times i$$
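Both percentile rules above are short enough to check by machine. The sketch below is one possible Python rendering, not the lecture's own code: the names `percentile` and `grouped_percentile` are hypothetical, and clamping to the smallest or largest observation when the position falls outside 1..n is an added assumption that the lecture does not discuss.

```python
def percentile(sorted_values, pct):
    """Percentile of raw data by the (n + 1) rule with linear interpolation.

    sorted_values must already be in ascending order; pct is in percent.
    """
    n = len(sorted_values)
    i = (n + 1) * pct / 100.0   # 1-based position of the percentile
    j = int(i)                  # integer part of the position
    if j < 1:                   # assumption: clamp below the first value
        return sorted_values[0]
    if j >= n:                  # assumption: clamp above the last value
        return sorted_values[-1]
    lower = sorted_values[j - 1]
    upper = sorted_values[j]
    return lower + (upper - lower) * (i - j)


def grouped_percentile(lower_limits, frequencies, width, pct):
    """Percentile from a grouped table: L + (n * pct% - C) / f * i."""
    n = sum(frequencies)
    target = n * pct / 100.0
    cumulative = 0              # C: cumulative frequency before the class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("pct must be between 0 and 100")


p20_raw = percentile([4, 9, 10, 12, 14, 20, 24, 61], 20)
median_raw = percentile([2, 5, 9, 12, 14, 15, 18, 24, 60], 50)

# Systolic blood pressure table: real lower limits, frequencies, width 15.
limits = [89.5, 104.5, 119.5, 134.5, 149.5, 164.5]
freqs = [24, 62, 72, 26, 12, 4]
q1_grouped = grouped_percentile(limits, freqs, 15, 25)
```

With the table's lower boundary L = 104.5 for the class 104.5-, `q1_grouped` evaluates to about 110.79, and `p20_raw` reproduces the value 8 from the interpolation example.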