Download 05 Normal distribution and binomial distri

Lecture 5: Probability models: normal distribution & binomial distribution

Contents
- Normal distribution for continuous data
- Binomial distribution for binary categorical data

The Normal Distribution
The most important distribution in statistics.

Normal distribution
- Introduction to the normal distribution
  - History
  - Parameters and shape
  - Standard normal distribution and Z score
  - Area under the curve
- Application
  - Estimating a frequency distribution
  - Reference interval (range) in health-related fields

History of the Normal Distribution
- Johann Carl Friedrich Gauss (1777~1855), Germany
- One of the greatest mathematicians
- Applied in physics and astronomy
- Also known as the Gaussian distribution
- [Figure: banknote (Mark) and stamp in memory of Gauss]

The Most Important Distribution
- Many real-life distributions are approximately normal, such as height, FEV1, weight, IQ, and so on.
- Many other distributions can be made approximately normal by an appropriate data transformation (e.g. taking the log). When log X has a normal distribution, X is said to have a lognormal distribution.

[Figure: frequency distributions of heights of adult men, panels (a) to (d)]

Sample & Population
- Sample: histogram. The area of the bars gives the cumulative relative frequency, i.e. the proportion of boys aged 12 in the sample who are shorter than a specified height.
- Population: normal distribution curve. The area under the curve gives the cumulative probability, i.e. the chance that a boy aged 12 is shorter than a specified height if he grows normally.

Definition of Normal distribution
- X ~ N(μ, σ²): X follows a normal distribution with mean μ and variance σ².
- The probability density function (PDF) f(X) of a normal distribution is given by
  $$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}, \qquad -\infty < X < +\infty$$
  where e = 2.7182818285 is the base of the natural logarithm and π = 3.1415926536 is the ratio of the circumference of a circle to its diameter.
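The density formula above can be evaluated directly. The following is a minimal Python sketch, not part of the original slides: the function names `normal_pdf` and `area_under_pdf` are illustrative, and the trapezoidal rule is just one simple way to check numerically that the total area under the curve is close to 1.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of N(mu, sigma^2) at x, following the PDF formula above."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def area_under_pdf(mu: float, sigma: float, lo: float, hi: float,
                   steps: int = 20000) -> float:
    """Trapezoidal estimate of the area under the density between lo and hi."""
    h = (hi - lo) / steps
    total = 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma))
    for k in range(1, steps):
        total += normal_pdf(lo + k * h, mu, sigma)
    return total * h


if __name__ == "__main__":
    # Peak height of the standard normal curve (about 0.4, as in the slide's plot).
    print(round(normal_pdf(0.0), 4))
    # Area within 8 SDs of the mean is numerically indistinguishable from 1.
    print(round(area_under_pdf(10.0, 2.0, 10.0 - 16.0, 10.0 + 16.0), 6))
```

Changing `mu` shifts the curve left or right without changing its shape, while changing `sigma` makes it flatter or more peaked.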
The shape of a normal distribution
$$f(X) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X-\mu)^2}{2\sigma^2}}$$
[Figure: bell-shaped curve of f(X) against X; the vertical axis runs from 0 to 0.4]

[Figure: normal distributions with equal variance but different means (curves 1, 2, 3)]

[Figure: normal distributions with the same mean but different variances (curves 1, 2, 3)]

Properties of the Normal Distribution
- μ and σ completely determine the normal distribution.
- Mean, median, and mode are equal.
- The curve is symmetric about the mean.
- The relationship between σ and the area under the normal curve provides another main characteristic of the normal distribution.

Areas under the Standard Normal Curve
- A variable that has a normal distribution with mean 0 and variance 1 is called the standard normal variate and is commonly designated by the letter Z: Z ~ N(0, 1).
- As with any continuous variable, probability calculations here are always concerned with finding the probability that the variable assumes a value in an interval between two specific points a and b.

Cumulative distribution function
- Φ(x) = P(Z ≤ x) is the area under the curve from -∞ to x, i.e. the cumulative probability. The total area is S(-∞, +∞) = 1.
- Example: What is the probability of obtaining a z value of 0.5 or less?
- We have P(z ≤ 0.5) = 1 - P(z ≤ -0.5) = 1 - 0.3085 = 0.6915.

Area under the standard normal distribution (Z)
Cumulative area to the left of Z. Rows give Z to one decimal place; columns give the second decimal.

Z       0.00     -0.02    -0.04    -0.06    -0.08
-3.0    0.0013   0.0013   0.0012   0.0011   0.0010
-2.5    0.0062   0.0059   0.0055   0.0052   0.0049
-2.0    0.0228   0.0217   0.0207   0.0197   0.0188
-1.9    0.0287   0.0274   0.0262   0.0250   0.0239
-1.6    0.0548   0.0526   0.0505   0.0485   0.0465
-1.0    0.1587   0.1539   0.1492   0.1446   0.1401
-0.5    0.3085   0.3015   0.2946   0.2877   0.2810
 0      0.5000   0.4920   0.4840   0.4761   0.4681

Z is the standard score, that is, the value measured in units of standard deviation.

Figure: Standard normal curve and some important divisions
- P(-1 < z < 1) = 0.6826
- P(-2 < z < 2) = 0.9545
- P(-3 < z < 3) = 0.9974

Find probability in Excel
- Using an electronic table (spreadsheet), find the area under the standard normal density to the left of 2.824.
- We use the Excel 2007 function NORMSDIST evaluated at 2.824, i.e. NORMSDIST(2.824), which returns approximately 0.9976.

Example
- What is the probability of obtaining a z value between 1.0 and 1.58?
- We have P(1.0 ≤ z ≤ 1.58) = Φ(1.58) - Φ(1.0) = 0.9429 - 0.8413 = 0.1016.

Cumulative probability for X ~ N(μ, σ²)
- Standardize: Z = (X - μ)/σ
- Back-transform: X = μ + Zσ
- [Figure: normal curve with the x axis marked at μ-3σ, μ-2σ, μ-σ, μ, μ+σ, μ+2σ, μ+3σ]

Areas under the Normal Curve
Cumulative area to the left of each point:
S(-∞, μ-3σ) = 0.0013
S(-∞, μ-2σ) = 0.0228
S(-∞, μ-1σ) = 0.1587
S(-∞, μ) = 0.5
S(-∞, μ+1σ) = 0.8413
S(-∞, μ+2σ) = 0.9772
S(-∞, μ+3σ) = 0.9987
S(-∞, +∞) = 1
[Figure: normal curve with an X axis (μ-3σ to μ+3σ) aligned with a Z axis (-4 to 4)]

Area Under Normal Curve
Cumulative area             Area between adjacent points
S(-∞, μ-3σ) = 0.0013        S(μ-3σ, μ-2σ) = 0.0215
S(-∞, μ-2σ) = 0.0228        S(μ-2σ, μ-1σ) = 0.1359
S(-∞, μ-1σ) = 0.1587        S(μ-1σ, μ) = 0.3413
S(-∞, μ) = 0.5

Area Under Normal Curve: central area and tails
Central area   Each tail   Limits (Z)
90%            5%          -1.64 to +1.64
95%            2.5%        -1.96 to +1.96
99%            0.5%        -2.58 to +2.58
For example, 95% of the heights of females will fall in the range between mean - 1.96 SD and mean + 1.96 SD.

Z score, Standard Score
- Transform N(μ, σ²) to N(0, 1) with z = (x - μ)/σ; z is referred to as the standard normal score.
- How many SDs is the observation from the mean?
- The transformation re-expresses a normal distribution in units of SD (z score, standard score).
- In units of SD, we can compare observations from different populations, e.g. a female with height 172 cm and a male with height 172 cm.

Values of variable & area under curve
Observation, normal (x)    Standard normal score (Z)    AUC (probability)
μ-1σ ~ μ+1σ                -1 ~ +1                      68.27%
μ-1.96σ ~ μ+1.96σ          -1.96 ~ +1.96                95.00%
μ-2.58σ ~ μ+2.58σ          -2.58 ~ +2.58                99.00%
The area that falls in an interval under a nonstandard normal curve is the same as that under the standard normal curve within the corresponding Z boundaries.
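The table look-ups and the Excel call above can be reproduced with Python's standard library. This is a minimal sketch, not the slides' own method: it assumes Python 3.8+ for `statistics.NormalDist`, and the helper names `phi`, `prob_between`, `z_score`, and `central_limits` are illustrative.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # Z ~ N(0, 1)


def phi(z: float) -> float:
    """Cumulative area under the standard normal curve to the left of z."""
    return STD_NORMAL.cdf(z)


def prob_between(a: float, b: float) -> float:
    """P(a < Z < b) as a difference of two cumulative areas."""
    return phi(b) - phi(a)


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize an observation: Z = (X - mu) / sigma."""
    return (x - mu) / sigma


def central_limits(coverage: float) -> tuple:
    """Z limits that enclose the given central area, e.g. 0.95 -> about +/-1.96."""
    upper = STD_NORMAL.inv_cdf(0.5 + coverage / 2.0)
    return (-upper, upper)


if __name__ == "__main__":
    print(round(phi(0.5), 4))                 # z of 0.5 or less
    print(round(phi(2.824), 4))               # same quantity as NORMSDIST(2.824)
    print(round(prob_between(1.0, 1.58), 4))  # z between 1.0 and 1.58
    for coverage in (0.90, 0.95, 0.99):
        print(coverage, [round(v, 2) for v in central_limits(coverage)])
```

Because areas depend only on the distance from the mean in SD units, any question about X ~ N(μ, σ²) can be answered by calling `z_score` first and then `phi`.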
The Most Important Distribution
- In practice: many real-life distributions are approximately normal, such as height, weight, IQ, GB and so on.
- In theory: many other distributions can be made approximately normal by an appropriate data transformation (e.g. taking the log).

Summarizing
- The fundamental probability distribution of statistics.
- A very important distribution both in theory and in practice.
- The normal distribution is a family of curves, each defined by its mean and SD; there are infinitely many of them.
- The standard normal distribution N(0, 1) is unique.
- The areas under any normal curve are equal when measured in units of standard deviation.

Applications of the Normal Distribution
- Estimating a frequency distribution
- Estimating a reference range

Estimate frequency distribution
Example:
- Suppose birth weights follow a normal distribution with mean 3150 g and standard deviation 350 g.
- Estimate the proportion of infants whose birth weight is less than 2500 g.

Solution to the example:
- The standard normal deviate for x = 2500: Z = (2500 - 3150)/350 = -1.86
- The probability that Z < -1.86 under the standard normal distribution: Φ(-1.86) = P(z < -1.86) = 0.0314
- Result: about 3.14% of infants have a birth weight below 2500 g.
- [Figure: normal curve centred at 3150 with the area 0.0314 to the left of 2500 shaded]

Using the Normal Distribution
- For any normally distributed variable, 95% of individuals have values between μ-1.96σ and μ+1.96σ;
- 99% have values between μ-2.58σ and μ+2.58σ;
- and so on.

Reference Interval (Range)
- In health-related fields, a reference range or reference interval usually describes the variation of a measurement or value in healthy individuals.
- It is a basis for a physician or other health professional to interpret a set of results for a particular patient.
- The standard definition of a reference range (the one meant if not otherwise specified) is based on what is most prevalent in a reference group taken from the population.
However, there are also optimal health ranges, which are those that appear to have the optimal health impact on people.

Reference Interval (Range)
- What is it?
  - A range of values within which the majority of measurements from "normal" subjects will lie.
  - Majority: 90%, 95%, 99%, etc.
- Usage:
  - Used as the basis for assessing the result of diagnostic tests in the clinic (normal or abnormal?).
- Definition of "normal subject":
  - Normal is not the same as healthy.
  - Normal subjects may suffer from other diseases, as long as these do not influence the variable being studied.

How to estimate a reference interval?
- Homogeneity of normal subjects.
- Sample size: 100.
- Measurement errors are controlled.
- One-sided or two-sided?
- Which majority: 90%, 95%?
- Is it necessary to estimate the RI in subgroups (partitioning by age, sex, etc.)?
- Determine the suspect range if necessary.

Two-sided or One-sided
- Determined by medical professionals.
- Two-sided: WBC, BP, serum total cholesterol, ...
- One-sided:
  - Upper limit: urine Ld, hair Hg, ... Normal as long as the value is lower than the limit.
  - Lower limit: vital capacity, IQ, FEV1 (forced expiratory volume in one second). Normal as long as the value is greater than the limit.

[Figure: overlapping distributions of observations for normal and abnormal subjects (one-sided), showing the cutoff value, the false-positive rate and the false-negative rate]

[Figure: overlapping distributions of observations for normal and abnormal subjects (two-sided), with abnormal subjects on both sides of the normal subjects, showing the false-positive and false-negative rates]

Normal approximation method
- For normally distributed data
- A 95% reference interval:
  - Two-sided: $\bar{X} \pm 1.96s$
  - One-sided, upper limit: $\bar{X} + 1.64s$
  - One-sided, lower limit: $\bar{X} - 1.64s$

Percentile method
- For non-normally distributed data
- A 95% reference interval:
  - Two-sided: P2.5 ~ P97.5
  - One-sided, upper limit: < P95
  - One-sided, lower limit: > P5

Example
- Hb (hemoglobin) for 360 normal males.
- The mean is 13.45 g/100ml;
- The standard deviation is 0.71 g/100ml;
- Hb is normally distributed.
- Estimate the 95% reference range and the 90% reference range.

Example (cont.)
- Two-sided, $\bar{X} \pm 1.96s$:
  $\bar{X} - 1.96s = 13.45 - 1.96 \times 0.71 = 12.06$ (g/100ml)
  $\bar{X} + 1.96s = 13.45 + 1.96 \times 0.71 = 14.84$ (g/100ml)
- The 95% reference range is 12.06~14.84 (g/100ml).

Example (cont.)
- Two-sided, $\bar{X} \pm 1.64s$:
  $\bar{X} - 1.64s = 13.45 - 1.64 \times 0.71 = 12.29$ (g/100ml)
  $\bar{X} + 1.64s = 13.45 + 1.64 \times 0.71 = 14.61$ (g/100ml)
- The 90% reference range is 12.29~14.61 (g/100ml).
- The 95% reference range is 12.06~14.84 (g/100ml).

Two methods for reference intervals
Method       Two-sided                      One-sided, lower limit     One-sided, upper limit
Normal       $\bar{X} \pm u_{\alpha/2} s$   $\bar{X} - u_{\alpha} s$   $\bar{X} + u_{\alpha} s$
Percentile   P2.5 ~ P97.5                   > P5                       < P95

Central Limit Theorem
- As the sample size increases, the distribution of the means of samples drawn from a population with any distribution approaches the normal distribution. This is known as the central limit theorem (CLT).
- Related topics: sampling distributions; probability and the central limit theorem.

Sampling distribution
- A sampling distribution is the probability distribution of a sample statistic that is formed when samples of size n are repeatedly taken from a population.
- Example: the sampling distribution of sample means.

Binomial Distribution
Probability model for discrete data

Review
- Binary qualitative data
- Rate (incidence) / proportion (prevalence)

Tossing a coin
- What is the probability that you flip exactly 3 heads in 5 coin tosses?
- P(3 heads & 2 tails) = 5C3 × P(heads)^3 × P(tails)^2 = 10 × (0.5)^3 × (0.5)^2 = 31.25%
- 5C3 = $\binom{5}{3}$ is the number of ways to arrange 3 heads in 5 trials:

Outcome   Probability
THHHT     (1/2)^3 × (1/2)^2
HHHTT     (1/2)^3 × (1/2)^2
TTHHH     (1/2)^3 × (1/2)^2
HTTHH     (1/2)^3 × (1/2)^2
HHTTH     (1/2)^3 × (1/2)^2
HTHHT     (1/2)^3 × (1/2)^2
THTHH     (1/2)^3 × (1/2)^2
HTHTH     (1/2)^3 × (1/2)^2
HHTHT     (1/2)^3 × (1/2)^2
THHTH     (1/2)^3 × (1/2)^2

- 5C3 = 5!/(3! 2!)
= 10, i.e. 10 arrangements
- The probability of each unique outcome (note: they are all equal) is (1/2)^3 × (1/2)^2.
- Factorial review: n! = n(n-1)(n-2)…

Binomial distribution function
X = the number of heads tossed in 5 coin tosses
[Figure: bar chart of p(x) against x = 0, 1, 2, 3, 4, 5 (number of heads)]

Example: side effect of a drug
- If a certain drug is known to cause a side effect 10% of the time and five patients are given this drug, what is the probability that four or more experience the side effect?
- Let S denote a side-effect outcome and N an outcome without side effects.

[Table]

Solution to the example
- The probability of obtaining a given outcome with four S's and one N is (0.1)^4 × (0.9) = 0.00009.
- The probability of obtaining all five S's is (0.1)^5 = 0.00001.
- There are five outcomes with four S's and one N and one outcome with five S's, so the probability of the compound event "four or more have side effects" is 5 × 0.00009 + 0.00001 = 0.00046.

Probability density function (PDF)
- The model is concerned with the total number of successes in n trials as a random variable, denoted by X. Its probability density function is given by
  $$P(X = x) = \binom{n}{x}\,\pi^{x}(1-\pi)^{n-x}, \qquad x = 0, 1, \dots, n$$
  where $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ is the number of combinations of x objects selected from a set of n, and π is the probability of success in a single trial.

Assumptions for the Binomial Distribution
The experiment consists of n repeated trials satisfying these assumptions:
1. The n trials are all independent.
2. The parameter π, the probability of one of the 2 possible outcomes, is the same for each trial.
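The coin-toss and side-effect calculations above both follow the same binomial formula, which can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes Python 3.8+ for `math.comb`, and the function names `binomial_pmf` and `prob_at_least` are not from the slides.

```python
from math import comb


def binomial_pmf(x: int, n: int, pi: float) -> float:
    """P(X = x) for X ~ Binomial(n, pi): C(n, x) * pi^x * (1 - pi)^(n - x)."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * pi ** x * (1.0 - pi) ** (n - x)


def prob_at_least(k: int, n: int, pi: float) -> float:
    """P(X >= k), summing the individual probabilities from k to n."""
    return sum(binomial_pmf(x, n, pi) for x in range(max(k, 0), n + 1))


if __name__ == "__main__":
    # Exactly 3 heads in 5 tosses of a fair coin.
    print(binomial_pmf(3, 5, 0.5))
    # Four or more of five patients experience a side effect that occurs 10% of the time.
    print(round(prob_at_least(4, 5, 0.1), 5))
    # The probabilities over x = 0..n add up to 1.
    print(round(sum(binomial_pmf(x, 5, 0.5) for x in range(6)), 10))
```

Both assumptions listed above matter here: the formula multiplies per-trial probabilities (independence) and uses a single value of π for every trial.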
The mean and variance of the binomial distribution
$$\mu_x = n\pi, \qquad \sigma_x^2 = n\pi(1-\pi), \qquad \sigma_x = \sqrt{n\pi(1-\pi)}$$
where π is the probability of having a positive outcome from a single trial.
When the number of trials n is moderate to large (n > 25, say), we approximate the binomial distribution by a normal distribution and answer probability questions by first converting to a standard normal score:
$$z = \frac{x - n\pi}{\sqrt{n\pi(1-\pi)}}$$

Solution to Example
- For π = 0.1 and n = 30, we have $\mu_x = 30 \times 0.1 = 3$ and $\sigma_x^2 = 30 \times 0.1 \times 0.9 = 2.7$.

[Figure: binomial PDF, P(X) against X, for (n = 5, π = 0.3), (n = 10, π = 0.3), (n = 30, π = 0.3) and (n = 20, π = 0.5)]

Review: experiment & survey
- 2 types of research: experimental and observational
- Clinical trial (4 phases)
- Statistical considerations in clinical trials: control, randomization, blinding, replication (appropriate sample size)
- Probabilistic sampling techniques

Review of the idea of probability

Idea of probability
- Definitions of probability
  - Classical probability: if a random experiment can result in n possible mutually exclusive and equally likely outcomes and n_A of these outcomes have an attribute A, then the probability of A is P(A) = n_A/n.
  - Statistical probability: if an experiment is performed n times and n_A of these result in the outcome A, then the probability of A occurring is defined as the limiting ratio P(A) = n_A/n as n becomes large.
  - Subjective probability: probability represents one's belief regarding the likelihood of an outcome A occurring.
- Probability of an event = p, with 0 <= p <= 1

Rules for Computing
- If A and B have no outcomes in common, they cannot occur simultaneously; they are mutually exclusive events, and P(A or B) = P(A) + P(B).
- If events A and B are independent, then P(A & B) = P(A) × P(B).

Conditional Probability
- Concerns the probability of one event occurring, given that another event has occurred.
- P(A|B) = probability of A, given B.
- In general, P(B|A) = P(A & B)/P(A).
- If A and B are independent, then P(B|A) = P(A) × P(B)/P(A) = P(B).
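The addition, multiplication, and conditional-probability rules above can be checked by brute-force enumeration under the classical definition (equally likely outcomes). The sketch below is illustrative only: the two-dice sample space and the event names are assumptions chosen for the example, not taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Classical probability: 36 equally likely outcomes of rolling two fair dice.
OUTCOMES = list(product(range(1, 7), repeat=2))


def prob(event) -> Fraction:
    """P(event) = (number of outcomes with the attribute) / (number of outcomes)."""
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))


def cond_prob(event, given) -> Fraction:
    """P(event | given) = P(event & given) / P(given)."""
    return prob(lambda o: event(o) and given(o)) / prob(given)


def first_even(o):      # A: first die shows an even number
    return o[0] % 2 == 0


def second_high(o):     # B: second die shows 5 or 6
    return o[1] >= 5


def total_is_2(o):      # C: the two dice sum to 2
    return sum(o) == 2


def total_is_12(o):     # D: the two dice sum to 12
    return sum(o) == 12


if __name__ == "__main__":
    # Independent events: P(A & B) = P(A) * P(B), and P(B | A) = P(B).
    print(prob(lambda o: first_even(o) and second_high(o)),
          prob(first_even) * prob(second_high))
    print(cond_prob(second_high, first_even), prob(second_high))
    # Mutually exclusive events: P(C or D) = P(C) + P(D).
    print(prob(lambda o: total_is_2(o) or total_is_12(o)),
          prob(total_is_2) + prob(total_is_12))
```

Using `Fraction` keeps every probability exact, so the equalities hold without rounding error.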
Percentile calculation

Quartiles
- Quartiles divide data into four equal parts.
- First quartile, Q1: 25% of observations are below Q1 and 75% above Q1. Also called the lower quartile.
- Second quartile, Q2: 50% of observations are below Q2 and 50% above Q2. This is also the median.
- Third quartile, Q3: 75% of observations are below Q3 and 25% above Q3. Also called the upper quartile.

Calculating percentiles
Example: The sorted observations are 2, 5, 9, 12, 14, 15, 18, 24, 60. Find the median and P20.
Solution: The number of observations is n = 9. The percentile P_X is the observation at position i = (n + 1) × X%:
$$P_X = X_{(n+1)\cdot X\%}$$
$$P_{50} = X_{(9+1)\times 50\%} = X_5 = 14$$
$$P_{20} = X_{(9+1)\times 20\%} = X_2 = 5$$

Calculating percentiles (non-integer position)
The sorted observations are 4, 9, 10, 12, 14, 20, 24, 61. Find the median and P20.
- Here n = 8, so for P20 the position is i = (n + 1) × 20% = 1.8, which is not an integer. With j the integer part of i, interpolate between X_j and X_{j+1}:
$$P_X = X_j + (X_{j+1} - X_j)\times(i - j)$$
$$P_{20} = X_1 + (X_2 - X_1)\times(1.8 - 1) = 4 + (9 - 4)\times 0.8 = 8$$
- For the median, i = (n + 1) × 50% = 4.5, so
$$P_{50} = X_4 + (X_5 - X_4)\times(4.5 - 4) = 12 + (14 - 12)\times 0.5 = 13$$

Calculation of percentiles from a grouped frequency table
Example: The frequency distribution of the systolic blood pressure readings (in mm of mercury) of 200 randomly selected college students is shown here.

Class boundaries   Frequency (f)   Cumulative frequency (C)   Cumulative percent (%)
89.5-              24              24                         12
104.5-             62              86                         43
119.5-             72              158                        79
134.5-             26              184                        92
149.5-             12              196                        98
164.5-             4               200                        100

The class interval that contains the relevant quartile is called the quartile class.

Calculation of quartiles from a grouped frequency table
$$Q_1 = L + \frac{\frac{n}{4} - C}{f}\times i, \qquad Q_3 = L + \frac{\frac{3n}{4} - C}{f}\times i$$
where:
L = the real lower limit of the quartile class (containing Q1 or Q3)
n = Σf = the total number of observations in the entire data set
C = the cumulative frequency in the class immediately before the quartile class
f = the frequency of the relevant quartile class
i = the length of the real class interval of the relevant quartile class

For the blood pressure data, n/4 = 50 falls in the class 104.5-, so L = 104.5, C = 24, f = 62 and i = 15:
$$Q_1 = P_{25} = 104.5 + \frac{50 - 24}{62}\times 15 = 110.79$$

In general, the X-th percentile from a grouped frequency table is
$$P_X = L + \frac{n\cdot X\% - C}{f}\times i$$
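Both percentile rules above, the (n + 1) position rule for raw data and the interpolation formula for a grouped table, can be sketched in Python. This is a hedged illustration rather than the slides' own procedure: the function names are hypothetical, equal class widths are assumed, and the handling of positions that fall outside 1..n (clamping to the smallest or largest observation) is an assumption the slides do not cover.

```python
import math


def percentile_sorted(sorted_values, pct):
    """Percentile by the (n + 1) * X% position rule with linear interpolation."""
    n = len(sorted_values)
    i = (n + 1) * pct / 100.0       # 1-based position of the percentile
    j = math.floor(i)               # integer part of the position
    if j < 1:                       # assumption: clamp below the first observation
        return sorted_values[0]
    if j >= n:                      # assumption: clamp above the last observation
        return sorted_values[-1]
    lower, upper = sorted_values[j - 1], sorted_values[j]
    return lower + (upper - lower) * (i - j)


def percentile_grouped(lower_limits, freqs, width, pct):
    """P_X = L + (n * X% - C) / f * i for a grouped frequency table."""
    n = sum(freqs)
    target = n * pct / 100.0        # n * X%
    cumulative = 0                  # C: cumulative frequency before the current class
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("pct must be between 0 and 100")


if __name__ == "__main__":
    print(percentile_sorted([2, 5, 9, 12, 14, 15, 18, 24, 60], 50))
    print(round(percentile_sorted([4, 9, 10, 12, 14, 20, 24, 61], 20), 2))
    limits = [89.5, 104.5, 119.5, 134.5, 149.5, 164.5]
    freqs = [24, 62, 72, 26, 12, 4]
    print(round(percentile_grouped(limits, freqs, 15, 25), 2))  # Q1
    print(round(percentile_grouped(limits, freqs, 15, 75), 2))  # Q3
```

Other software may use different interpolation conventions for percentiles, so small differences from these results are expected when comparing against a statistics package.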
