Chapter 4 Basic Probability and Probability Distributions

Probability Terminology
• Classical Interpretation: Notion of probability based on equal likelihood of individual possibilities (a coin toss has a 1/2 chance of Heads, a card draw has a 4/52 chance of an Ace). Origins in games of chance.
  – Outcome: Distinct result of a random process ($N$ = # of outcomes)
  – Event: Collection of outcomes ($N_e$ = # of outcomes in event)
  – Probability of event E: $P(E) = N_e/N$
• Relative Frequency Interpretation: If an experiment were conducted repeatedly, what fraction of the time would the event of interest occur (based on empirical observation)
• Subjective Interpretation: Personal view (possibly based on external info) of how likely a one-shot experiment is to end in the event of interest

Obtaining Event Probabilities
• Classical Approach
  – List all $N$ possible outcomes of the experiment
  – List all $N_e$ outcomes corresponding to the event of interest (E)
  – $P(E) = N_e/N$
• Relative Frequency Approach
  – Define event of interest
  – Conduct experiment repeatedly (often using a computer)
  – Measure the fraction of time event E occurs
• Subjective Approach
  – Obtain as much information on the process as possible
  – Consider different outcomes and their likelihood
  – When possible, monitor your skill (e.g. stocks, weather)

Basic Probability and Rules
• A, B: Events of interest
• P(A), P(B): Event probabilities
• Union: Event that either A or B occurs ($A \cup B$)
• Mutually Exclusive: A, B cannot occur at the same time
  – If A, B are mutually exclusive: P(either A or B) = P(A) + P(B)
• Complement of A: Event that A does not occur ($\bar{A}$)
  – $P(\bar{A}) = 1 - P(A)$. That is: $P(A) + P(\bar{A}) = 1$
• Intersection: Event that both A and B occur ($A \cap B$ or $AB$)
• $P(A \cup B) = P(A) + P(B) - P(AB)$

Conditional Probability and Independence
• Unconditional/Marginal Probability: Frequency with which an event occurs in general (given no additional info). P(A)
• Conditional Probability: Probability that an event (A) occurs given knowledge that another event (B) has occurred.
  P(A|B)
• Independent Events: Events whose unconditional and conditional (given the other) probabilities are the same

$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(AB)}{P(B)}$
$P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(AB)}{P(A)}$
$P(A \cap B) = P(AB) = P(A)P(B|A) = P(B)P(A|B)$
A, B independent $\iff$ $P(A) = P(A|B)$ and $P(B) = P(B|A)$

John Snow London Cholera Death Study
• 2 Water Companies (Let D be the event of death):
  – Southwark & Vauxhall (S): 264913 customers, 3702 deaths
  – Lambeth (L): 171363 customers, 407 deaths
  – Overall: 436276 customers, 4109 deaths

$P(D) = \frac{4109}{436276} = .0094$ (94 per 10000 people)
$P(D|S) = \frac{3702}{264913} = .0140$ (140 per 10000 people)
$P(D|L) = \frac{407}{171363} = .0024$ (24 per 10000 people)

Note that the probability of death is almost 6 times higher for S&V customers than for Lambeth customers (this was important in showing how cholera spread)

John Snow London Cholera Death Study

                     Cholera Death
Water Company   Yes             No                Total
S&V             3702 (.0085)    261211 (.5987)    264913 (.6072)
Lambeth          407 (.0009)    170956 (.3919)    171363 (.3928)
Total           4109 (.0094)    432167 (.9906)    436276 (1.0000)

Contingency Table with joint probabilities (in body of table) and marginal probabilities (on edge of table)

John Snow London Cholera Death Study

Water User
  Company S&V (.6072)
    Death D (.0140):       joint probability .0085
    Death D^C (.9860):     joint probability .5987
  Company L (.3928)
    Death D (.0024):       joint probability .0009
    Death D^C (.9976):     joint probability .3919

Tree Diagram obtaining joint probabilities by the multiplication rule

Bayes’s Rule - Updating Probabilities
• Let A1,…,Ak be a set of events that partition a sample space (mutually exclusive and exhaustive) such that:
  – each set has known P(Ai) > 0 (each event can occur)
  – for any 2 sets Ai and Aj, P(Ai and Aj) = 0 (events are disjoint)
  – P(A1) + … + P(Ak) = 1 (each outcome belongs to one of the events)
• If C is an event such that
  – 0 < P(C) < 1 (C can occur, but will not necessarily occur)
  – We know the probability that C will occur given each event Ai: P(C|Ai)
• Then we can compute the probability of Ai given that C occurred:

$P(A_i|C) = \frac{P(C|A_i)P(A_i)}{P(C|A_1)P(A_1) + \cdots + P(C|A_k)P(A_k)} = \frac{P(A_i \text{ and } C)}{P(C)}$

Northern Army at Gettysburg

Regiment       Label   Initial #   Casualties   P(Ai)    P(C|Ai)   P(C|Ai)*P(Ai)   P(Ai|C)
I Corps        A1      10022       6059         0.1051   0.6046    0.0635          0.2630
II Corps       A2      12884       4369         0.1351   0.3391    0.0458          0.1896
III Corps      A3      11924       4211         0.1250   0.3532    0.0442          0.1828
V Corps        A4      12509       2187         0.1312   0.1748    0.0229          0.0949
VI Corps       A5      15555        242         0.1631   0.0156    0.0025          0.0105
XI Corps       A6       9839       3801         0.1032   0.3863    0.0399          0.1650
XII Corps      A7       8589       1082         0.0901   0.1260    0.0113          0.0470
Cav Corps      A8      11501        852         0.1206   0.0741    0.0089          0.0370
Arty Reserve   A9       2546        242         0.0267   0.0951    0.0025          0.0105
Sum                    95369      23045         1                  0.2416 = P(C)   1.0002

• Regiments: partition of soldiers (A1,…,A9). Casualty: event C
• P(Ai) = (size of regiment) / (total soldiers) = (Column 3)/95369
• P(C|Ai) = (# casualties) / (regiment size) = (Col 4)/(Col 3)
• P(C|Ai) P(Ai) = P(Ai and C) = (Col 5)*(Col 6)
• P(C) = sum(Col 7)
• P(Ai|C) = P(Ai and C) / P(C) = (Col 7)/.2416

Example - OJ Simpson Trial
• Given Information on Blood Test (T+/T-)
  – Sensitivity: P(T+|Guilty) = 1
  – Specificity: P(T-|Innocent) = .9957, so P(T+|I) = .0043
• Suppose you have a prior belief of guilt: P(G) = p*
• What is the “posterior” probability of guilt after seeing evidence that the blood matches: P(G|T+)?

$P(T^+) = P(T^+ \cap G) + P(T^+ \cap I) = P(G)P(T^+|G) + P(I)P(T^+|I) = p^*(1) + (1-p^*)(.0043)$
$P(G|T^+) = \frac{P(T^+ \cap G)}{P(T^+)} = \frac{P(G)P(T^+|G)}{P(T^+)} = \frac{p^*(1)}{p^*(1) + (1-p^*)(.0043)} = \frac{p^*}{.9957p^* + .0043}$

Source: B. Forst (1996). “Evidence, Probabilities and Legal Standards for Determination of Guilt: Beyond the OJ Trial”, in Representing OJ: Murder, Criminal Justice, and the Mass Culture, ed. G. Barak, pp. 22-28. Harrow and Heston, Guilderland, NY

[Figure: Probability OJ is Guilty Given He Tested Positive. P(G|T+) (vertical axis, 0 to 1) plotted against p* (horizontal axis, 0 to 1).]

Random Variables/Probability Distributions
• Random Variable: Outcome characteristic that is not known prior to experiment/observation
• Qualitative Variables: Characteristics that are nonnumeric (e.g. gender, race, religion, severity)
• Quantitative Variables: Characteristics that are numeric (e.g. height, weight, distance)
  – Discrete: Takes on only a countable set of possible values
  – Continuous: Takes on values along a continuum
• Probability Distribution: Numeric description of the outcomes a random variable takes on, and their corresponding probabilities (discrete) or densities (continuous)

Discrete Random Variables
• Discrete RV: Can take on a finite (or countably infinite) set of possible outcomes
• Probability Distribution: List of values a random variable can take on and their corresponding probabilities
  – Individual probabilities must lie between 0 and 1
  – Probabilities sum to 1
• Notation:
  – Random variable: Y
  – Values Y can take on: y1, y2, …, yk
  – Probabilities: P(Y=y1) = p1 … P(Y=yk) = pk
  – p1 + … + pk = 1

Example: Wars Begun by Year (1482-1939)
• Distribution of numbers of wars started by year
• Y = # of wars started in a randomly selected year
• Levels: y1=0, y2=1, y3=2, y4=3, y5=4
• Probability Distribution:

#Wars   Probability
0       0.5284
1       0.3231
2       0.1070
3       0.0328
4       0.0087

[Histogram: Years (vertical axis, 0 to 300) by Wars (horizontal axis: 0, 1, 2, 3, 4, More).]

Masters Golf Tournament 1st Round Scores

Score   Frequency   Probability
63        1         0.000288
64        2         0.000576
65        6         0.001728
66       16         0.004608
67       46         0.013249
68       67         0.019297
69      151         0.043491
70      238         0.068548
71      337         0.097062
72      428         0.123272
73      467         0.134505
74      498         0.143433
75      397         0.114343
76      293         0.084389
77      203         0.058468
78      125         0.036002
79       78         0.022465
80       50         0.014401
81       28         0.008065
82       17         0.004896
83        7         0.002016
84        7         0.002016
85        4         0.001152
86        3         0.000864
87        1         0.000288
88        2         0.000576

[Histogram: Frequency (vertical axis, 0 to 600) by Score (horizontal axis, 63 to 90).]

Means and Variances of Random Variables
• Mean: Long-run average a random variable will take on (also the balance point of the probability distribution)
• Expected Value is another term for the mean; however, we do not really expect that a realization of X will necessarily be close to its mean.
  Notation: E(X)
• Mean and Variance of a discrete random variable:

$E(Y) = \mu_Y = y_1 p_1 + y_2 p_2 + \cdots + y_k p_k = \sum_i y_i p_i$
$V(Y) = \sigma_Y^2 = E[(Y - \mu_Y)^2] = \sum_i (y_i - \mu_Y)^2 p_i = \sum_i y_i^2 p_i - \mu_Y^2$

Rules for Means
• Linear Transformations: a + bY (where a and b are constants): $E(a+bY) = \mu_{a+bY} = a + b\mu_Y$
• Sums of random variables: X + Y (where X and Y are random variables): $E(X+Y) = \mu_{X+Y} = \mu_X + \mu_Y$
• Linear Functions of Random Variables: $E(a_1Y_1 + \cdots + a_nY_n) = a_1\mu_1 + \cdots + a_n\mu_n$ where $E(Y_i) = \mu_i$

Example: Masters Golf Tournament
• Mean by Round (Note ordering): $\mu_1 = 73.54$, $\mu_2 = 73.07$, $\mu_3 = 73.76$, $\mu_4 = 73.91$
• Mean Score per hole (18) for round 1: $E((1/18)X_1) = (1/18)\mu_1 = (1/18)73.54 = 4.09$
• Mean Score versus par (72) for round 1: $E(X_1 - 72) = \mu_{X_1 - 72} = 73.54 - 72 = +1.54$ (1.54 over par)
• Mean Difference (Round 1 - Round 4): $E(X_1 - X_4) = \mu_1 - \mu_4 = 73.54 - 73.91 = -0.37$
• Mean Total Score: $E(X_1+X_2+X_3+X_4) = \mu_1 + \mu_2 + \mu_3 + \mu_4 = 73.54 + 73.07 + 73.76 + 73.91 = 294.28$ (6.28 over par)

Variance of a Random Variable

$V(a + bY) = \sigma_{a+bY}^2 = b^2\sigma_Y^2$
$V(aX + bY) = \sigma_{aX+bY}^2 = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\rho\sigma_X\sigma_Y$
where $\rho$ is the correlation between X and Y

Special Cases:
• X and Y are independent (outcome of one does not alter the distribution of the other): $\rho = 0$, last term drops out
• a=b=1 and $\rho = 0$: $V(X+Y) = \sigma_X^2 + \sigma_Y^2$
• a=1, b=-1 and $\rho = 0$: $V(X-Y) = \sigma_X^2 + \sigma_Y^2$
• a=b=1 and $\rho \neq 0$: $V(X+Y) = \sigma_X^2 + \sigma_Y^2 + 2\rho\sigma_X\sigma_Y$
• a=1, b=-1 and $\rho \neq 0$: $V(X-Y) = \sigma_X^2 + \sigma_Y^2 - 2\rho\sigma_X\sigma_Y$

Examples - Wars & Masters Golf

#Wars   Probability   x*p
0       0.5284        0.0000
1       0.3231        0.3231
2       0.1070        0.2140
3       0.0328        0.0983
4       0.0087        0.0349
Sum     1.0000        0.6703

$\mu = 0.67$

Score   prob       x*p
63      0.000288    0.0181
64      0.000576    0.0369
65      0.001728    0.1123
66      0.004608    0.3041
67      0.013249    0.8877
68      0.019297    1.3122
69      0.043491    3.0009
70      0.068548    4.7984
71      0.097062    6.8914
72      0.123272    8.8756
73      0.134505    9.8188
74      0.143433   10.6141
75      0.114343    8.5757
76      0.084389    6.4136
77      0.058468    4.5020
78      0.036002    2.8082
79      0.022465    1.7748
80      0.014401    1.1521
81      0.008065    0.6532
82      0.004896    0.4015
83      0.002016    0.1673
84      0.002016    0.1694
85      0.001152    0.0979
86      0.000864    0.0743
87      0.000288    0.0251
88      0.000576    0.0507
Sum     1          73.54

$\mu = 73.54$

Binomial Distribution for Sample Counts
• Binomial “Experiment”
  – Consists of n trials or observations
  – Trials/observations are independent of one another
  – Each trial/observation can end in one of two possible outcomes, often labelled “Success” and “Failure”
  – The probability of success, p, is constant across trials/observations
  – Random variable, Y, is the number of successes observed in the n trials/observations.
• Binomial Distributions: Family of distributions for Y, indexed by success probability (p) and number of trials/observations (n). Notation: Y~B(n,p)

Binomial Distributions and Sampling
• Problem when sampling from a finite population: the sequence of probabilities of Success is altered after observing earlier individuals.
• When the population is much larger than the sample (say at least 20 times as large), the effect is minimal and we say Y is approximately binomial
• Obtaining probabilities:

$P(Y = y) = P(y) = \binom{n}{y} p^y (1-p)^{n-y}, \quad y = 0, 1, \ldots, n, \qquad \binom{n}{y} = \frac{n!}{y!(n-y)!}$

Example - Diagnostic Test
• Test claims to have a sensitivity of 90% (among people with the condition, the probability of testing positive is .90)
• 10 people who are known to have the condition are identified; Y is the number that correctly test positive

$P(Y = k) = \binom{10}{k} (.9)^k (.1)^{10-k}, \quad k = 0, 1, \ldots, 10, \qquad \binom{10}{k} = \frac{10!}{k!(10-k)!}$

k     P(k)
0     1E-10
1     9E-09
2     3.64E-07
3     8.75E-06
4     0.000138
5     0.001488
6     0.01116
7     0.057396
8     0.19371
9     0.38742
10    0.348678

• Table obtained in EXCEL with function: BINOMDIST(k,n,p,FALSE) (the TRUE option gives the cumulative distribution function: $P(Y \le k)$)

Binomial Mean & Standard Deviation
• Let $S_i = 1$ if the ith individual was a success, 0 otherwise
• Then $P(S_i = 1) = p$ and $P(S_i = 0) = 1-p$
• Then $E(S_i) = \mu_S = 1(p) + 0(1-p) = p$
• Note that $Y = S_1 + \cdots + S_n$ and that trials are independent
• Then $E(Y) = \mu_Y = n\mu_S = np$
• $V(S_i) = E(S_i^2) - \mu_S^2 = p - p^2 = p(1-p)$
• Then $V(Y) = \sigma_Y^2 = np(1-p)$

$Y \sim B(n, p) \Rightarrow E(Y) = \mu_Y = np, \quad \sigma_Y = \sqrt{np(1-p)}$
For the diagnostic test: $\mu = 10(0.9) = 9.0$, $\sigma = \sqrt{10(0.9)(0.1)} = 0.95$

Continuous Random Variables
• Variable can take on any value along a continuous range of numbers (interval)
• Probability distribution is described by a smooth density curve
• Probabilities of ranges of values for Y correspond to areas under the density curve
  – Curve must lie on or above the horizontal axis
  – Total area under the curve is 1
• Special case: Normal distributions

Normal Distribution
• Bell-shaped, symmetric family of distributions
• Classified by 2 parameters: Mean ($\mu$) and standard deviation ($\sigma$). These represent location and spread
• Random variables that are approximately normal have the following properties with respect to individual measurements:
  – Approximately half (50%) fall above (and below) the mean
  – Approximately 68% fall within 1 standard deviation of the mean
  – Approximately 95% fall within 2 standard deviations of the mean
  – Virtually all fall within 3 standard deviations of the mean
• Notation when Y is normally distributed with mean $\mu$ and standard deviation $\sigma$: $Y \sim N(\mu, \sigma)$

[Figure: Two Normal Distributions]

Normal Distribution
$P(Y \ge \mu) = 0.50$
$P(\mu - \sigma \le Y \le \mu + \sigma) \approx 0.68$
$P(\mu - 2\sigma \le Y \le \mu + 2\sigma) \approx 0.95$

Example - Heights of U.S. Adults
• Female and Male adult heights are well approximated by normal distributions: $Y_F \sim N(63.7, 2.5)$, $Y_M \sim N(69.1, 2.6)$

[Figure: two histograms of height in inches. INCHESF: Mean = 63.7, Std. Dev = 2.48, N = 99.68, cases weighted by PCTF. INCHESM: Mean = 69.1, Std. Dev = 2.61, N = 99.23, cases weighted by PCTM.]

Source: Statistical Abstract of the U.S. (1992)

Standard Normal (Z) Distribution
• Problem: Unlimited number of possible normal distributions ($-\infty < \mu < \infty$, $\sigma > 0$)
• Solution: Standardize the random variable to have mean 0 and standard deviation 1:
  $Y \sim N(\mu, \sigma) \Rightarrow Z = \frac{Y - \mu}{\sigma} \sim N(0, 1)$
• Probabilities of certain ranges of values and specific percentiles of interest can be obtained through the standard normal (Z) distribution

[Figure: Standard Normal (Z) Distribution. Density f(z) (0 to 0.45) versus z (-4 to 4), with regions labeled "Table Area" and "1-Table Area".]

Standard Normal (Z) Distribution table, left-hand page (negative z). Rows: integer part & 1st decimal of z. Columns: 2nd decimal place. Entries are $P(Z \le z)$.

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
-2.9   0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
-2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
-2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
-2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
-2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
-2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
-2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
-2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
-2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
-2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
-1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
-1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
-1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
-1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
-1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
-1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
-1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
-1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
-1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
-0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
-0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
-0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
-0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
-0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
-0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
-0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
-0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
-0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
-0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641

Standard Normal (Z) Distribution table, right-hand page (positive z). Rows: integer part & 1st decimal of z. Columns: 2nd decimal place. Entries are $P(Z \le z)$.

z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0    0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1    0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2    0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3    0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4    0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5    0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6    0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7    0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8    0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9    0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0    0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1    0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2    0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3    0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4    0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5    0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6    0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7    0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8    0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9    0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0    0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1    0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2    0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3    0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4    0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5    0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6    0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7    0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8    0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9    0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0    0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990

Finding Probabilities of Specific Ranges
• Step 1 - Identify the normal distribution of interest (e.g. its mean ($\mu$) and standard deviation ($\sigma$))
• Step 2 - Identify the range of values that you wish to determine the probability of observing ($y_L$, $y_U$), where often the upper or lower bound is $\infty$ or $-\infty$
• Step 3 - Transform $y_L$ and $y_U$ into Z-values: $z_L = \frac{y_L - \mu}{\sigma}$, $z_U = \frac{y_U - \mu}{\sigma}$
• Step 4 - Obtain $P(z_L \le Z \le z_U)$ from the Z-table

Example - Adult Female Heights
• What is the probability a randomly selected female is 5'10" or taller (70 inches)?
• Step 1 - $Y \sim N(63.7, 2.5)$
• Step 2 - $y_L = 70.0$, $y_U = \infty$
• Step 3 - $z_L = \frac{70.0 - 63.7}{2.5} = 2.52$, $z_U = \frac{\infty - 63.7}{2.5} = \infty$
• Step 4 - $P(Y \ge 70) = P(Z \ge 2.52) = 1 - P(Z \le 2.52) = 1 - .9941 = .0059$ ($\approx 1/170$)

z     .00     .01     .02     .03
2.4   .9918   .9920   .9922   .9925
2.5   .9938   .9940   .9941   .9943
2.6   .9953   .9955   .9956   .9957

Finding Percentiles of a Distribution
• Step 1 - Identify the normal distribution of interest (e.g. its mean ($\mu$) and standard deviation ($\sigma$))
• Step 2 - Determine the percentile of interest 100p% (e.g. the 90th percentile is the cut-off where only 90% of scores are below and 10% are above).
• Step 3 - Find p in the body of the z-table and its corresponding z-value ($z_p$) on the outer edge:
  – If 100p < 50 then use the left-hand page of the table
  – If 100p $\ge$ 50 then use the right-hand page of the table
• Step 4 - Transform $z_p$ back to original units: $y_p = \mu + z_p\sigma$

Example - Adult Male Heights
• Above what height do the tallest 5% of males lie?
• Step 1 - $Y \sim N(69.1, 2.6)$
• Step 2 - Want to determine the 95th percentile (p = .95)
• Step 3 - $P(Z \le 1.645) = .95$
• Step 4 - $y_{.95} = 69.1 + (1.645)(2.6) = 73.4$ (6'1.4")

z     .03     .04     .05     .06
1.5   .9370   .9382   .9394   .9406
1.6   .9484   .9495   .9505   .9515
1.7   .9582   .9591   .9599   .9608

Assessing Normality and Transformations
• Obtain a histogram and see if mound-shaped
• Obtain a normal probability plot
  – Order data from smallest to largest and rank them (1 to n)
  – Obtain a percentile for each: pct = (rank-0.375)/(n+0.25)
  – Obtain the z-score corresponding to the percentile
  – Plot observed data versus z-score, see if straight line (approx.)
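
The ranking and z-score steps for a normal probability plot can be sketched in Python as follows. This is a minimal illustration rather than a prescribed method: the function name `normal_scores` and the five sample values are hypothetical, ties are not treated specially, and the inverse normal CDF is assumed to come from the standard library's `statistics.NormalDist` (Python 3.8+). Plotting the resulting pairs is left to whatever charting tool is at hand.

```python
from statistics import NormalDist


def normal_scores(data):
    """Return (observed value, z-score) pairs for a normal probability plot.

    Percentile formula taken from the text: pct = (rank - 0.375) / (n + 0.25).
    """
    n = len(data)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    # Order data from smallest to largest and rank them 1..n
    for rank, value in enumerate(sorted(data), start=1):
        pct = (rank - 0.375) / (n + 0.25)
        z = std_normal.inv_cdf(pct)  # z-score corresponding to the percentile
        pairs.append((value, z))
    return pairs


# Hypothetical sample of five measurements (illustration only)
sample = [66.1, 61.2, 63.9, 68.4, 62.5]
for value, z in normal_scores(sample):
    print(f"{value:6.1f}  {z:7.3f}")
```

If the printed pairs fall close to a straight line when plotted, the data are roughly consistent with a normal distribution.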
• Transformations that can achieve approximate normality:
  – Data are percentages: $Y' = \arcsin\sqrt{Y/100}$
  – Data are counts: $Y' = \sqrt{Y}$
  – Data are skewed right (and positive): $Y' = \ln(Y)$

Sampling Distributions
• Distribution of a Sample Statistic: The probability distribution of a sample statistic obtained from a random sample or a randomized experiment
  – What values can a sample mean (or proportion) take on and how likely are ranges of values?
• Population Distribution: Set of values for a variable for a population of individuals. Conceptually equivalent to a probability distribution in the sense of selecting an individual at random and observing their value of the variable of interest

Sampling Distribution of a Sample Mean
• Obtain a sample of n independent measurements of a quantitative variable: $Y_1, \ldots, Y_n$ from a population with mean $\mu$ and standard deviation $\sigma$
  – Averages will be less variable than the individual measurements
  – Sampling distributions of averages will become more like a normal distribution as n increases (regardless of the shape of the population of individual measurements)

$E(\bar{Y}) = E\left(\frac{1}{n}\sum_i Y_i\right) = \frac{1}{n}(n\mu) = \mu = \mu_{\bar{Y}}$
$V(\bar{Y}) = V\left(\frac{1}{n}\sum_i Y_i\right) = \left(\frac{1}{n}\right)^2 (n\sigma^2) = \frac{\sigma^2}{n} = \sigma_{\bar{Y}}^2 \quad\Rightarrow\quad \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}}$

Central Limit Theorem
• When random samples of size n are selected from any population with mean $\mu$ and finite standard deviation $\sigma$, the sampling distribution of the sample mean will be approximately normally distributed for large n:

$\bar{Y} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$ approximately, for large n

The Z-table can be used to approximate probabilities of ranges of values for sample means, as well as percentiles of their sampling distribution

Sample Proportions
• Counts of Successes (Y) are rarely reported due to their dependency on sample size (n)
• More common is to report the sample proportion of successes:

$\hat{p} = \frac{\text{\# of successes in sample}}{\text{sample size}} = \frac{Y}{n}$
$E(\hat{p}) = \mu_{\hat{p}} = p, \qquad V(\hat{p}) = \sigma_{\hat{p}}^2 = \frac{p(1-p)}{n}, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$

Sampling Distributions for Counts & Proportions
• For samples of size n, counts (and thus proportions) can take on only n+1 distinct possible outcomes (0, 1, …, n)
• As the sample size n gets large, so does the number of possible values, and the sampling distribution begins to approximate a normal distribution. Common rule of thumb: $np \ge 10$ and $n(1-p) \ge 10$ to use the normal approximation

$Y \sim N\left(np, \sqrt{np(1-p)}\right)$ (approximately)
$\hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)$ (approximately)

Sampling Distribution for Y~B(n=1000,p=0.2)

[Figure: Sampling Distribution of X (n=1000, p=0.2). Probability (vertical axis, 0 to 0.035) versus # Successes (horizontal axis, 1 to 1001).]

$\mu_Y = np = 1000(.20) = 200$
$\sigma_Y = \sqrt{np(1-p)} = \sqrt{1000(.2)(.8)} = 12.65$

Using Z-Table for Approximate Probabilities
• To find probabilities of certain ranges of counts or proportions, we can make use of the fact that sample counts and proportions are approximately normally distributed for large sample sizes.
  – Define range of interest
  – Obtain mean of the sampling distribution
  – Obtain standard deviation of the sampling distribution
  – Transform range of interest to range of Z-values
  – Obtain (approximate) probabilities from the Z-table

Coin Tossing (Heads): $P(\hat{p} \ge 0.51 \mid n = 1000 \text{ tosses})$
Range: $\hat{p} \ge 0.51$
Mean: $\mu_{\hat{p}} = p = 0.50$
SD: $\sigma_{\hat{p}} = \sqrt{\frac{(0.5)(0.5)}{1000}} = .0158$
$z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{0.51 - 0.50}{.0158} = 0.63$
$P(Z \ge 0.63) = 1 - P(Z \le 0.63) = 1 - .7357 = .2643$
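
The coin-tossing calculation can be checked numerically with a short Python sketch. The helper names (`normal_cdf`, `proportion_tail_normal`, `count_tail_exact`) are hypothetical and introduced only for illustration; the standard normal CDF is computed from `math.erf` instead of a Z-table, and z is not rounded to two decimals, so the approximate probability differs slightly from the hand calculation. The exact binomial tail is included for comparison and assumes that a proportion of at least 0.51 in 1000 tosses means at least 510 heads.

```python
import math


def normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def proportion_tail_normal(p_hat, p, n):
    """Normal approximation to P(sample proportion >= p_hat); returns (z, probability)."""
    sd = math.sqrt(p * (1.0 - p) / n)  # SD of the sampling distribution
    z = (p_hat - p) / sd               # transform the range to a Z-value
    return z, 1.0 - normal_cdf(z)


def count_tail_exact(y_min, n, p):
    """Exact binomial P(Y >= y_min) for Y ~ B(n, p)."""
    return sum(
        math.comb(n, y) * p**y * (1.0 - p) ** (n - y)
        for y in range(y_min, n + 1)
    )


z, approx = proportion_tail_normal(0.51, 0.50, 1000)
exact = count_tail_exact(510, 1000, 0.50)
print(f"z = {z:.4f}")
print(f"normal approximation: {approx:.4f}")
print(f"exact binomial:       {exact:.4f}")
```

The gap between the two printed probabilities reflects that the normal approximation here uses no continuity correction.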