Chapter 4
Basic Probability and Probability Distributions
Probability Terminology
• Classical Interpretation: Notion of probability based
on equal likelihood of individual possibilities (coin toss
has 1/2 chance of Heads, card draw has 4/52 chance of
an Ace). Origins in games of chance.
– Outcome: Distinct result of random process (N= # outcomes)
– Event: Collection of outcomes (Ne= # of outcomes in event)
– Probability of event E: P(event E) = Ne/N
• Relative Frequency Interpretation: If an experiment
were conducted repeatedly, what fraction of time would
event of interest occur (based on empirical observation)
• Subjective Interpretation: Personal view (possibly
based on external info) of how likely a one-shot
experiment will end in event of interest
Obtaining Event Probabilities
• Classical Approach
– List all N possible outcomes of experiment
– List all Ne outcomes corresponding to event of interest (E)
– P(event E) = Ne/N
• Relative Frequency Approach
– Define event of interest
– Conduct the experiment repeatedly (often via computer simulation; see the sketch following this list)
– Measure the fraction of times event E occurs
• Subjective Approach
– Obtain as much information on process as possible
– Consider different outcomes and their likelihood
– When possible, monitor your skill (e.g. stocks, weather)
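As an illustration of the relative-frequency approach, here is a minimal Python sketch (assuming NumPy is available; the slides themselves use Excel) that estimates the classical ace-draw probability 4/52 by repeated simulated draws:

import numpy as np

rng = np.random.default_rng(0)                 # reproducible pseudo-random draws
n_repeats = 100_000                            # number of simulated experiments
draws = rng.integers(0, 52, size=n_repeats)    # one card (0..51) drawn, with replacement, per repeat
rel_freq = np.mean(draws < 4)                  # let cards 0-3 represent the four aces
print(rel_freq, 4 / 52)                        # relative frequency vs. classical probability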
Basic Probability and Rules
• A, B: Events of interest
• P(A), P(B): Event probabilities
• Union: Event that either A or B occurs (A ∪ B)
• Mutually Exclusive: A, B cannot occur at the same time
– If A, B are mutually exclusive: P(either A or B) = P(A) + P(B)
• Complement of A: Event that A does not occur (Ā)
– P(Ā) = 1 - P(A), that is: P(A) + P(Ā) = 1
• Intersection: Event that both A and B occur (A ∩ B, or AB)
• Addition rule: P(A ∪ B) = P(A) + P(B) - P(AB)
Conditional Probability and Independence
• Unconditional/Marginal Probability: Frequency with which an
event occurs in general (given no additional information). P(A)
• Conditional Probability: Probability an event (A) occurs
given knowledge another event (B) has occurred. P(A|B)
• Independent Events: Events whose unconditional and
conditional (given the other) probabilities are the same
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(AB)}{P(B)} \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(AB)}{P(A)}$$
$$P(A \cap B) = P(AB) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$$
$$A, B \ \text{independent} \iff P(A) = P(A \mid B) \ \text{and} \ P(B) = P(B \mid A)$$
John Snow London Cholera Death Study
• 2 Water Companies (Let D be the event of death):
– Southwark&Vauxhall (S): 264913 customers, 3702 deaths
– Lambeth (L): 171363 customers, 407 deaths
– Overall: 436276 customers, 4109 deaths
$$P(D) = \frac{4109}{436276} = .0094 \quad (94 \text{ per } 10000 \text{ people})$$
$$P(D \mid S) = \frac{3702}{264913} = .0140 \quad (140 \text{ per } 10000 \text{ people})$$
$$P(D \mid L) = \frac{407}{171363} = .0024 \quad (24 \text{ per } 10000 \text{ people})$$
Note that probability of death is almost 6 times higher for S&V
customers than Lambeth customers (was important in showing how
cholera spread)
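As a quick computational check, a minimal Python sketch (standard library only) reproduces these marginal and conditional probabilities from the counts above:

# Conditional probabilities of cholera death from John Snow's counts
customers = {"S&V": 264913, "Lambeth": 171363}
deaths    = {"S&V": 3702,   "Lambeth": 407}

total_customers = sum(customers.values())                    # 436276
total_deaths    = sum(deaths.values())                       # 4109

p_death          = total_deaths / total_customers            # P(D)   ≈ .0094
p_death_given_sv = deaths["S&V"] / customers["S&V"]          # P(D|S) ≈ .0140
p_death_given_l  = deaths["Lambeth"] / customers["Lambeth"]  # P(D|L) ≈ .0024

print(p_death, p_death_given_sv, p_death_given_l)
print(p_death_given_sv / p_death_given_l)                    # relative risk, almost 6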
John Snow London Cholera Death Study
Water Company   Cholera Death: Yes   Cholera Death: No    Total
S&V             3702   (.0085)       261211  (.5987)      264913  (.6072)
Lambeth         407    (.0009)       170956  (.3919)      171363  (.3928)
Total           4109   (.0094)       432167  (.9906)      436276  (1.0000)

Contingency table with joint probabilities (in the body of the table) and marginal probabilities (on the edges of the table).
John Snow London Cholera Death Study
Tree diagram (joint probabilities obtained by the multiplication rule):

Water user: S&V (.6072)
    Death D    (conditional .0140)  ->  joint .6072 x .0140 = .0085
    No death Dᶜ (conditional .9860)  ->  joint .6072 x .9860 = .5987
Water user: Lambeth (.3928)
    Death D    (conditional .0024)  ->  joint .3928 x .0024 = .0009
    No death Dᶜ (conditional .9976)  ->  joint .3928 x .9976 = .3919
Bayes’s Rule - Updating Probabilities
• Let A1,…,Ak be a set of events that partition a sample
space such that (mutually exclusive and exhaustive):
– each set has known P(Ai) > 0 (each event can occur)
– for any 2 sets Ai and Aj (i ≠ j), P(Ai and Aj) = 0 (events are disjoint)
– P(A1) + … + P(Ak) = 1 (each outcome belongs to one of events)
• If C is an event such that
– 0 < P(C) < 1 (C can occur, but will not necessarily occur)
– We know the probability C will occur given each event Ai: P(C|Ai)
• Then we can compute probability of Ai given C occurred:
$$P(A_i \mid C) = \frac{P(A_i \text{ and } C)}{P(C)} = \frac{P(C \mid A_i)\,P(A_i)}{P(C \mid A_1)P(A_1) + \cdots + P(C \mid A_k)P(A_k)}$$
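A minimal Python sketch of this updating rule over a partition (standard library only; the function name and the example numbers are illustrative, not from the slides):

def bayes_posterior(priors, likelihoods):
    """Return P(Ai | C) for each event Ai in a partition.

    priors      -- list of P(Ai), summing to 1
    likelihoods -- list of P(C | Ai), in the same order
    """
    joints = [p * l for p, l in zip(priors, likelihoods)]  # P(Ai and C) = P(C|Ai)P(Ai)
    p_c = sum(joints)                                      # P(C), law of total probability
    return [j / p_c for j in joints]                       # P(Ai|C) = P(Ai and C) / P(C)

# Example: two-event partition with P(A1)=0.3, P(A2)=0.7, P(C|A1)=0.9, P(C|A2)=0.2
print(bayes_posterior([0.3, 0.7], [0.9, 0.2]))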
Northern Army at Gettysburg
Regiment        Label   Initial #   Casualties   P(Ai)    P(C|Ai)   P(C|Ai)*P(Ai)   P(Ai|C)
I Corps         A1      10022       6059         0.1051   0.6046    0.0635          0.2630
II Corps        A2      12884       4369         0.1351   0.3391    0.0458          0.1896
III Corps       A3      11924       4211         0.1250   0.3532    0.0442          0.1828
V Corps         A4      12509       2187         0.1312   0.1748    0.0229          0.0949
VI Corps        A5      15555       242          0.1631   0.0156    0.0025          0.0105
XI Corps        A6      9839        3801         0.1032   0.3863    0.0399          0.1650
XII Corps       A7      8589        1082         0.0901   0.1260    0.0113          0.0470
Cav Corps       A8      11501       852          0.1206   0.0741    0.0089          0.0370
Arty Reserve    A9      2546        242          0.0267   0.0951    0.0025          0.0105
Sum                     95369       23045        1                  0.2416 = P(C)   1.0002
• Regiments: partition of soldiers (A1,…,A9). Casualty: event C
• P(Ai) = (size of regiment) / (total soldiers) = (Column 3)/95369
• P(C|Ai) = (# casualties) / (regiment size) = (Col 4)/(Col 3)
• P(C|Ai) P(Ai) = P(Ai and C) = (Col 5)*(Col 6)
• P(C) = sum(Col 7)
• P(Ai|C) = P(Ai and C) / P(C) = (Col 7)/.2416
Example - OJ Simpson Trial
• Given Information on Blood Test (T+/T-)
– Sensitivity: P(T+|Guilty)=1
– Specificity: P(T-|Innocent) = .9957 ⟹ P(T+|I) = .0043
• Suppose you have a prior belief of guilt: P(G)=p*
• What is “posterior” probability of guilt after seeing
evidence that blood matches: P(G|T+)?
$$P(T^+) = P(T^+ \cap G) + P(T^+ \cap I) = P(G)P(T^+ \mid G) + P(I)P(T^+ \mid I) = p^*(1) + (1 - p^*)(.0043)$$
$$P(G \mid T^+) = \frac{P(T^+ \cap G)}{P(T^+)} = \frac{P(G)P(T^+ \mid G)}{P(T^+)} = \frac{p^*(1)}{p^*(1) + (1 - p^*)(.0043)} = \frac{p^*}{.9957\,p^* + .0043}$$
Source: B.Forst (1996). “Evidence, Probabilities and Legal Standards for Determination of Guilt: Beyond the OJ Trial”, in
Representing OJ: Murder, Criminal Justice, and the Mass Culture, ed. G. Barak pp. 22-28. Harrow and Heston, Guilderland, NY
Probability OJ is Guilty Given He Tested Positive
[Figure: posterior probability of guilt P(G|T+) (vertical axis, 0 to 1) plotted against the prior probability of guilt p* (horizontal axis, 0 to 1).]
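To reproduce the curve above for selected priors, a minimal Python sketch (standard library only; the function name is illustrative):

# Posterior probability of guilt as a function of the prior p*
def posterior_guilt(p_star, sens=1.0, false_pos=0.0043):
    """P(G | T+) via Bayes's rule, using the slide's sensitivity P(T+|G)=1
    and false-positive rate P(T+|I)=.0043."""
    return (p_star * sens) / (p_star * sens + (1 - p_star) * false_pos)

for p in [0.001, 0.01, 0.1, 0.5, 0.9]:
    print(p, round(posterior_guilt(p), 4))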
Random Variables/Probability Distributions
• Random Variable: Outcome characteristic that is not
known prior to experiment/observation
• Qualitative Variables: Characteristics that are non-numeric (e.g. gender, race, religion, severity)
• Quantitative Variables: Characteristics that are
numeric (e.g. height, weight, distance)
– Discrete: Takes on only a countable set of possible values
– Continuous: Takes on values along a continuum
• Probability Distribution: Numeric description of the outcomes a
random variable can take on, together with their corresponding
probabilities (discrete) or densities (continuous)
Discrete Random Variables
• Discrete RV: Can take on a finite (or countably infinite)
set of possible outcomes
• Probability Distribution: List of values a random variable
can take on and their corresponding probabilities
– Individual probabilities must lie between 0 and 1
– Probabilities sum to 1
• Notation:
– Random variable: Y
– Values Y can take on: y1, y2, …, yk
– Probabilities: P(Y=y1) = p1, …, P(Y=yk) = pk
– p1 + … + pk = 1
Example: Wars Begun by Year (1482-1939)
Distribution of Numbers of wars started by year
Y = # of wars started in a randomly selected year
Levels: y1=0, y2=1, y3=2, y4=3, y5=4
Probability Distribution:

#Wars   Probability
0       0.5284
1       0.3231
2       0.1070
3       0.0328
4       0.0087

[Histogram: number of years (vertical axis, 0 to 300) versus number of wars begun in the year (0, 1, 2, 3, 4, More).]
Masters Golf Tournament 1st Round Scores
[Histogram: frequency (vertical axis, 0 to 600) of 1st-round scores, from 63 to 90.]

Score   Frequency   Probability
63        1         0.000288
64        2         0.000576
65        6         0.001728
66       16         0.004608
67       46         0.013249
68       67         0.019297
69      151         0.043491
70      238         0.068548
71      337         0.097062
72      428         0.123272
73      467         0.134505
74      498         0.143433
75      397         0.114343
76      293         0.084389
77      203         0.058468
78      125         0.036002
79       78         0.022465
80       50         0.014401
81       28         0.008065
82       17         0.004896
83        7         0.002016
84        7         0.002016
85        4         0.001152
86        3         0.000864
87        1         0.000288
88        2         0.000576
Means and Variances of Random Variables
• Mean: Long-run average a random variable will take on
(also the balance point of the probability distribution)
• Expected Value is another term; however, we do not really expect
that a realization of Y will necessarily be close to its mean.
Notation: E(Y)
• Mean and Variance of a discrete random variable:
$$E(Y) = \mu_Y = y_1 p_1 + y_2 p_2 + \cdots + y_k p_k = \sum y_i p_i$$
$$V(Y) = E\left[(Y - \mu)^2\right] = \sum (y_i - \mu)^2 p_i = \sum y_i^2 p_i - \mu^2$$
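A minimal Python sketch (standard library only) applying these formulas to the wars-per-year distribution given earlier:

# Mean and variance of the "wars begun per year" distribution
values = [0, 1, 2, 3, 4]
probs  = [0.5284, 0.3231, 0.1070, 0.0328, 0.0087]

mean = sum(y * p for y, p in zip(values, probs))                 # E(Y) ≈ 0.67
var  = sum((y - mean) ** 2 * p for y, p in zip(values, probs))   # V(Y) = Σ(y-μ)²p
sd   = var ** 0.5

print(round(mean, 4), round(var, 4), round(sd, 4))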
Rules for Means
• Linear Transformations: a + bY (where a and b are constants): $E(a + bY) = \mu_{a+bY} = a + b\mu_Y$
• Sums of random variables: X + Y (where X and Y are random variables): $E(X + Y) = \mu_{X+Y} = \mu_X + \mu_Y$
• Linear Functions of Random Variables: $E(a_1 Y_1 + \cdots + a_n Y_n) = a_1\mu_1 + \cdots + a_n\mu_n$ where $E(Y_i) = \mu_i$
Example: Masters Golf Tournament
• Mean by Round (note the ordering): $\mu_1 = 73.54 \quad \mu_2 = 73.07 \quad \mu_3 = 73.76 \quad \mu_4 = 73.91$
Mean score per hole (18) for round 1: $E\left(\tfrac{1}{18}X_1\right) = \tfrac{1}{18}\mu_1 = \tfrac{1}{18}(73.54) = 4.09$
Mean score versus par (72) for round 1: $E(X_1 - 72) = \mu_1 - 72 = 73.54 - 72 = +1.54$ (1.54 over par)
Mean difference (Round 1 - Round 4): $E(X_1 - X_4) = \mu_1 - \mu_4 = 73.54 - 73.91 = -0.37$
Mean total score: $E(X_1 + X_2 + X_3 + X_4) = \mu_1 + \mu_2 + \mu_3 + \mu_4 = 73.54 + 73.07 + 73.76 + 73.91 = 294.28$ (6.28 over par)
Variance of a Random Variable
$$V(a + bY) = \sigma^2_{a+bY} = b^2\sigma^2_Y$$
$$V(aX + bY) = \sigma^2_{aX+bY} = a^2\sigma^2_X + b^2\sigma^2_Y + 2ab\,\rho\,\sigma_X\sigma_Y$$
where $\rho$ is the correlation between X and Y.
Special Cases:
• X and Y are independent (the outcome of one does not alter the distribution of the other): $\rho = 0$, so the last term drops out
• a = b = 1 and $\rho = 0$: $V(X + Y) = \sigma^2_X + \sigma^2_Y$
• a = 1, b = -1 and $\rho = 0$: $V(X - Y) = \sigma^2_X + \sigma^2_Y$
• a = b = 1 and $\rho \neq 0$: $V(X + Y) = \sigma^2_X + \sigma^2_Y + 2\rho\sigma_X\sigma_Y$
• a = 1, b = -1 and $\rho \neq 0$: $V(X - Y) = \sigma^2_X + \sigma^2_Y - 2\rho\sigma_X\sigma_Y$
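A quick simulation check of the variance rule for a difference of correlated variables, assuming NumPy is available (the values ρ = 0.6, σX = 2, σY = 3 are illustrative, not from the slides):

import numpy as np

# Check V(X - Y) = Var(X) + Var(Y) - 2*rho*sd(X)*sd(Y) for correlated X, Y
rng = np.random.default_rng(1)
rho, sd_x, sd_y = 0.6, 2.0, 3.0
cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T

print(np.var(x - y))                                 # simulated variance of the difference
print(sd_x**2 + sd_y**2 - 2 * rho * sd_x * sd_y)     # 4 + 9 - 2(0.6)(2)(3) = 5.8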
Examples - Wars & Masters Golf
#Wars   Probability   x*p
0       0.5284        0.0000
1       0.3231        0.3231
2       0.1070        0.2140
3       0.0328        0.0983
4       0.0087        0.0349
Sum     1.0000        0.6703     (μ = 0.67)

Score   Probability   x*p
63      0.000288      0.0181
64      0.000576      0.0369
65      0.001728      0.1123
66      0.004608      0.3041
67      0.013249      0.8877
68      0.019297      1.3122
69      0.043491      3.0009
70      0.068548      4.7984
71      0.097062      6.8914
72      0.123272      8.8756
73      0.134505      9.8188
74      0.143433      10.6141
75      0.114343      8.5757
76      0.084389      6.4136
77      0.058468      4.5020
78      0.036002      2.8082
79      0.022465      1.7748
80      0.014401      1.1521
81      0.008065      0.6532
82      0.004896      0.4015
83      0.002016      0.1673
84      0.002016      0.1694
85      0.001152      0.0979
86      0.000864      0.0743
87      0.000288      0.0251
88      0.000576      0.0507
Sum     1             73.54      (μ = 73.54)
Binomial Distribution for Sample Counts
• Binomial “Experiment”
– Consists of n trials or observations
– Trials/observations are independent of one another
– Each trial/observation can end in one of two possible
outcomes often labelled “Success” and “Failure”
– The probability of success, p, is constant across
trials/observations
– Random variable, Y, is the number of successes observed in
the n trials/observations.
• Binomial Distributions: Family of distributions for Y,
indexed by Success probability (p) and number of
trials/observations (n). Notation: Y~B(n,p)
Binomial Distributions and Sampling
• Problem when sampling from a finite population: the
sequence of probabilities of Success is altered after
observing earlier individuals.
• When the population is much larger than the sample (say, at
least 20 times as large), the effect is minimal and we say Y is
approximately binomial
• Obtaining probabilities:
n y
P(Y  y )  P( y )   p (1  p ) n y
 y
n
n!
  
 y  y!(n  y )!
y  0,1,, n
Example - Diagnostic Test
• Test claims to have a sensitivity of 90% (Among people
with condition, probability of testing positive is .90)
• 10 people who are known to have the condition are
identified; Y is the number that correctly test positive
10  k 10k
P(Y  k )   (.9) (.1)
k
k
P(k)
0
1E-10
1
9E-09
10 
10!
  
k  0,1,,10
 k  k!(10  k )!
2
3
4
5
6
7
8
9
10
3.64E-07 8.75E-06 0.000138 0.001488 0.01116 0.057396 0.19371 0.38742 0.348678
• Table obtained in EXCEL with the function BINOMDIST(k, n, p, FALSE)
(the TRUE option gives the cumulative distribution function: P(Y ≤ k))
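For readers working outside Excel, a minimal Python sketch (assuming SciPy is installed) reproduces the same table; scipy.stats.binom.pmf plays the role of BINOMDIST(k, n, p, FALSE) and binom.cdf plays the role of the TRUE option:

from scipy.stats import binom

n, p = 10, 0.9
for k in range(n + 1):
    # pmf = P(Y = k); cdf = P(Y <= k)
    print(k, binom.pmf(k, n, p), binom.cdf(k, n, p))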
Binomial Mean & Standard Deviation
• Let Si = 1 if the ith individual was a success, 0 otherwise
• Then P(Si = 1) = p and P(Si = 0) = 1 - p
• Then $E(S_i) = \mu_S = 1(p) + 0(1 - p) = p$
• Note that Y = S1 + … + Sn and that the trials are independent
• Then $E(Y) = \mu_Y = n\mu_S = np$
• $V(S_i) = E(S_i^2) - \mu_S^2 = p - p^2 = p(1 - p)$
• Then $V(Y) = \sigma^2_Y = np(1 - p)$

$$Y \sim B(n, p) \qquad E(Y) = \mu_Y = np \qquad \sigma_Y = \sqrt{np(1-p)}$$
For the diagnostic test: $\mu = 10(0.9) = 9.0 \qquad \sigma = \sqrt{10(0.9)(0.1)} = 0.95$
Continuous Random Variables
• Variable can take on any value along a continuous
range of numbers (interval)
• Probability distribution is described by a smooth
density curve
• Probabilities of ranges of values for Y correspond to
areas under the density curve
– Curve must lie on or above the horizontal axis
– Total area under the curve is 1
• Special case: Normal distributions
Normal Distribution
• Bell-shaped, symmetric family of distributions
• Classified by 2 parameters: mean (μ) and standard deviation (σ). These represent location and spread
• Random variables that are approximately normal have the following properties with respect to individual measurements:
– Approximately half (50%) fall above (and below) the mean
– Approximately 68% fall within 1 standard deviation of the mean
– Approximately 95% fall within 2 standard deviations of the mean
– Virtually all fall within 3 standard deviations of the mean
• Notation when Y is normally distributed with mean μ and standard deviation σ: $Y \sim N(\mu, \sigma)$

Two Normal Distributions
[Figure: two normal density curves with different means and standard deviations.]

Normal Distribution
$$P(Y > \mu) = 0.50 \qquad P(\mu - \sigma \le Y \le \mu + \sigma) = 0.68 \qquad P(\mu - 2\sigma \le Y \le \mu + 2\sigma) = 0.95$$
Example - Heights of U.S. Adults
• Female and Male adult heights are well approximated by
normal distributions: YF~N(63.7,2.5) YM~N(69.1,2.6)
[Histograms of adult heights: INCHESF (females) with Mean = 63.7, Std. Dev = 2.48, N = 99.68, cases weighted by PCTF; INCHESM (males) with Mean = 69.1, Std. Dev = 2.61, N = 99.23, cases weighted by PCTM.]
Source: Statistical Abstract of the U.S. (1992)
Standard Normal (Z) Distribution
• Problem: Unlimited number of possible normal distributions $(-\infty < \mu < \infty,\ \sigma > 0)$
• Solution: Standardize the random variable to have mean 0 and standard deviation 1
$$Y \sim N(\mu, \sigma) \ \Rightarrow \ Z = \frac{Y - \mu}{\sigma} \sim N(0, 1)$$
• Probabilities of certain ranges of values and specific
percentiles of interest can be obtained through the
standard normal (Z) distribution
Standard Normal (Z) Distribution
[Figure: standard normal density f(z) for z from -4 to 4. The shaded "Table Area" is the cumulative probability P(Z ≤ z); "1 - Table Area" is the area to the right of z.]
Standard Normal (Z) Table: cumulative probabilities P(Z ≤ z). The row label gives the integer part and first decimal place of z; the column gives the second decimal place.

Negative z:
z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
-3.0   0.0013  0.0013  0.0013  0.0012  0.0012  0.0011  0.0011  0.0011  0.0010  0.0010
-2.9   0.0019  0.0018  0.0018  0.0017  0.0016  0.0016  0.0015  0.0015  0.0014  0.0014
-2.8   0.0026  0.0025  0.0024  0.0023  0.0023  0.0022  0.0021  0.0021  0.0020  0.0019
-2.7   0.0035  0.0034  0.0033  0.0032  0.0031  0.0030  0.0029  0.0028  0.0027  0.0026
-2.6   0.0047  0.0045  0.0044  0.0043  0.0041  0.0040  0.0039  0.0038  0.0037  0.0036
-2.5   0.0062  0.0060  0.0059  0.0057  0.0055  0.0054  0.0052  0.0051  0.0049  0.0048
-2.4   0.0082  0.0080  0.0078  0.0075  0.0073  0.0071  0.0069  0.0068  0.0066  0.0064
-2.3   0.0107  0.0104  0.0102  0.0099  0.0096  0.0094  0.0091  0.0089  0.0087  0.0084
-2.2   0.0139  0.0136  0.0132  0.0129  0.0125  0.0122  0.0119  0.0116  0.0113  0.0110
-2.1   0.0179  0.0174  0.0170  0.0166  0.0162  0.0158  0.0154  0.0150  0.0146  0.0143
-2.0   0.0228  0.0222  0.0217  0.0212  0.0207  0.0202  0.0197  0.0192  0.0188  0.0183
-1.9   0.0287  0.0281  0.0274  0.0268  0.0262  0.0256  0.0250  0.0244  0.0239  0.0233
-1.8   0.0359  0.0351  0.0344  0.0336  0.0329  0.0322  0.0314  0.0307  0.0301  0.0294
-1.7   0.0446  0.0436  0.0427  0.0418  0.0409  0.0401  0.0392  0.0384  0.0375  0.0367
-1.6   0.0548  0.0537  0.0526  0.0516  0.0505  0.0495  0.0485  0.0475  0.0465  0.0455
-1.5   0.0668  0.0655  0.0643  0.0630  0.0618  0.0606  0.0594  0.0582  0.0571  0.0559
-1.4   0.0808  0.0793  0.0778  0.0764  0.0749  0.0735  0.0721  0.0708  0.0694  0.0681
-1.3   0.0968  0.0951  0.0934  0.0918  0.0901  0.0885  0.0869  0.0853  0.0838  0.0823
-1.2   0.1151  0.1131  0.1112  0.1093  0.1075  0.1056  0.1038  0.1020  0.1003  0.0985
-1.1   0.1357  0.1335  0.1314  0.1292  0.1271  0.1251  0.1230  0.1210  0.1190  0.1170
-1.0   0.1587  0.1562  0.1539  0.1515  0.1492  0.1469  0.1446  0.1423  0.1401  0.1379
-0.9   0.1841  0.1814  0.1788  0.1762  0.1736  0.1711  0.1685  0.1660  0.1635  0.1611
-0.8   0.2119  0.2090  0.2061  0.2033  0.2005  0.1977  0.1949  0.1922  0.1894  0.1867
-0.7   0.2420  0.2389  0.2358  0.2327  0.2296  0.2266  0.2236  0.2206  0.2177  0.2148
-0.6   0.2743  0.2709  0.2676  0.2643  0.2611  0.2578  0.2546  0.2514  0.2483  0.2451
-0.5   0.3085  0.3050  0.3015  0.2981  0.2946  0.2912  0.2877  0.2843  0.2810  0.2776
-0.4   0.3446  0.3409  0.3372  0.3336  0.3300  0.3264  0.3228  0.3192  0.3156  0.3121
-0.3   0.3821  0.3783  0.3745  0.3707  0.3669  0.3632  0.3594  0.3557  0.3520  0.3483
-0.2   0.4207  0.4168  0.4129  0.4090  0.4052  0.4013  0.3974  0.3936  0.3897  0.3859
-0.1   0.4602  0.4562  0.4522  0.4483  0.4443  0.4404  0.4364  0.4325  0.4286  0.4247
-0.0   0.5000  0.4960  0.4920  0.4880  0.4840  0.4801  0.4761  0.4721  0.4681  0.4641

Positive z:
z      0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0    0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1    0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2    0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3    0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4    0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5    0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6    0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7    0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8    0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9    0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0    0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1    0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2    0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3    0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4    0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5    0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6    0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7    0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8    0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9    0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0    0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1    0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2    0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3    0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4    0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5    0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6    0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7    0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8    0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9    0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0    0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990
Finding Probabilities of Specific Ranges
• Step 1 - Identify the normal distribution of interest (e.g. its mean (μ) and standard deviation (σ))
• Step 2 - Identify the range of values whose probability you wish to determine, (yL, yU), where often the upper or lower bound is ∞ or -∞
• Step 3 - Transform yL and yU into Z-values:
$$z_L = \frac{y_L - \mu}{\sigma} \qquad z_U = \frac{y_U - \mu}{\sigma}$$
• Step 4 - Obtain P(zL ≤ Z ≤ zU) from the Z-table
Example - Adult Female Heights
• What is the probability a randomly selected female is
5’10” or taller (70 inches)?
• Step 1 - Y ~ N(63.7 , 2.5)
• Step 2 - yL = 70.0, yU = ∞
• Step 3 - $z_L = \frac{70.0 - 63.7}{2.5} = 2.52 \qquad z_U = \infty$
• Step 4 - P(Y ≥ 70) = P(Z ≥ 2.52) = 1 - P(Z < 2.52) = 1 - .9941 = .0059 (≈ 1/170)
z      .00     .01     .02     .03
2.4    .9918   .9920   .9922   .9925
2.5    .9938   .9940   .9941   .9943
2.6    .9953   .9955   .9956   .9957
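The same answer can be obtained without the printed table; a minimal sketch, assuming SciPy is available:

from scipy.stats import norm

# P(randomly selected female is 70 inches or taller), Y ~ N(63.7, 2.5)
p = norm.sf(70, loc=63.7, scale=2.5)   # sf = 1 - cdf = P(Y >= 70)
print(p)                               # ≈ 0.0059

# Equivalent calculation via the standardized value
z = (70 - 63.7) / 2.5
print(1 - norm.cdf(z))                 # ≈ 0.0059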
Finding Percentiles of a Distribution
• Step 1 - Identify the normal distribution of interest (e.g. its mean (μ) and standard deviation (σ))
• Step 2 - Determine the percentile of interest, 100p% (e.g. the 90th percentile is the cut-off where 90% of scores are below and 10% are above)
• Step 3 - Find p in the body of the z-table and its corresponding z-value (zp) on the outer edge:
– If 100p < 50, use the left-hand page of the table
– If 100p ≥ 50, use the right-hand page of the table
• Step 4 - Transform zp back to the original units:
$$y_p = \mu + z_p\,\sigma$$
Example - Adult Male Heights
• Above what height do the tallest 5% of males lie?
• Step 1 - Y ~ N(69.1, 2.6)
• Step 2 - Want to determine the 95th percentile (p = .95)
• Step 3 - P(Z ≤ 1.645) = .95
• Step 4 - y.95 = 69.1 + (1.645)(2.6) = 73.4 (6' 1.4")
z      .03     .04     .05     .06
1.5    .9370   .9382   .9394   .9406
1.6    .9484   .9495   .9505   .9515
1.7    .9582   .9591   .9599   .9608
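The same percentile can be found with SciPy's inverse cdf; a minimal sketch, assuming SciPy is available:

from scipy.stats import norm

# 95th percentile of male heights, Y ~ N(69.1, 2.6)
y95 = norm.ppf(0.95, loc=69.1, scale=2.6)   # ppf is the inverse cdf (percentile function)
print(y95)                                   # ≈ 73.4 inches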
Assessing Normality and Transformations
• Obtain a histogram and see if mound-shaped
• Obtain a normal probability plot (see the sketch following this slide):
– Order the data from smallest to largest and rank them (1 to n)
– Obtain a percentile for each: pct = (rank - 0.375)/(n + 0.25)
– Obtain the z-score corresponding to the percentile
– Plot the observed data versus the z-scores and see if the points fall (approximately) on a straight line
• Transformations that can achieve approximate normality:
– Data are percentages: $Y' = \arcsin\left(\sqrt{Y/100}\right)$
– Data are counts: $Y' = \ln(Y + 1)$
– Data are skewed right (and positive): $Y' = \ln(Y)$
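A minimal sketch of a normal probability plot in Python, assuming NumPy, SciPy, and Matplotlib are available (scipy.stats.probplot uses a plotting-position formula of the same general form as above; the simulated data are illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=63.7, scale=2.5, size=200)   # simulated heights as example data

# probplot orders the data, computes theoretical normal quantiles,
# and plots observed values against them; a straight line suggests normality
stats.probplot(data, dist="norm", plot=plt)
plt.show()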
Sampling Distributions
• Distribution of a Sample Statistic: The probability
distribution of a sample statistic obtained from a
random sample or a randomized experiment
– What values can a sample mean (or proportion)
take on and how likely are ranges of values?
• Population Distribution: Set of values for a
variable for a population of individuals.
Conceptually equivalent to a probability distribution, in the
sense of selecting an individual at random and observing their
value of the variable of interest
Sampling Distribution of a Sample Mean
• Obtain a sample of n independent measurements of a quantitative variable, Y1, …, Yn, from a population with mean μ and standard deviation σ
– Averages will be less variable than the individual measurements
– Sampling distributions of averages will become more like a normal distribution as n increases (regardless of the shape of the population of individual measurements)

$$E\left(\overline{Y}\right) = E\!\left(\frac{1}{n}\sum Y_i\right) = \frac{1}{n}(n\mu) = \mu \quad \Rightarrow \quad \mu_{\overline{Y}} = \mu$$
$$V\left(\overline{Y}\right) = V\!\left(\frac{1}{n}\sum Y_i\right) = \frac{1}{n^2}\left(n\sigma^2\right) = \frac{\sigma^2}{n} \quad \Rightarrow \quad \sigma^2_{\overline{Y}} = \frac{\sigma^2}{n}, \qquad \sigma_{\overline{Y}} = \frac{\sigma}{\sqrt{n}}$$
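A quick simulation illustrating these results, assuming NumPy is available (μ = 50, σ = 10, n = 25 are illustrative values, not from the slides):

import numpy as np

# The sample mean of n = 25 draws should have SD close to sigma / sqrt(n)
rng = np.random.default_rng(3)
mu, sigma, n = 50.0, 10.0, 25
sample_means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(sample_means.mean())      # close to mu = 50
print(sample_means.std())       # close to sigma / sqrt(n) = 2
print(sigma / np.sqrt(n))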
Central Limit Theorem
• When random samples of size n are selected from any population with mean μ and finite standard deviation σ, the sampling distribution of the sample mean will be approximately normally distributed for large n:
$$\overline{Y} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \quad \text{approximately, for large } n$$
Z-table can be used to approximate probabilities of ranges of
values for sample means, as well as percentiles of their sampling
distribution
Sample Proportions
• Counts of Successes (Y) rarely reported due to
dependency on sample size (n)
• More common is to report the sample proportion of
successes:
$$\hat{p} = \frac{\#\text{ of successes in sample}}{\text{sample size}} = \frac{Y}{n}$$
$$E(\hat{p}) = \mu_{\hat{p}} = p \qquad V(\hat{p}) = \sigma^2_{\hat{p}} = \frac{p(1-p)}{n} \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
Sampling Distributions for Counts &
Proportions
• For samples of size n, counts (and thus proportions) can take on only n + 1 distinct possible outcomes
• As the sample size n gets large, so does the number of possible values, and the sampling distribution begins to approximate a normal distribution. Common rule of thumb: np ≥ 10 and n(1-p) ≥ 10 to use the normal approximation
$$Y \sim N\!\left(np, \sqrt{np(1-p)}\right) \ \text{(approximately)} \qquad \hat{p} \sim N\!\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \ \text{(approximately)}$$
Sampling Distribution for Y~B(n=1000,p=0.2)
[Figure: "Sampling Distribution of Y (n = 1000, p = 0.2)" — probability (vertical axis, up to about 0.035) versus number of successes (horizontal axis, 0 to 1000).]

$$\mu_Y = np = 1000(.20) = 200 \qquad \sigma_Y = \sqrt{np(1-p)} = \sqrt{1000(.2)(.8)} = 12.65$$
Using Z-Table for Approximate Probabilities
• To find probabilities of certain ranges of counts or proportions,
can make use of fact that the sample counts and proportions are
approximately normally distributed for large sample sizes.
– Define the range of interest
– Obtain the mean of the sampling distribution
– Obtain the standard deviation of the sampling distribution
– Transform the range of interest to a range of Z-values
– Obtain (approximate) probabilities from the Z-table
^

Coin Tossing(He ads) : P p  0.51 | n  1000 tosses 


^
Range : p  0.51 Mean : p  0.50 SD :
(0.5)(0.5)
 .0158
1000
^
z
p

^
p
^
p

0.51  0.50
 0.63
.0158
P( Z  0.63)  1  P( Z  0.63)  1  .7357  .2643
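A minimal Python check of this calculation (assuming NumPy and SciPy are available); like the slide, it ignores the continuity correction, and the exact binomial probability is shown for comparison:

import numpy as np
from scipy.stats import binom, norm

# P(p-hat >= 0.51) for n = 1000 tosses of a fair coin
n, p = 1000, 0.5
sd = np.sqrt(p * (1 - p) / n)              # ≈ 0.0158

approx = norm.sf(0.51, loc=p, scale=sd)    # normal approximation ≈ 0.2643
exact  = binom.sf(509, n, p)               # exact P(Y >= 510), i.e. P(p-hat >= 0.51)
print(approx, exact)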