Chapter 4 – Probability and Probability Distributions
Sections 4.6 - 4.10
Sec 4.6 - Variables
Variable: takes on different values (or attributes)
Random variable: a variable whose value cannot be predicted with certainty
Random Variables
• Qualitative: e.g. political affiliation, color preference, gender
• Quantitative: measurable, numeric outcomes
– Discrete: e.g. # heads tossed, enrollment
– Continuous: e.g. age at marriage, income tax return amounts, height
Recall: We want to know the probability of observing a particular sample
4.7 – Probability Distributions for Discrete RVs
Discrete random variable: a quantitative random variable that can only assume a countable number of values
What is the probability associated with each value of the variable, y?
Probability Distribution of y: theoretical relative frequencies obtained from the probabilities for each value of y
The probability distribution for a discrete r.v. y displays the probability P(y) associated with each value of y.
Probability Distributions – Discrete RVs
Example. Consider the tossing of 2 coins, and define the variable, y, to be the number
of heads observed. Possible values of y: 0, 1, 2.
Suppose that empirical sampling yields the following:

y     freq
0     129
1     242
2     129

Theoretical probability distribution of y:

y     P(y)
0     0.25
1     0.5
2     0.25

Empirical probability distribution of y:

y     freq    rel. freq
0     129     0.258
1     242     0.484
2     129     0.258

[Figure: theoretical and empirical probability distributions]
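The two tables above can be reproduced with a short sketch. It assumes the coins are fair (which is what the theoretical values 0.25, 0.5, 0.25 imply) and uses the observed frequencies from the example; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product

# Enumerate the four equally likely outcomes of tossing two fair coins
# (fairness is an assumption implied by the theoretical table).
outcomes = list(product("HT", repeat=2))
head_counts = Counter(toss.count("H") for toss in outcomes)
theoretical = {y: head_counts[y] / len(outcomes) for y in sorted(head_counts)}

# Observed frequencies from the example (500 repetitions in total).
freq = {0: 129, 1: 242, 2: 129}
total = sum(freq.values())
empirical = {y: f / total for y, f in freq.items()}

for y in theoretical:
    print(y, theoretical[y], round(empirical[y], 3))
```

The empirical relative frequencies are close to, but not identical with, the theoretical probabilities, which is the point of comparing the two distributions.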
4.9 – Probability Distributions for Continuous RVs
Continuous random variable: a quantitative variable that assumes values on an interval, with uncountably many possible values
Example. Consider the random variable, y, that is the height (in feet) of an 18 year old male in the US. The following is sample data collected from 400 individuals:
5.4959
5.1775
5.5252
5.5149
5.8677
6.0338
5.7611
6.0666
5.4425
6.0563
6.0389
5.8694
5.8676
5.657
5.5939
6.0166
5.5738
5.8398
5.6871
5.507
6.1842
5.7821
5.2276
5.3949
6.0263
5.1296
5.5501
6.0701
5.5281
5.8492
5.6393
6.0046
6.1379
4.88
5.3819
6.0115
5.8321
5.2287
5.5259
6.2378
5.355
5.4401
5.8159
5.0646
5.8472
5.5753
5.4692
5.443
5.531
5.5884
5.7402
6.3875
6.1127
5.5075
6.1356
5.7265
5.9682
5.5698
6.0983
5.6197
6.2809
5.3006
6.3141
5.7218
6.0568
5.8255
6.2666
6.1674
6.0101
5.7745
5.7285
5.1014
5.6116
5.8364
5.9536
6.3543
5.5446
6.0165
5.3412
5.8324
5.7134
6.059
5.9569
5.0824
5.5485
5.6261
5.8486
6.021
5.8013
6.0271
5.0287
6.1283
6.2263
5.8978
6.0826
5.4464
6.1591
6.1074
6.0809
5.6737
5.6471
5.4853
5.9461
6.0436
5.6967
5.8822
6.2048
6.1333
5.8701
5.4296
5.5771
6.1083
5.9475
5.4783
5.884
5.4195
5.6618
4.9667
6.0842
5.764
5.0979
6.0266
5.2806
5.8427
5.6159
5.7914
4.8571
5.7518
5.9826
6.0221
6.147
6.0214
6.0511
5.837
5.5411
5.8685
5.9412
5.6256
6.3245
5.8701
5.1727
6.2656
5.4449
5.6625
5.8772
4.9746
5.5297
6.0805
5.9787
5.6123
5.8874
5.0799
5.4901
5.7411
5.8428
6.2718
6.316
5.3717
5.6827
4.9793
6.0661
5.5194
6.0852
6.1343
5.9478
5.9275
5.816
5.9914
5.9585
6.0786
5.8828
5.4569
5.6197
5.4685
5.5195
6.0855
5.2129
5.6347
5.6128
5.7243
5.6584
5.4245
5.7689
5.7179
5.8168
5.95
5.7378
5.561
5.7364
5.4756
5.182
5.3421
5.758
5.5634
6.1686
5.9169
5.1582
5.4857
5.8049
6.1407
5.7264
5.7496
5.79
6.0218
5.5037
6.136
5.9231
5.7579
5.7264
5.6931
5.8045
5.6823
5.1731
5.2436
5.9424
5.8158
6.2163
6.1042
5.941
4.9846
5.9386
6.1722
5.7141
6.0471
6.2947
6.1162
5.8132
5.4572
4.923
5.665
5.7863
6.2311
5.4665
5.4851
5.1913
5.6608
5.6512
6.1833
5.2148
5.5588
5.8119
5.7858
5.3983
5.5923
6.0367
6.0458
6.1518
5.9798
6.0323
5.4616
5.7405
6.5448
5.4272
5.8076
6.1057
5.635
4.8951
6.4544
5.8282
5.799
5.2734
5.8127
6.1525
5.0873
5.8416
5.7234
5.0576
5.8679
5.7128
5.7851
5.9669
5.6306
4.9118
5.2619
5.7107
5.785
5.8351
6.0254
5.7891
5.1043
5.8639
5.4893
6.0336
5.8506
5.8335
6.4278
5.9166
5.8254
5.5214
5.7581
5.7162
5.8247
5.5251
5.1302
5.5433
6.3308
6.1923
5.6666
5.7719
5.4055
5.0933
5.9272
5.4326
5.2863
6.1558
6.0485
5.8888
6.027
5.8026
5.7367
5.6585
5.7406
5.95
5.2857
6.2109
5.4785
6.1177
6.1106
5.7776
5.5726
6.0865
5.6194
5.6912
6.6181
5.1919
5.6631
5.0959
6.0079
5.7482
5.4951
5.7582
6.1118
5.9222
5.6398
5.8039
5.9385
5.4786
6.4469
5.1963
5.113
6.4342
5.3864
6.0048
5.8154
6.4617
5.5863
5.3411
6.266
5.8124
5.4758
5.2903
6.0596
5.6678
5.7008
5.5016
5.7649
5.5847
5.9892
5.6348
5.7942
5.5351
6.1135
5.0156
5.8419
5.55
5.9654
5.1307
5.6896
5.4328
5.3639
5.9524
5.5356
6.4147
5.9354
5.8087
5.9362
6.3131
5.9155
4.8988
6.3403
Probability Distribution for Continuous RV
• Example (ctd). The variable values have to be binned to form a relative frequency histogram. The interval lengths and numbers of bins can be refined. With more data, and finer binning, the histogram outline will approach a smooth curve.
[Figures: histograms of the sample with 18 bins and with 40 bins]
• With 1000 data points, a smooth curve outline appears to be emerging.
• The smooth curve is the probability distribution associated with variable y, the height of an 18 yr old male in the US.
Discrete and Continuous Probability Distributions
• Probability distributions provide a means of quantifying the probability of obtaining a certain sample outcome.
Note: Probabilities are equal to the fraction of the total histogram area corresponding to the values of interest.
Discrete case:
1. Probability of observing two heads when a coin is tossed two times is 0.25.
2. Probability of observing at least one head is 0.5 + 0.25 = 0.75.
3. Probability of observing either no heads or two heads is 0.25 + 0.25 = 0.5.
Discrete and Continuous Probability Distributions
Continuous case:
1. Does it make sense to ask “what is the probability that an 18 y.o. male is exactly 5’10”?” No: a continuous variable takes any single exact value with probability 0, so we ask about ranges of values instead.
2. Note: The distribution plot was created using relative frequencies, so the total area under the plot is 1.
3. We compute the probability of a value falling in a certain range of values by computing the area that lies under the distribution plot, over that range.
Example: The probability that an 18 y.o. male has a height that lies between 5.7 and 5.8 feet is approx 0.1.
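For a histogram built from relative frequencies, the area over a range is the fraction of sample values that fall in that range. A minimal sketch of that computation follows; `relative_frequency` is a hypothetical helper name, and the list holds only the first 20 heights from the sample listed earlier, so the result is a crude illustration rather than the 0.1 figure quoted above.

```python
def relative_frequency(values, low, high):
    """Fraction of sample values y with low <= y < high."""
    in_range = sum(1 for v in values if low <= v < high)
    return in_range / len(values)

# First 20 heights (in feet) from the sample data; an illustrative subset only.
heights = [
    5.4959, 5.1775, 5.5252, 5.5149, 5.8677,
    6.0338, 5.7611, 6.0666, 5.4425, 6.0563,
    6.0389, 5.8694, 5.8676, 5.657, 5.5939,
    6.0166, 5.5738, 5.8398, 5.6871, 5.507,
]

# Estimated probability that a height lies between 5.7 and 5.8 feet.
print(relative_frequency(heights, 5.7, 5.8))
```

With more data the estimate settles toward the area under the smooth curve over the same range.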
Half-way Summary
• So far:
1. How to create probability distributions from empirical/theoretical discrete and continuous random variables.
2. How to determine probabilities of a variable attaining a certain value (discrete) or attaining a value that lies within a certain range (continuous).
3. Why is this useful? (Q: what is the probability of obtaining a particular sample?)
4. Some common known distributions: binomial (discrete), normal (continuous), t-distribution (continuous), chi-squared (continuous).
5. Can make assumptions about the type of distribution associated with particular populations of interest, i.e. that it is one of the known distributions.
6. Can determine features of the underlying distributions by simulation or other empirical observations.
The Binomial Distribution - Discrete
Binomial Distribution properties:
1. experiment has n identical trials
2. each trial is either a success or failure (2 possible outcomes)
3. P(success) = π for every trial, fixed
4. trials are independent (the outcome of one trial does not affect the outcome of any other(s))
5. variable, y = # of successes in the n trials
Examples.
1. y = # heads when a coin is tossed n times (success = heads)
2. y = # light bulbs that fail inspection when n selected from a batch are
tested (success = failed inspection)
3. y = # of people who test positive for a bacterial infection out of n who
have been exposed to the bacteria (success = positive test result)
The Binomial Distribution (ctd)
• P(y) = probability of obtaining y successes in n trials of a binomial experiment
Example (Computing P(y)). Suppose there is a 25% chance that a pregnancy test fails. What is the probability that out of a sample of 5 tests, all 5 fail? I.e. what is P(5)?
P(5) = P(the 1st test fails and the 2nd test fails and the 3rd test fails and … and the 5th test fails)
$$P(5) = (0.25)(0.25)\cdots(0.25) = (0.25)^5 \approx 0.000977$$
Now, what is P(2)?
The Binomial Distribution (ctd)
• What is P(2)?
P(2) = P(1st fails and 2nd fails and rest don’t OR 1st fails and 3rd fails and rest don’t OR …)
$$P(2) = (0.25)(0.25)(0.75)(0.75)(0.75) + (0.25)(0.75)(0.25)(0.75)(0.75) + \cdots + (0.75)(0.75)(0.75)(0.25)(0.25)$$
$$P(2) = \binom{5}{2}(0.25)^2(0.75)^3 = \frac{5!}{3!\,2!}(0.25)^2(0.75)^3 \approx 0.2637$$
P(2) = (# ways to select 2 failing tests out of 5) * (probability of 2 tests failing) * (probability of 3 tests not failing)
= 5C2 * 0.25^2 * 0.75^3
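The counting argument above can be checked numerically. This is a small sketch, not part of the original slides; `binomial_pmf` is an illustrative name, and `math.comb` (Python 3.8+) supplies the number of ways to choose which tests fail.

```python
from math import comb

def binomial_pmf(y, n, pi):
    """P(y) = (ways to choose y successes) * pi**y * (1 - pi)**(n - y)."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

# 5 tests, each failing with probability 0.25 (success = test fails).
p_all_five = binomial_pmf(5, 5, 0.25)  # all 5 fail
p_two = binomial_pmf(2, 5, 0.25)       # exactly 2 fail

print(round(p_all_five, 6), round(p_two, 4))
```

The factor `comb(5, 2)` equals 10, the number of distinct orderings summed in the long expression for P(2).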
The Binomial Distribution (ctd)
Probability of y successes in n trials of a binomial experiment:
$$P(y) = \frac{n!}{y!\,(n-y)!}\,\pi^{y}(1-\pi)^{n-y}$$
y = # successes in n trials
n = # trials
π = probability of success on a single trial
Mean and Standard Deviation of the Binomial Distribution:
Mean: $\mu = n\pi$
Standard Deviation: $\sigma = \sqrt{n\pi(1-\pi)}$
The Binomial Distribution (ctd)
• Example. What is the probability that 6 out of 20 tests fail, if the probability that any one test fails is 25%?
Success = test fails. So, π = 0.25, n = 20, y = 6.
$$P(6) = \frac{20!}{6!\,14!}(0.25)^6(0.75)^{14} = \frac{20\cdot 19\cdot 18\cdot 17\cdot 16\cdot 15}{6\cdot 5\cdot 4\cdot 3\cdot 2\cdot 1}(0.25)^6(0.75)^{14} \approx 0.1686$$
• What are the mean and standard deviation of this distribution?
$$\mu = 20 \cdot 0.25 = 5$$
$$\sigma = \sqrt{20 \cdot 0.25 \cdot (0.75)} \approx 1.94$$
Note: P(y ≥ 7) = P(7) + P(8) + P(9) + … + P(20) = 1 – P(y ≤ 6)
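As a check on the example with n = 20 and π = 0.25, the sketch below evaluates P(6), the mean, the standard deviation, and the complement rule from the note. The helper name `pmf` is illustrative.

```python
from math import comb, sqrt

n, pi = 20, 0.25  # 20 tests, each failing with probability 0.25

def pmf(y):
    """Binomial probability of exactly y successes in n trials."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)

p6 = pmf(6)
mean = n * pi
sd = sqrt(n * pi * (1 - pi))

# Complement rule: P(y >= 7) = 1 - P(y <= 6).
p_at_least_7 = 1 - sum(pmf(y) for y in range(0, 7))

print(round(p6, 4), mean, round(sd, 2), round(p_at_least_7, 4))
```

Summing 7 terms and subtracting from 1 is less work than summing the 14 terms P(7) through P(20), which is why the note writes the probability as a complement.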
The Normal Distribution - Continuous
• Bell-shaped curve, symmetric about the mean
• Numerous continuous random variables have a normal distribution, e.g. test scores, weight, 100m sprint times
• Normal curve is defined by μ and σ
• Empirical rule holds: approx 68% of the population lies within ± 1σ of μ
• P(y1 ≤ y < y2) = area under the normal curve between y = y1 and y = y2
Normal curve, f(y):
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^2/(2\sigma^2)}$$
The Normal Distribution
• Computing probabilities for normally distributed populations:
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^2/(2\sigma^2)}$$
$$P(y_1 \le y < y_2) = \int_{y_1}^{y_2} f(y)\,dy = \int_{y_1}^{y_2} \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(y-\mu)^2/(2\sigma^2)}\,dy$$
P(5.5 ≤ y < 5.7) = 0.1844
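The integral has no closed form, so in practice it is evaluated numerically or looked up. The sketch below does both for the range 5.5 to 5.7. The slide does not state μ and σ for this example; the values μ = 5.75 and σ = 0.4 are an assumption borrowed from the height-like distribution used in the percentile example later in these notes, and they approximately reproduce the quoted 0.1844. `normal_pdf` and `area` are illustrative names; `statistics.NormalDist` is from the Python standard library (3.8+).

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(y, mu, sigma):
    """Normal curve f(y) for mean mu and standard deviation sigma."""
    return exp(-((y - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def area(y1, y2, mu, sigma, steps=10_000):
    """Midpoint-rule approximation of the integral of f(y) over [y1, y2]."""
    width = (y2 - y1) / steps
    total = sum(normal_pdf(y1 + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return total * width

mu, sigma = 5.75, 0.4  # ASSUMED parameters, not given on this slide

numeric = area(5.5, 5.7, mu, sigma)
dist = NormalDist(mu, sigma)
from_cdf = dist.cdf(5.7) - dist.cdf(5.5)

print(round(numeric, 4), round(from_cdf, 4))
```

The two numbers agree because the cumulative distribution function is itself the area under f(y) to the left of a value.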
The Normal Distribution – Standard Normal
Computing probabilities (ctd):
- Normal curves vary by variable values (x-axis) and depend on μ and σ, but are identical in shape
- Standard normal distribution: μ = 0 and σ = 1
- Tables exist for areas under this graph (Table 1, Appendix of text)
- In a standard normal distribution, the values on the x-axis are known as z-values
- Values between z = 0.5 and z = 1.1 are measurements that lie between 0.5 and 1.1 standard deviations above the mean of 0.
The Normal Distribution – Reading from the table
• Table 1 contains areas under the standard normal curve that lie to the left of a particular z-value.
• I.e. reading the entry corresponding to z1, we obtain P(z < z1).
So
P(0.5 ≤ z < 1.1) = P(z < 1.1) - P(z < 0.5)
= 0.8643 - 0.6915
= 0.1728
[Figure: standard normal curve over the z-values axis, with the areas P(z < 0.5), P(z < 1.1) and P(0.5 ≤ z < 1.1) shaded]
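The table lookup can be mimicked in code. This sketch uses `statistics.NormalDist` from the Python standard library (3.8+) as a stand-in for Table 1; its `cdf` method returns the area to the left of a z-value.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mu = 0, sigma = 1

p_below_05 = std_normal.cdf(0.5)   # area to the left of z = 0.5
p_below_11 = std_normal.cdf(1.1)   # area to the left of z = 1.1
p_between = p_below_11 - p_below_05

print(round(p_below_05, 4), round(p_below_11, 4))
```

The difference computed from unrounded areas can differ from the table-based 0.1728 in the last digit, because the table entries are already rounded to four decimals.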
The Normal Distribution – Z-scores
• We can use Table 1 for arbitrary normal distributions, as long as μ and σ are known.
• This is done by standardizing the measurement values, y, to standard normal values known as z-scores:
$$z = \frac{y - \mu}{\sigma}$$
Example. Consider a normal distribution with μ = 25 and σ = 3.5. Compute the probability that the value of a measurement lies between 27 and 30.
$$P(27 \le y < 30) = P\left(\frac{27-25}{3.5} \le z < \frac{30-25}{3.5}\right) = P(z < 1.4286) - P(z < 0.5714)$$
$$\approx 0.9236 - 0.7157 = 0.2079$$
Here y1 = 27 and y2 = 30 standardize to z1 = 0.5714 and z2 = 1.4286; the areas 0.9236 and 0.7157 are the Table 1 entries for z rounded to 1.43 and 0.57.
There is a 20.79% probability that y takes a value between 27 and 30.
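The standardizing step can be sketched as follows, again with `statistics.NormalDist` standing in for Table 1. The code also computes the same probability directly from the N(25, 3.5) distribution to show that standardizing does not change the answer.

```python
from statistics import NormalDist

mu, sigma = 25, 3.5

# Standardize the endpoints: z = (y - mu) / sigma.
z1 = (27 - mu) / sigma
z2 = (30 - mu) / sigma

std_normal = NormalDist()
prob_via_z = std_normal.cdf(z2) - std_normal.cdf(z1)

# Same probability computed directly on the original scale.
original = NormalDist(mu, sigma)
prob_direct = original.cdf(30) - original.cdf(27)

print(round(z1, 4), round(z2, 4), round(prob_via_z, 4))
```

The result is slightly below the table-based 0.2079 because the table lookup rounds the z-scores to two decimals before reading off the areas.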
The Normal Distribution – Percentiles
• Def: The 100pth percentile of a distribution is the value yp such that 100p% of the population values lie below yp and 100(1-p)% lie above yp.
• To find percentiles of the standard normal distribution:
– reverse lookup of Table 1
Example. Find the 33rd percentile of the standard normal distribution.
Need to find zp such that 100p% of values lie below zp. I.e. find zp such that P(z ≤ zp) = 33%
From Table 1: zp = -0.44
So, the 33rd percentile is -0.44
The Normal Distribution – Percentiles
• To apply this idea to general normal distributions, we do a reverse standardizing:
• The 100pth percentile is yp such that 100p% of measurements lie below yp. I.e. P(y ≤ yp) = 100p%
• We can find the z-score associated with 100p%, and convert it back to y-values using:
$$y_p = \mu + z_p\,\sigma$$
Example. For the normal distribution with μ = 5.75 and σ = 0.4, find the 40th percentile.
• From Table 1, zp = -0.25
• yp = 5.75 + (-0.25)*0.4 = 5.65
• The 40th percentile of this distribution is 5.65.
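The reverse lookup corresponds to the inverse of the cumulative distribution function. The sketch below uses `NormalDist.inv_cdf` from the Python standard library (3.8+) in place of reading Table 1 backwards, then applies the reverse standardizing formula to the 40th percentile example.

```python
from statistics import NormalDist

std_normal = NormalDist()

# Reverse lookup: z-values with 33% and 40% of the area to their left.
z_33 = std_normal.inv_cdf(0.33)
z_40 = std_normal.inv_cdf(0.40)

# Reverse standardizing: y_p = mu + z_p * sigma.
mu, sigma = 5.75, 0.4
y_40 = mu + z_40 * sigma

# The same percentile taken directly from the N(5.75, 0.4) distribution.
y_40_direct = NormalDist(mu, sigma).inv_cdf(0.40)

print(round(z_33, 2), round(z_40, 2), round(y_40, 2))
```

The unrounded z-values differ slightly from the two-decimal table entries, but the percentiles agree with the worked examples after rounding to two decimals.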