Bernoulli Distribution
A Bernoulli random variable is one which has
only 0 and 1 as possible values.
Let $p = P(X = 1)$.
Thus a Bernoulli random variable X has the
following “table”:
Possible values of X:   0      1
Probabilities:          1-p    p
Definition: Say that $X \sim B(1, p)$.
Generically, we say that X=1 is a success and
X=0 is a failure. We say that p is the “success”
probability.
A Bernoulli random variable is the simplest
random variable. It models an experiment in
which there are only two outcomes.
Example: A fair die is tossed. Let X = 1 only if
the first toss shows a “4” or “5”. Then
$X \sim B(1, \tfrac{1}{3})$.
Mean and Variance: For a Bernoulli random
variable with success probability p:
$\mu_X = 0(1-p) + 1 \cdot p = p$
$\sigma_X^2 = 0^2(1-p) + 1^2 \cdot p - p^2 = p - p^2 = p(1-p)$
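The mean and variance above can be checked directly from the two-row table. The following is a minimal sketch, assuming exact fractions and the die example's p = 1/3; the function name `bernoulli_mean_var` is hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def bernoulli_mean_var(p):
    """Return (mean, variance) of a Bernoulli(p) variable from its table."""
    values = [0, 1]
    probs = [1 - p, p]
    mean = sum(x * q for x, q in zip(values, probs))
    second_moment = sum(x**2 * q for x, q in zip(values, probs))
    # variance = E(X^2) - (E X)^2
    return mean, second_moment - mean**2

mean, var = bernoulli_mean_var(Fraction(1, 3))
print(mean, var)  # 1/3 2/9
```

Here 2/9 agrees with p(1-p) = (1/3)(2/3).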
Binomial Distribution
Consider n independent Bernoulli random
variables, all with the same success probability
p.
The sum $Y = X_1 + \cdots + X_n$ is called a binomial
random variable with parameters n and p.
Notation: Say $Y \sim B(n, p)$.
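To see what "sum of n independent Bernoullis" means in terms of a probability table, one can build the table of Y by adding one Bernoulli at a time. This is only an illustrative sketch; the helper names `add_bernoulli` and `binomial_table` are hypothetical, and the choice n = 3, p = 0.5 is an assumption made for a small example.

```python
def add_bernoulli(dist, p):
    """Table of (current sum + one more independent Bernoulli(p))."""
    new = [0.0] * (len(dist) + 1)
    for k, prob in enumerate(dist):
        new[k] += prob * (1 - p)   # the new trial is a failure
        new[k + 1] += prob * p     # the new trial is a success
    return new

def binomial_table(n, p):
    """Probabilities of Y = 0, 1, ..., n for Y = X_1 + ... + X_n."""
    dist = [1.0]  # an empty sum equals 0 with probability 1
    for _ in range(n):
        dist = add_bernoulli(dist, p)
    return dist

table = binomial_table(3, 0.5)
print(table)  # [0.125, 0.375, 0.375, 0.125]
```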
The binomial random variables arise in
circumstances which warrant the name
“binomial experiment”. A binomial experiment
is one in which:
1. n trials are to be performed. The number n
is fixed in advance.
2. Each trial consists of an experiment with
exactly two possible outcomes.
The outcomes are labeled, generically, success
and failure.
3. Each trial has the same probability p of
success.
4. The trials are independent.
Binomial random variables address the
following situations:
a) If I toss a coin 10 times, how many heads
will appear?
b) If a machine produces, on average, 5%
defectives, then how many defectives
would there be among a sample of size
200?
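Questions a) and b) can be explored numerically. The sketch below assumes the standard binomial probability formula P(Y = k) = C(n, k) p^k (1-p)^(n-k), which is not stated in these slides, and assumes a fair coin for a). It computes the mean and variance of the count straight from the probability table; the names `binomial_pmf` and `summarize` are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ B(n, p), using the standard binomial formula."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def summarize(n, p):
    """Return the probability table, mean, and variance of B(n, p)."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return pmf, mean, var

# a) 10 tosses of a fair coin (fairness is an assumption)
coin_pmf, coin_mean, coin_var = summarize(10, 0.5)
# b) 200 items from a machine producing 5% defectives
defect_pmf, defect_mean, defect_var = summarize(200, 0.05)

print(round(coin_mean, 6), round(coin_var, 6))
print(round(defect_mean, 6), round(defect_var, 6))
```

On average this gives about 5 heads in 10 tosses and about 10 defectives in a sample of 200, up to floating-point rounding.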
Mean and Variance:
Because a binomial random variable is a sum of
independent Bernoullis, its mean is the sum of
their means and its variance is the sum of their
variances.
If Y is Binomial with parameters n and p, the
mean and variance are given by:
$\mu_Y = np$
$\sigma_Y^2 = np(1-p)$
Population Sampling:
Suppose that a population of N people
consists of M “successes” and N-M “failures”.
Suppose we wish to extract a sample of size n,
one by one, from this population. We are
interested in recording the number of
“successes” we extract.
If, after each person is sampled and recorded as
either “success” or “failure”, we return
him/her to the population to be sampled,
then the samples are all independent and all have
the same probability of producing a success.
Thus sampling with replacement is a binomial
experiment. The number of successes then has a
$B(n, \tfrac{M}{N})$ distribution.
If the people are not replaced after sampling
(sampling without replacement) then the
resulting trials are not independent. In this case,
the distribution is not binomial.
However, if the sample size n is much smaller
than the total population N (say, N is at least
40 times the sample size n), then sampling
without replacement is virtually
indistinguishable from sampling with
replacement. In this case, although the true
distribution is not exactly binomial, the
distribution is approximately binomial.
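The closeness of the two sampling schemes can be checked numerically. The sketch below compares the exact distribution for sampling without replacement (usually called the hypergeometric distribution, a name not used in these slides) with the binomial one. The population figures N = 2000, M = 800, n = 50 are hypothetical, picked so that N is 40 times n.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(k successes) when sampling WITH replacement."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeometric_pmf(k, N, M, n):
    """P(k successes) when n people are drawn WITHOUT replacement
    from N people, of whom M are successes."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

N, M, n = 2000, 800, 50  # hypothetical population; N = 40 * n

# Largest difference between the two tables over k = 0, ..., n
max_gap = max(
    abs(hypergeometric_pmf(k, N, M, n) - binomial_pmf(k, n, M / N))
    for k in range(n + 1)
)
print(max_gap)
```

Under these assumptions the largest gap between the two tables is small compared with the probabilities themselves, which is the sense in which the binomial is a good approximation.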
The actual interpretation of success and failure
could vary. We could be talking about “men”
and “women”, “people in support of measure
A” and “people not in support of measure A”,
etc.
The main thing is that, as long as we sample a
portion that is much smaller than the total
population, the number of “successes” can be
treated as approximately binomial.
Estimating the success probability p
Suppose you have a binomial situation, except
you don’t know the true value of p. What is the
best way to guess p? For instance, if you select
(with replacement) 50 marbles from a jar of red
and green ones, and 34 of them are green, then
your best guess is that 68% are green. In fact,
this is an unbiased strategy!
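Fact: the sample proportion
$\hat{p} = \frac{Y}{n} = \frac{\text{number of successes in } n \text{ trials}}{n}$
is unbiased:
$\mu_{\hat{p}} = \frac{E(Y)}{n} = \frac{np}{n} = p$
and has standard deviation given by:
$\sigma_{\hat{p}} = \sqrt{\frac{V(Y)}{n^2}} = \sqrt{\frac{p(1-p)}{n}}$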
In practice, we replace p with p̂ to estimate
the standard deviation. Notice that the standard
deviation gets smaller as the sample size n
increases!
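For the marble example (34 green out of 50 draws), the estimate and its estimated standard deviation can be computed as in this small sketch. The function name `sample_proportion` is hypothetical, and plugging p̂ in for p follows the "in practice" remark above.

```python
from math import sqrt

def sample_proportion(successes, n):
    """Return (p_hat, estimated standard deviation of p_hat)."""
    p_hat = successes / n
    # p is unknown, so p_hat is plugged into sqrt(p(1-p)/n)
    sd_hat = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, sd_hat

p_hat, sd_hat = sample_proportion(34, 50)
print(round(p_hat, 2), round(sd_hat, 4))  # 0.68 0.066
```

With the same proportion in a sample ten times larger, `sample_proportion(340, 500)` returns a smaller standard deviation, which illustrates the remark about increasing n.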
Large Sample Binomial probabilities
For small samples, probabilities associated with a
binomial random variable can be calculated
directly or computed from the so-called
“binomial tables”. We are going to focus on an
approximation technique used when the sample
size is large.
Fact: Let n be large. Let $X \sim B(n, p)$; that is,
let X be the number of successes in a binomial
experiment of n trials and success probability p.
Then:
X is approximately normal with mean $np$
and standard deviation $\sqrt{np(1-p)}$.
Furthermore, if $\hat{p} = \frac{X}{n}$ is the proportion of
successes in n trials, then:
p̂ is approximately normal with mean
p and standard deviation $\sqrt{\frac{p(1-p)}{n}}$.
The rule of thumb for n to be large enough is
that $np > 10$ and $n(1-p) > 10$.
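The rule of thumb is easy to mechanize. The following sketch returns the mean and standard deviation of the approximating normal together with a flag for whether the rule is met. The function name `normal_approximation` and its `threshold` parameter are hypothetical, and the rule is read here as a strict "greater than 10" on both quantities.

```python
from math import sqrt

def normal_approximation(n, p, threshold=10):
    """Return (mean, sd, ok) for the normal approximation to B(n, p).

    ok is True when both np and n(1-p) exceed the threshold.
    """
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    ok = n * p > threshold and n * (1 - p) > threshold
    return mean, sd, ok

print(normal_approximation(100, 0.25))  # np = 25 and n(1-p) = 75: rule met
print(normal_approximation(20, 0.10))   # np = 2: rule not met
```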
*Optional Topic*
When using the normal distribution to
approximate a binomial distribution, there is a
technique that is used to improve the accuracy of
the approximation. This is called the continuity
correction. The improvement is usually very
minor.
Continuity Correction: Suppose X is binomial
and n is large enough to use the normal
approximation. Recall that X takes only natural
number values 0, 1, 2, 3, …. and that the normal
curve is continuous.
If a and b are natural numbers,
a) Write a probability of the form
$P(X \le a)$ as $P(X \le a + 0.5)$ and then use
the normal approximation.
b) Write a probability of the form
$P(X \ge b)$ as $P(X \ge b - 0.5)$ and then use
the normal approximation.
c) Write a probability of the form
$P(a \le X \le b)$ as $P(a - 0.5 \le X \le b + 0.5)$.
Before applying the continuity correction, be
sure to write the appropriate binomial
probability in one of the above 3 forms.
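The three forms can be sketched as three small functions. This is an illustration only, reading the rules as the usual half-unit adjustments (add 0.5 to an upper bound, subtract 0.5 from a lower bound). The function names are hypothetical, and `statistics.NormalDist` from the Python standard library stands in for the normal table.

```python
from math import sqrt
from statistics import NormalDist

def _approx(n, p):
    """Normal distribution with the same mean and sd as B(n, p)."""
    return NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

def at_most(a, n, p):
    """P(X <= a), computed as P(normal <= a + 0.5)."""
    return _approx(n, p).cdf(a + 0.5)

def at_least(b, n, p):
    """P(X >= b), computed as P(normal >= b - 0.5)."""
    return 1 - _approx(n, p).cdf(b - 0.5)

def between(a, b, n, p):
    """P(a <= X <= b), computed as P(a - 0.5 <= normal <= b + 0.5)."""
    dist = _approx(n, p)
    return dist.cdf(b + 0.5) - dist.cdf(a - 0.5)

print(at_least(35, 100, 0.25))
print(at_most(34, 100, 0.25) + at_least(35, 100, 0.25))  # complementary events
```

Since "at most 34" and "at least 35" both use the boundary 34.5 after correction, the two approximate probabilities add up to 1, as they should for complementary events.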
*End of Optional Topic*
Example: A multiple choice test consists of 100
questions. Each question has 4 possible
responses, only one of which is correct. Jane
randomly guesses at each question on the test.
Let X denote the number of questions
Jane answers correctly.
a) What are the mean and standard deviation
of X?
b) What is the probability Jane answers at
least 35 correctly?
Solution: X is binomial with n=100 and p=.25.
The mean and standard deviation are:
$\mu_X = np = 100(.25) = 25$
$\sigma_X = \sqrt{np(1-p)} = \sqrt{100(.25)(.75)} = 4.33$
For part b, we note that np = 25 and
$n(1-p) = 75$ are both bigger than 10, so we can
use the normal distribution to find the
approximate probability. Since the z-score of
35 is 2.31 (check: (35-25)/4.33 = 2.31), we get
$P(X \ge 35) \approx P(Z \ge 2.31) = .0104$
Thus the probability that Jane answers at least
35 correctly is about 1.04%.
(note: part b was calculated without the
continuity correction)
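The numbers in this example can be reproduced with a few lines of code. The sketch below follows the same steps (mean, standard deviation, z-score rounded to two decimals, normal tail) and, for comparison, also sums the exact binomial probabilities. The exact sum uses the standard binomial formula, which is an addition not found in these slides.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.25
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Normal approximation without continuity correction;
# z is rounded to two decimals, as when reading a normal table.
z = round((35 - mu) / sigma, 2)
p_normal = 1 - NormalDist().cdf(z)

# Exact binomial tail P(X >= 35), for comparison
p_exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(35, n + 1))

print(mu, round(sigma, 2), z, round(p_normal, 4))  # 25.0 4.33 2.31 0.0104
print(p_exact)
```

The exact tail comes out somewhat larger than the uncorrected normal value, so the 1.04% figure should be read as a rough approximation.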
Example: Suppose that, in a large city, 40% of
all registered voters voted in the last election.
Suppose we take a random sample of n = 1500
registered voters. Let p̂ denote the sample
proportion who voted in the last election. (Thus
p̂ is the proportion of this sample of size 1500,
not the proportion of all registered voters.)
What is the probability that $.38 \le \hat{p} \le .42$? In
plain English, we are asking: what is the probability
that the sample proportion differs from the true
proportion by no more than 2 percentage points?
Solution:
Assuming that the population of registered
voters is at least 40 times the size of the sample
(which is very likely for a large city), then the
number of “successes” (people who voted in the
last election) out of n = 1500 is modeled by a
binomial random variable.
Thus p̂ has mean and standard deviation
$\mu_{\hat{p}} = p = .40$
$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{(.40)(.60)}{1500}} = 0.01265$
Now, to calculate $P(.38 \le \hat{p} \le .42)$, we need to
find the z-scores of .38 and .42 and use the
normal distribution. (Here, n is clearly large
enough since np = 1500(.40) = 600 and n(1-p) =
1500(.60) = 900.)
The z-score of .38 is $\frac{.38 - .40}{0.01265} = -1.58$
The z-score of .42 is $\frac{.42 - .40}{0.01265} = 1.58$
Thus
$P(.38 \le \hat{p} \le .42) \approx P(-1.58 \le Z \le 1.58) = .8858$
Therefore, there is an 88.58% chance that the
sample proportion will differ from the true
value of 40% by no more than 2 percentage points.
(once again, the continuity correction was not
used to calculate this probability)
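The voter calculation can be sketched the same way, treating p̂ as normal with mean p and standard deviation sqrt(p(1-p)/n). The variable names are illustrative only. Because the code does not round the z-scores to two decimals or use a printed table, its answer may differ from .8858 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.40, 1500
sd = sqrt(p * (1 - p) / n)              # standard deviation of p_hat
p_hat_dist = NormalDist(mu=p, sigma=sd)  # approximate distribution of p_hat

# P(.38 <= p_hat <= .42), no continuity correction
prob = p_hat_dist.cdf(0.42) - p_hat_dist.cdf(0.38)
print(round(sd, 5), prob)
```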