Download Unit 2: Probability and distributions Lecture 2: Binomial and Normal

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Central limit theorem wikipedia , lookup

Transcript
Today’s agenda
Unit 2: Probability and distributions
Lecture 2: Binomial and Normal distribution
Tree diagrams recap (from the reading 2.2.6)
The binomial distribution and its parameters
Statistics 101
The normal distribution and its parameters
Gary Larson
The normal approximation to the binomial
Relevant reading: 3.4 (thru p142), 3.1.
July 7, 2015
Statistics 101 (Gary Larson)
Binary outcomes
July 7, 2015
2 / 57
Binary outcomes
Milgram experiment
Milgram experiment (cont.)
Stanley Milgram, a Yale University
psychologist, conducted a series of
experiments on obedience to
authority starting in 1963.
These experiments measured the willingness of participants to
obey an authority who instructed them to perform acts conflicting
with their personal conscience.
Experimenter (E) orders the teacher
(T), the subject of the experiment, to
give severe electric shocks to a
learner (L) each time the learner
answers a question incorrectly.
About 65% of people would obey authority and give such shocks,
and only 35% refused.
Over the years, additional research suggested this number is
approximately consistent across communities and time.
The learner is an actor and the
shocks are not real. But a
prerecorded sound is played each
time the teacher administers a
shock.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
U2 - L2: Normal distribution
July 7, 2015
3 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
4 / 57
Binary outcomes
Binomial distribution
Suppose we randomly select four individuals to participate in this experiment. What is the probability that exactly 1 of them will refuse to
administer the shock?
Binary outcomes
Observing the behavior of each “teacher” in Milgram’s
experiment can be thought of as a trial which has two possible
outcomes (binary outcome).
Let’s call these people Allen (A), Brittany (B), Caroline (C), and Damian (D). Each
one of the four scenarios below will satisfy the condition of “exactly 1 of them refuses
to administer the shock”:
A trial is labeled a success if that teacher refuses to administer a
severe shock, and failure if he does administer a shock.
Scenario 1
Scenario 2
Since only 35% of people refused to administer a shock, the
probability of success is p = 0.35.
Scenario 3
When an individual trial has a binary outcome of
“success”/“failure” and an associated probability of success, the
trial is also called a Bernoulli random variable.
Scenario 4
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
(A) refuse
.35
(A) shock
.65
(A) shock
.65
(A) shock
.65
×
×
×
×
(B) shock
.65
(B) refuse
.35
(B) shock
.35
(B) shock
.35
×
×
×
×
(C) shock
.65
(C) shock
.65
(C) refuse
.35
(C) shock
.65
×
×
×
×
(D) shock
.65
(D) shock
.65
(D) shock
.65
(D) refuse
.35
= 0.0961
= 0.0961
= 0.0961
= 0.0961
The probability of exactly one 1 of 4 people refusing to administer the shock is the
sum of all of these probabilities.
Today we are studying sequences of Bernoulli trials, using the
binomial distribution.
0.0961 + 0.0961 + 0.0961 + 0.0961 = 4 × 0.0961 = 0.3844
5 / 57
Statistics 101 (Gary Larson)
The binomial distribution
U2 - L2: Normal distribution
Binomial distribution
Binomial distribution
July 7, 2015
6 / 57
The binomial distribution
Counting the # of scenarios
The question from the prior slide asked for the probability of given
number of successes, k , in a given number of trials, n, (k = 1
success in n = 4 trials), and we calculated this probability as
Earlier we wrote out all possible scenarios that fit the condition of
exactly one person refusing to administer the shock. If n was larger
and/or k was different than 1, for example, n = 9 and k = 2:
# of scenarios × P (single scenario )
RRSSSSSSS
SRRSSSSSS
SSRRSSSSS
# of scenarios: there is a less tedious way to figure this out,
···
we’ll get to that shortly...
SSRSSRSSS
P (single scenario ) = p k (1 − p )(n−k )
···
probability of success to the power of number of successes, probability of failure to the power of number of failures
SSSSSSSRR
The binomial distribution describes the probability of having exactly k
successes in n independent Bernouilli trials with probability of
success p.
Statistics 101 (Gary Larson)
Considering many scenarios
U2 - L2: Normal distribution
July 7, 2015
writing out all possible scenarios would be incredibly tedious and
prone to errors.
7 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
8 / 57
Binomial distribution
The binomial distribution
Binomial distribution
Calculating the # of scenarios
The binomial distribution
Conditions for the Binomial Distribution
Choose function
The choose function is useful for calculating the number of ways to
choose k successes in n trials.
!
n
n!
=
k
k !(n − k )!
k = 1, n = 4:
4
k = 2, n = 9:
9
1
2
=
4!
1!(4−1)!
=
9!
2!(9−2)!
=
4×3×2×1
1×(3×2×1)
=4
=
9×8×7!
2×1×7!
72
2
=
1
The number of trials n is fixed.
2
The trials are independent.
3
Each trial outcome can be classified as a success or failure.
4
The probability of success p is the same for each trial.
The parameters of the binomial distribution are n and p
= 36
Note: You can also use R for these calculations:
> choose(9,2)
[1] 36
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
9 / 57
Statistics 101 (Gary Larson)
The binomial distribution
U2 - L2: Normal distribution
Binomial distribution
Binomial distribution (cont.)
July 7, 2015
10 / 57
Example
Example
Binomial probabilities
If p represents probability of success, (1 − p ) represents probability of
failure, n represents number of independent trials, and k represents
number of successes
!
P (k successes in n trials ) =
n k
p (1 − p )(n−k )
k
At Duke University, 82% of students live in university owned or
affiliated housing. A group of 12 students was randomly chosen to
speak to incoming students at orientation. What is the probability that
(1) zero, (2) one, (3) two of these students live in student housing?
P (Zero UH) =
P (One UH) =
P (Two UH) =
# of scenarios × P (single scenario )
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
11 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
12 / 57
Binomial distribution
Example
Binomial distribution
Example
Example
Participation question
Which of the following is not a condition that needs to be met for the
binomial distribution to be applicable?
Now suppose we are going to randomly select 5 of these students to
be on a panel to speak to the parents. What is the probability that at
least four (ie four or more) of these selected students live in student
housing?
(a) the trials must be independent
(b) the number of trials, n, must be fixed
(c) each trial outcome must be classifiable as a success or a failure
P (At least four) =
(d) the number of desired successes, k , must be greater than the
number of trials
P (At least one) =
(e) the probability of success, p, must be the same for each trial
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
13 / 57
Statistics 101 (Gary Larson)
Example
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
14 / 57
Example
Participation question
Participation question
A 2012 Gallup survey suggests that 26.2% of Americans are obese.
Among a random sample of 10 Americans, what is the probability that
exactly 8 are obese?
A 2012 Gallup survey suggests that 26.2% of Americans are obese.
Among a random sample of 10 Americans, what is the probability that
exactly 8 are obese?
(a) pretty high
(a) 0.2628 × 0.7382
(b) pretty low
(b)
8
× 0.2628 × 0.7382
10
8
2
(c) 10
8 × 0.262 × 0.738
2
8
(d) 10
8 × 0.262 × 0.738
Gallup: http:// www.gallup.com/ poll/ 160061/ obesity-rate-stable-2012.aspx , January 23, 2013.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
15 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
16 / 57
Binomial distribution
Expected value and variability of successes
Binomial distribution
Expected value
Expected value and variability of successes
Expected value and its variability
Mean and standard deviation of binomial distribution
A 2012 Gallup survey suggests that 26.2% of Americans are obese.
µ = np
Among a random sample of 100 Americans, how many would you expect to be obese?
σ=
q
np (1 − p )
Going back to the obesity rate:
Easy enough, 100 × 0.262 = 26.2.
σ=
Or more formally, µ = np = 100 × 0.262 = 26.2.
√
q
np (1 − p ) =
100 × 0.262 × 0.738 ≈ 4.4
We would expect 26.2 out of 100 randomly sampled Americans
to be obese, give or take 4.4.
But this doesn’t mean in every random sample of 100 people
exactly 26.2 will be obese. In fact, that’s not even possible. In
some samples this value will be less, and in others more. How
much would we expect this value to vary?
Note: Mean and standard deviation of a binomial might not always be whole
numbers, and that is alright, these values represent what we would expect to see on
average.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
17 / 57
Statistics 101 (Gary Larson)
Expected value and variability of successes
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
18 / 57
Expected value and variability of successes
Participation question
Unusual observations
Using the notion that observations that are more than 2 standard
deviations away from the mean are considered unusual and the mean
and the standard deviation we just computed, we can calculate a
range for the plausible number of obese Americans in random
samples of 100.
An August 2012 Gallup poll suggests that 13% of Americans think
home schooling provides an excellent education for children. Would
a random sample of 1,000 Americans where only 100 share this opinion be considered unusual?
(a) No
(b) Yes
26.2 ± (2 × 4.4) = (17.4, 35)
26.2: a measure of center
2: a multiple
4.4: a measure of spread
result: a range of values
This procedure (and variants thereof) will reappear many times and
be studied much more in-depth in this course.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
http:// www.gallup.com/ poll/ 156974/ private-schools-top-marks-educating-children.aspx
19 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
20 / 57
Binomial distribution
Expected value and variability of successes
Binomial distribution
Expected value and variability of successes
An analysis of Facebook users
A recent study found that “Facebook users get more than they give”.
For example:
40% of Facebook users in our sample made a friend request, but
63% received at least one request
Users in our sample pressed the like button next to friends’
content an average of 14 times, but had their content “liked” an
average of 20 times
A recent study found that approximately 25% of all Facebook users
are power users. For a Facebook user with 245 friends, what is the
probability that he/she has 70 or more friends who are power users?
We are given that n = 245, p = 0.25, and we are asked for the
probability P (K ≥ 70).
P (K ≥ 70) = P (K = 70 or K = 71 or K = 72 or · · · or K = 245)
Users sent 9 personal messages, but received 12
= P (K = 70) + P (K = 71) + P (K = 72) + · · · + P (K = 245)
12% of users tagged a friend in a photo, but 35% were
themselves tagged in a photo
This seems like an awful lot of work...
Any guesses for how this pattern can be explained?
http:// www.pewinternet.org/ Reports/ 2012/ Facebook-users/ Summary.aspx
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Binomial distribution
July 7, 2015
21 / 57
Statistics 101 (Gary Larson)
Expected value and variability of successes
U2 - L2: Normal distribution
Binomial distribution
Histograms of number of successes
July 7, 2015
22 / 57
Expected value and variability of successes
We need some more tools...
Hollow histograms of samples from the binomial model where p =
0.10 and n = 10, 30, 100, and 300. What happens as n increases?
Ideally, we would like to be able to avoid adding up the heights of
all those little bars.
0
2
4
6
0
2
n = 10
4
6
8
So, instead of doing all that tediuousness, we are going to drape
a density curve over the histogram. (what’s a density curve, you
ask? hang on.)
10
n = 30
This turns our discrete distribution (binomial) into a continuous
distribution.
A very special continuous distribution indeed.
0
5
10
15
20
10
20
n = 100
Statistics 101 (Gary Larson)
30
40
50
n = 300
U2 - L2: Normal distribution
July 7, 2015
23 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
24 / 57
Binomial distribution
Expected value and variability of successes
Binomial distribution
But first...what’s a density curve?
Expected value and variability of successes
Imagining Density Curves
A density curve is a smoothed histogram where the total area under
the curve is 1.
We have a very large sample size. What does our histogram look like?
What does this mean the density curve will look like?
Measuring areas under a density curve corresponds to measuring
probabilities
To draw a density curve from a histogram simply connect the peaks of
a histogram with a smooth line, and normalize (i.e. adjust) the values of
the y-axis such that the area under the curve is 1.
0
2
4
6
0
2
n = 10
0
5
10
4
6
8
10
n = 30
15
20
10
n = 100
20
30
40
50
n = 300
This distribution is very important, so we pause in our binomial
discussion to explore it a little.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Introducting the Normal distribution
July 7, 2015
25 / 57
Normal distribution model
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Introducting the Normal distribution
Normal distribution
July 7, 2015
26 / 57
Normal distribution model
Heights of males
Denoted as N (µ, σ) → Normal with two parameters: mean µ and
standard deviation σ (or presented using variance σ2 )
Unimodal and symmetric, bell shaped curve, that also follows
well-known rules about how the data are distributed around µ
While many variables in the real world are distributed nearly
normal, virtually none are exactly normal
Arguably the most important distribution in the history of statistics
“The male heights on OkCupid very
nearly follow the expected normal
distribution – except the whole thing
is shifted to the right of where it
should be. Almost universally guys
like to add a couple inches.”
“You can also see a more subtle
vanity at work: starting at roughly 5’
8”, the top of the dotted curve tilts
even further rightward. This means
that guys as they get closer to six
feet round up a bit more than usual,
stretching for that coveted
psychological benchmark.”
http:// blog.okcupid.com/ index.php/ the-biggest-lies-in-online-dating/
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
27 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
28 / 57
Introducting the Normal distribution
Normal distribution model
Introducting the Normal distribution
Heights of females
68-95-99.7 Rule
68-95-99.7 Rule
68%
95%
99.7%
“When we looked into the data for
women, we were surprised to see
height exaggeration was just as
widespread, though without the
lurch towards a benchmark height.”
µ − 3σ µ − 2σ
µ−σ
µ
µ+σ
µ + 2σ µ + 3σ
For nearly normally distributed data,
about 68% falls within 1 SD of the mean,
about 95% falls within 2 SD of the mean,
about 99.7% falls within 3 SD of the mean.
It is possible for observations to fall 4, 5, or more standard
deviations away from the mean, but these occurrences are very
rare if the data are nearly normal.
http:// blog.okcupid.com/ index.php/ the-biggest-lies-in-online-dating/
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Introducting the Normal distribution
July 7, 2015
29 / 57
Statistics 101 (Gary Larson)
68-95-99.7 Rule
U2 - L2: Normal distribution
Introducting the Normal distribution
July 7, 2015
30 / 57
Standardizing with Z scores
SAT scores are distributed nearly normally with mean 1500 and standard deviation 300. ACT scores are distributed nearly normally with
mean 21 and standard deviation 5. A college admissions officer wants
to determine which of the two applicants scored better on their standardized test with respect to the other test takers: Pam, who earned
an 1800 on her SAT, or Jim, who scored a 24 on his ACT?
Describing variability using the 68-95-99.7 Rule
SAT scores are distributed nearly normally with mean 1500 and
standard deviation 300.
68%
Jim
Pam
95%
99.7%
600
900
1200
1500
1800
2100
2400
∼68% of students score between 1200 and 1800 on the SAT.
∼95% of students score between 900 and 2100 on the SAT.
600
900
1200 1500 1800 2100 2400
6
11
16
21
26
31
36
∼99.7% of students score between 600 and 2400 on the SAT.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
31 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
32 / 57
Introducting the Normal distribution
Standardizing with Z scores
Introducting the Normal distribution
Standardizing with Z scores
Standardizing with Z scores (cont.)
Since we cannot just compare these two raw scores, we instead
compare how many standard deviations beyond the mean each
observation is.
These are called standardized scores, or Z scores.
1800−1500
= 1 standard deviation above the mean.
300
24−21
= 0.6 standard deviations above the mean.
5
Pam’s score is
Jim’s score is
Standardizing with Z scores
Z score of an observation is the number of standard deviations it
falls above or below the mean.
Z scores
Z=
Jim Pam
observation − mean
SD
Z scores are defined for distributions of any shape, but
(BONUS!) when the distribution is normal, we can use Z scores
to calculate percentiles / areas under the normal distribution
density curve (i.e. probabilities!).
−2
−1
Statistics 101 (Gary Larson)
0
1
U2 - L2: Normal distribution
Introducting the Normal distribution
Observations that are more than 2 SD away from the mean
(|Z | > 2) are usually considered unusual.
2
July 7, 2015
33 / 57
Standardizing with Z scores
1200
34 / 57
Standardizing with Z scores
1500
1800
2100
Percentile is the percentage of observations that fall below a
given data point.
Graphically, percentile is the area below the probability
distribution curve to the left of that observation.
2400
600
Statistics 101 (Gary Larson)
July 7, 2015
Percentiles
Approximately what percent of students score below 1800 on the SAT?
The mean SAT score is 1500, with a standard deviation of 300
(Hint: Use the 68-95-99.7% rule.)
900
U2 - L2: Normal distribution
Introducting the Normal distribution
Approximating percentiles
600
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
35 / 57
Statistics 101 (Gary Larson)
900
1200
1500
1800
U2 - L2: Normal distribution
2100
2400
July 7, 2015
36 / 57
Introducting the Normal distribution
Standardizing with Z scores
Introducting the Normal distribution
Standardizing with Z scores
Z-Scores, N (µ, σ2 ), and N (0, 1)
Calculating percentiles - using computation
The standard normal distribution is defined as N (0, 1): the normal
distribution with µ = 0 and σ = 1.
There are many ways to compute percentiles/areas under the curve:
R:
Very useful theoretical result:
> pnorm(1800, mean = 1500, sd = 300)
[1] 0.8413447
If a random variable X ∼ N (µ, σ2 ), then the random variable
Z = (X − µ)/σ is distributed N (0, 1)
Applet: http:// www.socr.ucla.edu/ htmls/ SOCR Distributions.html
The above implies that if your original data is (approximately) normally
distributed, the Z scores are distributed (approximately) N (0, 1).
“A z-score puts values from any normal distribution N (µ, σ2 ) onto a
common scale” ... by “converting” the N (µ, σ2 ) values to values from
the standard normal distribution N (0, 1).
This means, if we had N (0, 1) percentiles, we could use them to
calculate percentiles for any N (µ, σ2 ) distribution.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Introducting the Normal distribution
July 7, 2015
37 / 57
Statistics 101 (Gary Larson)
Standardizing with Z scores
U2 - L2: Normal distribution
Introducting the Normal distribution
July 7, 2015
38 / 57
Standardizing with Z scores
Calculating percentiles - using tables
Participation question
Which of the following is false?
(a) Z scores are helpful for determining how unusual a data point is
compared to the rest of the data in the distribution.
(b) Majority of Z scores in a right skewed distribution are negative.
(c) Regardless of the shape of the distribution (symmetric vs.
skewed) the Z score of the mean is always 0.
(d) In a normal distribution, Q1 and Q3 are more than one SD away
from the mean.
Z
0.00
0.01
0.02
Second decimal place of Z
0.03 0.04
0.05 0.06
0.07
0.08
0.09
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0.5000
0.5040
0.5080
0.5120
0.5160
0.5199
0.5239
0.5279
0.5319
0.5359
0.5398
0.5438
0.5478
0.5517
0.5557
0.5596
0.5636
0.5675
0.5714
0.5753
0.5793
0.5832
0.5871
0.5910
0.5948
0.5987
0.6026
0.6064
0.6103
0.6141
0.6179
0.6217
0.6255
0.6293
0.6331
0.6368
0.6406
0.6443
0.6480
0.6517
0.6554
0.6591
0.6628
0.6664
0.6700
0.6736
0.6772
0.6808
0.6844
0.6879
0.6915
0.6950
0.6985
0.7019
0.7054
0.7088
0.7123
0.7157
0.7190
0.7224
0.7257
0.7291
0.7324
0.7357
0.7389
0.7422
0.7454
0.7486
0.7517
0.7549
0.7580
0.7611
0.7642
0.7673
0.7704
0.7734
0.7764
0.7794
0.7823
0.7852
0.7881
0.7910
0.7939
0.7967
0.7995
0.8023
0.8051
0.8078
0.8106
0.8133
0.8159
0.8186
0.8212
0.8238
0.8264
0.8289
0.8315
0.8340
0.8365
0.8389
1.0
1.1
1.2
0.8413
0.8438
0.8461
0.8485
0.8508
0.8531
0.8554
0.8577
0.8599
0.8621
0.8643
0.8665
0.8686
0.8708
0.8729
0.8749
0.8770
0.8790
0.8810
0.8830
0.8849
0.8869
0.8888
0.8907
0.8925
0.8944
0.8962
0.8980
0.8997
0.9015
Z-score = 1
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
39 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
40 / 57
Introducting the Normal distribution
Standardizing with Z scores
Introducting the Normal distribution
Standardizing with Z scores
Example
What percent of the standard normal distribution is above Z = 0.82?
Choose the closest answer.
(a) 79.4%
The average daily temperature in June in LA is 77 F, with a standard
deviation of 5 degrees. Suppose the temperatures in June closely
follow a normal distribution. What is the probability of observing a
temperature of at most 83 F on a randomly chosen day in June? )
(b) 20.6%
T ∼ N (mean = 77, sd = 5)
(c) 82%
(d) 18%
(e) Need to be provided the mean and the standard deviation of the
distribution in order to be able to solve this problem.
!
83 − 77
P (T ≤ 83) = P Z ≤
= P (Z ≤ 1.2) ≈ 0.885
5
The probability of observing a temperature of at most 83 F on a
randomly chosen day in June is approximately 0.885, or 88.5%.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Introducting the Normal distribution
July 7, 2015
41 / 57
Standardizing with Z scores
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
42 / 57
Normal approximation to the binomial
Example (cont)
The average daily temperature in June in LA is 77 F, with a standard
deviation of 5 degrees. Suppose the temperatures in June closely
follow a normal distribution. What is the probability of observing a
temperature of at least 83 F on a randomly chosen day in June? )
A recent study found that approximately 25% of all Facebook users
are power users. For a Facebook user with 245 friends, what is the
probability that he/she has 70 or more friends who are power users?
We are given that n = 245, p = 0.25, and we are asked for the
probability P (K ≥ 70).
T ∼ N (mean = 77, sd = 5)
P (K ≥ 70) = P (K = 70 or K = 71 or K = 72 or · · · or K = 245)
= P (K = 70) + P (K = 71) + P (K = 72) + · · · + P (K = 245)
P (T ≥ 83) = 1 − P (T ≤ 83) ≈ 1 − 0.885 = 0.115
This seems like an awful lot of work...
The probability of observing a temperature of at least 83 F on a
randomly chosen day in June is approximately 0.115, or 11.5%.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
43 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
44 / 57
Normal approximation to the binomial
Normal approximation to the binomial
Histograms of number of successes
Density Curves
Hollow histograms of samples from the binomial model where p =
0.10 and n = 10, 30, 100, and 300. What happens as n increases?
0
2
4
6
0
2
4
n = 10
0
5
10
8
10
n = 30
15
20
10
20
30
n = 100
Statistics 101 (Gary Larson)
6
A density curve is a smoothed histogram where the total area
under the curve is 1.
Measuring areas under a density curve corresponds to
measuring probabilities
To draw a density curve from a histogram simply connect the
peaks of a histogram with a smooth line, and normalize (i.e.
adjust) the values of the y-axis such that the area under the
curve is 1.
40
50
n = 300
U2 - L2: Normal distribution
July 7, 2015
45 / 57
Normal approximation to the binomial
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
46 / 57
Normal approximation to the binomial
Normal approximation to the binomial
How large is large enough?
When the sample size is large enough, the binomial distribution with
parameters n and p can be approximated
by the normal model with
p
parameters µ = np and σ = np (1 − p ).
In the case of the Facebook power users, n = 245 and p = 0.25.
µ = 245 × 0.25 = 61.25
σ=
√
To use the normal approximation instead of the binomial distribution,
the sample size must be large enough; n is large enough if the
expected number of successes (np) and failures (n(1 − p )) are both at
least 10.
np ≥ 10
and
n(1 − p ) ≥ 10
245 × 0.25 × 0.75 = 6.78
Bin(n = 245, p = 0.25) ≈ N (µ = 61.25, σ = 6.78).
0.06
Bin(245,0.25)
N(61.5,6.78)
0.05
0.04
0.03
0.02
0.01
0.00
20
40
60
80
100
k
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
47 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
48 / 57
Normal approximation to the binomial
Normal approximation to the binomial
Normal approximation to the binomial
When the sample size is large enough, the binomial distribution with
parameters n and p can be approximated
by the normal model with
p
parameters µ = np and σ = np (1 − p ).
Participation question
Below are the parametres for four different binomial distribution parameters. Which one can be approximated by the normal distribution?
In the case of the Facebook power users, n = 245 and p = 0.25.
(a) n = 100, p = 0.95
Bin(n = 245, p = 0.25) ≈ N (µ = 61.25, σ = 6.78).
µ = 245 × 0.25 = 61.25
(b) n = 25, p = 0.45
0.06
(c) n = 150, p = 0.05
0.05
(d) n = 500, p = 0.015
0.04
σ=
√
245 × 0.25 × 0.75 = 6.78
Bin(245,0.25)
N(61.5,6.78)
0.03
0.02
0.01
0.00
20
40
60
80
100
k
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
49 / 57
Statistics 101 (Gary Larson)
Normal approximation to the binomial
July 7, 2015
50 / 57
Normal approximation to the binomial
What is the probability that the average Facebook user with 245 friends
has 70 or more friends who would be considered power users?
(a) 0.0251
(c) 0.1128
(b) 0.0985
(d) 0.9015
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
U2 - L2: Normal distribution
July 7, 2015
51 / 57
What is the probability that the average Facebook user with 245 friends
has 70 or more friends who would be considered power users?
(a) 0.0251
(c) 0.1128
(b) 0.0985
(d) 0.9015
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
52 / 57
Application exercises
Finding probabilities // Quality control
Application exercises
At Heinz ketchup factory the amounts which go into bottles of ketchup are supposed
to be normally distributed with mean 36 oz. and standard deviation 0.11 oz. Once
every 30 minutes a bottle is selected from the production line, and its contents are
Finding cutoff points // Hot bodies
Body temperatures of healthy humans are distributed nearly normally with mean
98.2◦ F and standard deviation 0.73◦ F. What is the cutoff for the highest 10% of human
body temperatures?
noted precisely. If the amount of the bottle goes below 35.8 oz. or above 36.2 oz.,
then the bottle fails the quality control inspection. What percent of bottles pass the
Mackowiak, Wasserman, and Levine (1992), A Critical Appraisal of 98.6 Degrees F, the Upper Limit of the Normal Body
quality control inspection?
Temperature, and Other Legacies of Carl Reinhold August Wunderlick.
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
Application exercises
July 7, 2015
53 / 57
Statistics 101 (Gary Larson)
Conditional probability // SAT scores
U2 - L2: Normal distribution
Application exercises
July 7, 2015
54 / 57
Finding missing parameters // Auto insurance premiums
SAT scores (out of 2400) are distributed normally with mean 1500 and standard deviation 300. Suppose a school council awards a certificate of excellence to all students
who score at least 1900 on the SAT. What percent of the students who received this
certificate scored above 2100?
P (SAT > 2100 | SAT > 1900) =
=
P (SAT > 2100) =
=
Suppose a newspaper article states that the distribution of auto insurance premiums
for residents of California is approximately normal with a mean of $1,650. The article
also states that 25% of California residents pay more than $1,800.
P (SAT > 2100 and SAT > 1900)
P (SAT > 1900)
P (SAT > 2100)
P (SAT > 1900)
!
2100 − 1500
P
300
P (Z > 2) = 1 − 0.9772 = 0.0228
1. What is the standard deviation of this distribution?
2. What is the IQR of this distribution?
P (SAT > 1900) = P (Z > 1.33) = 1 − 0.9082 = 0.0918
0.0228
P (SAT > 2100 | SAT > 1900) =
≈ 0.25 → 25% of students
0.0918
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
55 / 57
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
56 / 57
To Do
To Do
PS 3 due tomorrow in class
Reading assignment, by Thursday:
Chapter 4 Sections 4.1 - 4.2.3 (A sampling distribution for the
mean)
Statistics 101 (Gary Larson)
U2 - L2: Normal distribution
July 7, 2015
57 / 57