Download Applied Data Analysis - University of Rochester

Applied Data Analysis
Spring 2017
Yemen, 2016
Karen Albert
[email protected]
Thursdays, 4-5 PM (Hark 302)
Lecture outline
1. Percentages
2. Central limit theorem
3. Inference
Let’s do some political science
McClatchy-Marist poll released yesterday: 17% said that they
trust President Trump to "deliver accurate and factual
information to the public."
• The percentage that is reported is a mean.
• More specifically, it is a mean of 1’s and 0’s.
• So the percentage is the expected value.
• We can then figure out the standard error.
The margin of error is twice the standard error. (Why twice?)
The poll numbers are therefore: expected value ± (2 × S.E.)
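One way to approach "Why twice?": under the normal curve, roughly 95% of the area lies within 2 standard errors of the expected value. The sketch below checks that area and then works out a margin of error for the 17% figure. The sample size `n` is an assumed value for illustration only; the slide does not report it.

```r
# Area under the normal curve within 2 standard errors of the expected value
within_two_se <- pnorm(2) - pnorm(-2)
within_two_se

# Margin of error for a poll percentage
p_hat <- 0.17   # reported share from the slide
n <- 1000       # ASSUMED sample size, not given in the slide
se <- sqrt(p_hat * (1 - p_hat) / n)
margin <- 2 * se
c(lower = p_hat - margin, upper = p_hat + margin)
```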
A short-cut to the SD for a special case
For boxes with only two kinds of tickets, there is a short-cut to
finding the standard deviation.
$$S = (\text{big} - \text{small}) \times \sqrt{\text{big fraction} \times \text{small fraction}}$$
Consider a box with the values: 1, 1, 1, 5.
The standard deviation is
$$S = (5 - 1) \times \sqrt{\frac{1}{4} \times \frac{3}{4}} \approx 1.73$$
The short-cut
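x <- c(1,1,1,5)
# sd() divides by n - 1; multiplying by sqrt((n-1)/n) = sqrt(3/4) gives the SD of the box
sd(x)*sqrt(3/4)
## [1] 1.732051
# the short-cut formula gives the same answer
(5-1)*sqrt(.25*.75)
## [1] 1.732051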
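The comparison above can be wrapped into two small helpers. This is only a sketch: the names `shortcut_sd` and `pop_sd` are made up for illustration and are not from the slides.

```r
# Short-cut SD for a box holding only two distinct values (hypothetical helper)
shortcut_sd <- function(box) {
  vals <- sort(unique(box))
  stopifnot(length(vals) == 2)
  frac_big <- mean(box == vals[2])   # fraction of tickets with the big value
  (vals[2] - vals[1]) * sqrt(frac_big * (1 - frac_big))
}

# SD of the box: root mean squared deviation, dividing by n (not n - 1)
pop_sd <- function(box) sqrt(mean((box - mean(box))^2))

box <- c(1, 1, 1, 5)
shortcut_sd(box)
pop_sd(box)
```

Both calls should agree for any box with exactly two kinds of tickets.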
Another example
If you bet a dollar on a single number at Nevada roulette, and
that number comes up, you get the $1 back together with
winnings of $35. If any other number comes up, you lose the
dollar. Suppose you play 100 times, betting a dollar on the
number 17 each time.
What is your chance of breaking even?
Hint: there are 38 spaces on a roulette wheel.
The box model and the expected value
The box has 38 tickets in it: 1 worth 35 and 37 worth -1.
The mean of the box is
35*(1/38)+(-1)*(37/38)
## [1] -0.05263158
The standard deviation
Since there are only two kinds of tickets in the box, we can use
the short-cut.
(35-(-1))*sqrt((1/38)*(37/38))
## [1] 5.762617
So the standard error is
5.76/sqrt(100)
## [1] 0.576
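The same numbers can be checked by building the box explicitly rather than using the short-cut. A minimal sketch, assuming the box described above (1 ticket worth 35 and 37 tickets worth -1); the variable names are illustrative.

```r
# The roulette box: one winning ticket and 37 losing tickets
box <- c(35, rep(-1, 37))

box_mean <- mean(box)                        # expected value per play
box_sd <- sqrt(mean((box - box_mean)^2))     # SD of the box (divide by n, not n - 1)
se_avg <- box_sd / sqrt(100)                 # SE for the average of 100 plays

c(mean = box_mean, sd = box_sd, se = se_avg)
```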
End of the example
What is the probability that you break even?
The mean (expected value) is -0.0526.
The standard deviation (standard error) is 0.576.
pnorm(0,-.0526,.576,lower.tail=FALSE)
## [1] 0.4636194
So the probability of breaking even or coming out ahead is roughly 46%.
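As a cross-check that is not part of the slides, the normal approximation can be compared with an exact binomial calculation. With a $1 bet, net winnings after 100 plays are 36 × (number of wins) − 100, which is positive only with 3 or more wins, and the number of wins follows a binomial distribution with 100 trials and success chance 1/38. The box is heavily skewed, so some gap between the two answers is to be expected.

```r
n_plays <- 100
p_win <- 1 / 38

# Net winnings are 36 * wins - 100, so coming out ahead requires at least 3 wins
exact <- pbinom(2, size = n_plays, prob = p_win, lower.tail = FALSE)

# Normal approximation from the slide
normal_approx <- pnorm(0, mean = -0.0526, sd = 0.576, lower.tail = FALSE)

c(exact = exact, normal_approx = normal_approx)
```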
SRS and chance error
Let’s consider the percentage of Democrats in a sample.
Assume that we took a simple random sample.
With an SRS, the expected value for the sample percentage
equals the population percentage.
It won’t be exact, of course. The amount it is off is governed by
chance, and we can use a box model to understand it.
How BIG?
% of 1’s in the sample = % of 1’s in the box + chance error
How big is the error likely to be?
The standard error tells us.
Calculation:
1. Find the standard deviation for the number of 1’s in the
sample.
2. Find the standard error.
3. Convert to a percentage.
Calculating the standard error (part one)
Consider a box with 3,091 1’s and 3,581 0’s. Assume that we
take 100 draws from the box.
The fraction of 1’s in the box is 0.46.
The standard deviation is therefore,
sqrt(.46*.54)
## [1] 0.4983974
Calculating the standard error (part deux)
Now that we have the SD, the standard error is,
sqrt(.46*.54/100)
## [1] 0.04983974
The percentage of 1’s in the sample is 46% give or take 5% or
so.
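A quick simulation can illustrate the "give or take 5%" claim. This sketch assumes draws with replacement (so the number of 1's in 100 draws is binomial); the seed and the number of repetitions are arbitrary choices, not from the slides.

```r
set.seed(1)
p_box <- 3091 / (3091 + 3581)   # fraction of 1's in the box

# Each repetition: 100 draws with replacement, then the fraction of 1's drawn
sample_fracs <- rbinom(10000, size = 100, prob = p_box) / 100

mean(sample_fracs)   # should be close to the box fraction
sd(sample_fracs)     # should be close to the standard error
```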
What if the sample size increases?
First, what happens to the standard error of the sum?
$$\sqrt{100} \times 0.5 = 5 \qquad \sqrt{400} \times 0.5 = 10$$
Now, what about the standard error for the percentage?
$$\frac{0.5}{\sqrt{100}} = 0.05 \qquad \frac{0.5}{\sqrt{400}} = 0.025$$
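The pattern is easy to tabulate: the standard error of the sum grows with the square root of the sample size, while the standard error of the percentage shrinks with it. A small sketch using the box SD of 0.5 from above; the third sample size (1600) is added here only to extend the pattern.

```r
box_sd <- 0.5
n <- c(100, 400, 1600)   # 1600 is an extra value added for illustration

se_sum <- sqrt(n) * box_sd   # SE of the sum of the draws
se_pct <- box_sd / sqrt(n)   # SE of the fraction of 1's

data.frame(n, se_sum, se_pct)
```

Quadrupling the sample size doubles the SE of the sum and halves the SE of the percentage.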
The normal curve yet again
Remember a box filled with 3,091 1’s and 3,581 0’s.
The percentage of 1’s in the sample is 46% give or take 5% or
so.
Can we estimate the chances of the percentage being greater
than 56%?
We did this with the normal curve before, but the data this time
are not normal....
The Central Limit Theorem
When drawing at random with replacement from a box, the
sampling distribution of the sample mean will follow the normal
curve, even if the contents of the box do not.
There are conditions:
1. the sampling distribution must be put into standard units,
2. the number of draws must be reasonably large.
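Applying the theorem to the earlier question about a sample percentage above 56%: convert 56% to standard units, then read the area from the normal curve. A minimal sketch, assuming 100 draws and the box fraction of 0.46 used earlier.

```r
p_box <- 0.46
se <- sqrt(p_box * (1 - p_box) / 100)   # standard error for the fraction of 1's

z <- (0.56 - p_box) / se                # 56% in standard units
chance <- pnorm(z, lower.tail = FALSE)  # area to the right under the normal curve

c(z = z, chance = chance)
```

Since 56% is about 2 standard errors above the expected value, the chance comes out at a little over 2%.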
Central Limit Theorem
• Consider a variable uniformly distributed between 0 and
10.
• Suppose that we take a random sample of size n from this
distribution and compute the mean.
• Next, we put the mean in standard units: subtract the expected
value and divide by the standard error (the SD divided by the
square root of n).
• Repeat this process 4000 times.
• What would the resulting sampling distribution look like?
[Figures: sampling distributions for sample size = 1, 5, 10, and 30]
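The experiment described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the original figures: the seed is arbitrary, the helper name `standardized_means` is made up, and the means are standardized by subtracting the expected value (5) and dividing by the standard error. Instead of plotting, it reports the share of results within 1 standard unit, which is about 68% under the normal curve.

```r
set.seed(1)
box_mean <- 5
box_sd <- 10 / sqrt(12)   # SD of a uniform distribution on [0, 10]
reps <- 4000

# Draw `reps` samples of size n and return the sample means in standard units
standardized_means <- function(n) {
  means <- replicate(reps, mean(runif(n, min = 0, max = 10)))
  (means - box_mean) / (box_sd / sqrt(n))
}

for (n in c(1, 5, 10, 30)) {
  z <- standardized_means(n)
  cat("n =", n, " share within 1 standard unit:", mean(abs(z) <= 1), "\n")
  # hist(z, breaks = 40, main = paste("Sample size =", n))  # optional plot
}
```

For n = 1 the share is noticeably below 68% because the uniform distribution is flat; as n grows it moves toward the normal value.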
Actual inference
OK, because I’m a nice guy, I keep telling you the contents of
the box.
What if I don’t tell you the contents of the box, but only the
draws from the box?
This is the essence of inference.
Using the draws to learn about the box is just like using a
sample to learn about a population.
Box to sample
If we don’t know the contents of the box, we have to substitute
the fractions observed in the sample for the unknown fractions
in the box.
Example
In a certain city, there are 100,000 persons age 18 to 24. A
simple random sample of 500 such persons is drawn, of whom
194 turn out to be currently enrolled in college. Estimate the
percentage of all persons age 18 to 24 in that city who are
currently enrolled in college.
The sample percentage is
$$\frac{194}{500} = 0.388$$
The standard deviation
The SD of the box, using the observed fractions, is
$$s = \sqrt{\text{\% of 1's} \times \text{\% of 0's}} = \sqrt{0.388(1 - 0.388)} = 0.487$$
The standard error
The standard error is therefore
$$\text{s.e.} = \frac{0.487}{\sqrt{500}} = 0.0218$$
The solution is therefore 38.8% give or take 2.18%.
Standard error for a percentage
$$\text{s.e.} = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
• p̂ is the observed fraction of “successes” in the sample.
• n is the number of draws (the size of the sample).
Interpretation: average distance of the p̂ values from the true,
but unknown, percentage.
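The formula translates directly into a small R function. The name `se_percentage` is hypothetical; the check uses the college enrollment example (194 of 500).

```r
# Standard error for a sample percentage (hypothetical helper)
se_percentage <- function(successes, n) {
  p_hat <- successes / n            # observed fraction of successes
  sqrt(p_hat * (1 - p_hat) / n)
}

se_percentage(194, 500)   # college enrollment example
```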
Another example
There are 4000 undergraduates at the University of Rochester.
A simple random sample of 400 students is drawn, of whom 20
turn out to like Nickelback. Estimate the percentage of all
students at UR that like Nickelback.
Solution
The sample percentage is p̂ = 20/400 = 0.05.
The standard error is
$$\text{s.e.} = \sqrt{\frac{0.05(1 - 0.05)}{400}} = 0.0109$$
The solution is therefore 5% give or take 1.09%.
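Combining this with the earlier rule that the margin of error is twice the standard error gives a rough range for the true percentage. A short sketch of that arithmetic; the variable names are illustrative.

```r
p_hat <- 20 / 400
se <- sqrt(p_hat * (1 - p_hat) / 400)

# Estimate plus or minus twice the standard error
c(estimate = p_hat, lower = p_hat - 2 * se, upper = p_hat + 2 * se)
```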
What did we learn?
• Percentages
• Central limit theorem
• Inference