Last lecture summary
[Figure: panels "Population 2015" and "Population 2014" showing individual values, with annotations "mean = 3.3" and "mean = 3.0"]
Data 2015
Population:
4,3,3,5,0,4,4,4,3,4,2,6,8,2,4,3,5,7,3,3
25 samples (n=3) and their averages
3.3, 5.3, 3.6, 4.3, 2.3, 3.0, 3.6, 3.0, 5.3, 5.6, 3.3, 4.3, 3.3, 4.0, 5.6, 4.3, 4.3, 4.6, 6.3, 3.3, 4.0, 3.3, 4.6, 3.0, 4.3
http://blue-lover.blog.cz/1106/lentilky
[Figures: histograms of sample means for the 2015 data]
• n = 3, number of samples = 25
• n = 3, number of samples = 50
• n = 3, number of samples = 300
• n = 3, all possible samples (1540)
• n = 5, all possible samples (42 504)
• n = 10, all possible samples (20 030 010)
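The sampling experiment above (draw a sample of size 3 from the 2015 population, average it, repeat) can be sketched in a few lines. This is only an illustration: the function name `sample_means` and the fixed random seed are assumptions made here, not part of the lecture, and the drawn samples will differ from the 25 listed on the slide.

```python
import random
import statistics

# Population values copied from the "Data 2015" slide.
population = [4, 3, 3, 5, 0, 4, 4, 4, 3, 4, 2, 6, 8, 2, 4, 3, 5, 7, 3, 3]


def sample_means(values, n, number_of_samples, seed=0):
    """Draw random samples of size n (without replacement) and return their means.

    The seed is a hypothetical default, used only to make runs repeatable.
    """
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(values, n)) for _ in range(number_of_samples)]


for count in (25, 50, 300):
    means = sample_means(population, n=3, number_of_samples=count)
    # Mean and spread of the sample means for this number of samples.
    print(count, round(statistics.fmean(means), 2), round(statistics.stdev(means), 2))
```

With more samples, the average of the sample means tends to settle near the mean of the 20 listed values (3.85).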
Central limit theorem
• The distribution of sample means is approximately normal.
• The distribution of means will increasingly approximate a normal
distribution as the sample size 𝑛 increases.
• Its mean 𝑀 is equal to the population mean.
$M = \mu_{\bar{x}} = \mu$
• Its standard deviation 𝑆𝐸 is equal to the population standard
deviation divided by the square root of 𝑛.
$SE = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
• 𝑆𝐸 is called the standard error.
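The two formulas can be checked numerically on the 2015 population. The sketch below assumes sampling with replacement (so that the draws are independent and the formula holds exactly) and enumerates every ordered sample of size 3; this is a simplification chosen here and differs from the "all possible samples" figures, which count distinct samples.

```python
import itertools
import math
import statistics

# Population values copied from the "Data 2015" slide.
population = [4, 3, 3, 5, 0, 4, 4, 4, 3, 4, 2, 6, 8, 2, 4, 3, 5, 7, 3, 3]
n = 3

mu = statistics.fmean(population)      # population mean
sigma = statistics.pstdev(population)  # population standard deviation

# Every ordered sample of size n drawn with replacement (20**3 = 8000 samples),
# each equally likely. This is an assumption made for the sake of an exact check.
means = [statistics.fmean(sample) for sample in itertools.product(population, repeat=n)]

M = statistics.fmean(means)    # mean of the sampling distribution
SE = statistics.pstdev(means)  # standard deviation of the sampling distribution

print("M  =", round(M, 4), " mu            =", round(mu, 4))
print("SE =", round(SE, 4), " sigma/sqrt(n) =", round(sigma / math.sqrt(n), 4))
```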
ESTIMATION, CONFIDENCE INTERVALS
Statistical inference
If we can’t conduct a census, we collect data from a sample of the
population.
Goal: draw conclusions about that population.
Demonstration
• You sample 36 apples from your farm’s harvest of over
200 000 apples. The mean weight of the sample is 112
grams (with a 40 gram sample standard deviation).
• What is the probability that the mean weight of all 200 000
apples is between 100 and 124 grams?
What is the question?
• We would like to know the probability that the population
mean is within 12 of the sample mean.
$P(\mu \text{ is within 12 of } \bar{x})$
• But this is the same thing as
$P(\bar{x} \text{ is within 12 of } \mu)$
• But this is the same thing as
$P(\bar{x} \text{ is within 12 of } M)$
• So, if I am able to say how many standard deviations
away from 𝑀 I am, I can use the Z-table to figure out the
probability.
Slight complication
• There is one caveat. Can you see it?
• We don’t know the standard deviation of the sampling
distribution (the standard error). We only know that it equals
$\sigma / \sqrt{36} = \sigma / 6$, but $\sigma$ is unknown.
• What we’re going to do is estimate $\sigma$. The best thing we
can do is to use the sample standard deviation $s = 40$.
• $SE = \frac{40}{6} \approx 6.67$. This is our best estimate of the
standard error.
• Now you finish the example. What is the probability that the
population mean lies within 12 of the sample mean if the SE
equals 6.67?
• 92.82%
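The last step can be reproduced without a Z-table. The sketch below is a minimal illustration: `normal_cdf` is a helper defined here (not something from the lecture), built on the standard identity between the normal CDF and the error function. It gives roughly 0.928; the 92.82% on the slide comes from a Z-table rounded to four decimals.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


sample_sd = 40.0   # sample standard deviation s, in grams
n = 36             # sample size
half_width = 12.0  # "within 12 grams of the sample mean"

standard_error = sample_sd / math.sqrt(n)  # estimated SE = s / sqrt(n)
z = half_width / standard_error            # how many standard errors 12 g is

# Probability of falling within z standard errors on either side of the mean.
probability = normal_cdf(z) - normal_cdf(-z)

print(round(standard_error, 2), round(z, 2), round(probability, 4))
```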
This is neat!
• You sample 36 apples from your farm’s harvest of over
200 000 apples. The mean weight of the sample is 112
grams (with a 40 gram sample standard deviation). What
is the probability that the population mean weight of all
200 000 apples is between 100 and 124 grams?
• We started with very little information (we know just the
sample statistics), but we can infer that,
with a probability of 92.82%, the population mean lies
within 12 of our sample mean!
Point vs. interval estimate
• You sample 36 apples from your farm’s harvest of over
200 000 apples. The mean weight of the sample is 112
grams (with a 40 gram sample standard deviation).
• Goal: estimate population mean
1. The population mean is estimated by the sample mean, i.e. we
say the population mean equals 112 g. This is called a
point estimate.
2. However, we can do better. We can estimate that the
true population mean lies, with 95% confidence,
within the following interval (interval estimate):
$\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}$
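As a worked instance, plugging the apple sample (mean 112 g, s = 40 g, n = 36) into the 95% interval gives the following; the numbers are rounded to two decimals.

```latex
\bar{x} \pm 1.96 \times \frac{s}{\sqrt{n}}
  = 112 \pm 1.96 \times \frac{40}{\sqrt{36}}
  \approx 112 \pm 13.07
  \quad\Rightarrow\quad (98.93,\ 125.07)
```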
Confidence interval
• This type of result is called a confidence interval.
$\bar{x} \pm Z \times \frac{s}{\sqrt{n}}$
Here $Z$ is the critical value and $Z \times \frac{s}{\sqrt{n}}$ is the
margin of error.
• The number of standard errors you want to
add/subtract depends on the confidence level (e.g. 95%).
Confidence level
• The desired level of confidence is set by the researcher,
not determined by data.
• If you want to be 95% confident with your results, you add/subtract
1.96 standard errors (empirical rule says about 2 standard errors).
• 95% confidence interval
Confidence level (%)    Z-value
80                      1.28
90                      1.64
95                      1.96
98                      2.33
99                      2.58
[Figure: confidence levels 80%, 90%, 95%, 99% with Z-values 1.28, 1.64, 1.96, 2.58]
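The Z-values in the table can be recomputed from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `z_critical` is an assumption made here for illustration.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided critical value: the central area under the standard normal
    curve between -z and +z equals confidence_level."""
    tail = (1.0 - confidence_level) / 2.0  # area left in each tail
    return NormalDist().inv_cdf(1.0 - tail)


for level in (0.80, 0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  {z_critical(level):.2f}")
```

Rounded to two decimals, the printed values should agree with the table.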
Small sample size confidence intervals
• The blood pressure of 7 patients was measured after they
had been given a new drug for 3 months. They had
blood pressure increases of 1.5, 2.9, 0.9, 3.9, 3.2, 2.1 and
1.9. Construct a 95% confidence interval for the true
expected blood pressure increase for all patients in the
population.
• We will assume that the population distribution is normal,
with mean 𝜇 and standard deviation 𝜎.
• We don’t know anything else about this distribution, but we
have a sample. Let’s figure out everything we can
about this sample:
• $\bar{x} = 2.34$, $s = 1.04$
• We estimate the true population standard deviation with the
sample standard deviation:
$\sigma \approx s = 1.04$
• However, we are estimating the standard deviation with 𝑛
of only seven! This is probably not going to be a very good
estimate.
• In general, if 𝒏 < 𝟑𝟎 this is considered a bad estimate.
William Sealy Gosset aka Student
• 1876–1937
• an employee of the Guinness brewery
• his 1908 papers addressed the brewer's concern with small
samples
• "The probable error of a mean". Biometrika 6 (1): 1–25. March 1908.
• "Probable error of a correlation coefficient". Biometrika 6 (2/3):
302–310. September 1908.
Student t-distribution
• Instead of assuming a sampling distribution is normal we
will use a Student t-distribution.
• It gives a better estimate of your confidence interval if you
have a small sample size.
• It looks very similar to a normal distribution, but it has
fatter tails to indicate the higher frequency of outliers
which come with a small data set.
Student t-distribution
df: degrees of freedom
df = 𝑛 − 1
Back to our case
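• $\bar{x} = 2.34$, $s = 1.04$, $n = 7$
• Because the sample size is small and $\sigma$ is estimated by $s$, the
standardized sample mean $(\bar{x} - \mu) / (s / \sqrt{n})$ won’t be
normally distributed. Instead, it will have a Student
t-distribution with df = 6.
• Construct a 95% confidence interval, please.
$2.34 \pm 2.447 \times \frac{1.04}{\sqrt{7}} = 2.34 \pm 0.962$
For $n < 30$: $\bar{x} \pm t_{n-1} \times \frac{s}{\sqrt{n}}$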
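The whole blood-pressure calculation can be sketched as follows. The critical value 2.447 is taken from the slide (t-table, df = 6, 95%) and hard-coded, since the Python standard library has no inverse t-distribution; the constant name is an assumption made here. Because the code keeps the unrounded s, the margin comes out near 0.964 rather than the 0.962 obtained with s rounded to 1.04.

```python
import math
import statistics

# Blood pressure increases of the 7 patients, from the example.
increases = [1.5, 2.9, 0.9, 3.9, 3.2, 2.1, 1.9]

# Critical value for df = 6 at 95% confidence, as used on the slide.
T_CRITICAL_DF6_95 = 2.447

n = len(increases)
mean = statistics.fmean(increases)  # sample mean
s = statistics.stdev(increases)     # sample standard deviation (n - 1 in the denominator)

margin = T_CRITICAL_DF6_95 * s / math.sqrt(n)  # margin of error
lower, upper = mean - margin, mean + margin

print(f"mean = {mean:.2f}, s = {s:.2f}, df = {n - 1}")
print(f"95% CI: {mean:.2f} +/- {margin:.3f} -> ({lower:.2f}, {upper:.2f})")
```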
Just to summarize, the margin of error
$\text{something} \times \frac{s}{\sqrt{n}}$
depends on
1. the confidence level (common is 95%)
2. the sample size 𝑛
• as the sample size increases, the margin of error decreases
• for a bigger sample we have a smaller interval in which
we’re pretty sure the true population mean lies
3. the variability of the data (i.e. on σ)
• more variability increases the margin of error
• Margin of error does not measure anything other
than chance variation.
• It doesn’t measure any bias or errors that happen
during the process.
• It does not tell anything about the correctness of your
data!!!
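The three dependencies can be seen numerically with the apple example (s = 40, n = 36). This is a sketch under the large-sample assumption that the critical value comes from the normal distribution; the function name `margin_of_error` is hypothetical.

```python
import math
from statistics import NormalDist


def margin_of_error(s, n, confidence_level=0.95):
    """Margin of error z * s / sqrt(n), using a normal critical value
    (an assumption that suits larger samples, roughly n >= 30)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence_level) / 2.0)
    return z * s / math.sqrt(n)


base = margin_of_error(s=40, n=36)                       # apple example at 95%
print("baseline:        ", round(base, 2))
print("higher confidence:", round(margin_of_error(40, 36, 0.99), 2))  # wider
print("4x the sample:    ", round(margin_of_error(40, 144), 2))       # half as wide
print("2x the variability:", round(margin_of_error(80, 36), 2))       # twice as wide
```

Note that quadrupling the sample size only halves the margin, because 𝑛 enters through its square root.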