Sampling and hypothesis testing
Eoin O’Malley
School of Law and Government,
DCU
Hypotheses
• A hypothesis is a proposal or a
candidate model
• In statistics, null and alternative
hypotheses are used
– H0 and HA
– e.g. H0: X = Y and HA: X ≠ Y
Normal distributions
Galton’s Bean Machine
Z Scores
• Remember that the normal distribution has
some qualities that allow us to say this
• We know that 95% of the data is within
1.96 SDs of the mean (population) etc.
• The 1.96 comes from the Z distribution
• So if we want to say with 95% confidence
that a sample is (un)likely to come from a
(known) population we can use Z scores
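Where the 1.96 comes from can be checked numerically. This is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for a printed z-table; the variable names are illustrative only.

```python
from statistics import NormalDist

# Standard normal (Z) distribution: mean 0, standard deviation 1.
z_dist = NormalDist(mu=0.0, sigma=1.0)

# A central 95% region leaves 2.5% in each tail,
# so the cut-off is the 97.5th percentile.
z_crit = z_dist.inv_cdf(0.975)

# Share of the distribution lying within +/- z_crit of the mean.
coverage = z_dist.cdf(z_crit) - z_dist.cdf(-z_crit)

print(f"critical z = {z_crit:.2f}")
print(f"coverage   = {coverage:.3f}")
```

Changing 0.975 to 0.995 gives the cut-off for 99% confidence in the same way.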
Z scores: a (hypothetical) example
• Suppose support for the largest party in
Ireland is 40.5%
• Say the mean for all west European
lead parties is 29% (N = 410 incl. Ire.),
std dev = 4.9
• Is Ireland unusual?
Z-score example (cont.)
The formula to work out a z-score is:
Z = (x – μ) / σ
Here, this is (40.5 – 29) / 4.9 = 11.5 / 4.9 ≈ 2.347
The z-table tells us that a proportion of about .009 will have a higher score
Is Ireland unusual?
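The z-score and the z-table lookup for the Irish example can be reproduced in a few lines. This is a sketch using the figures from the slide, again assuming the standard-library `NormalDist` in place of the printed table.

```python
from statistics import NormalDist

x = 40.5      # support for the largest party in Ireland (from the slide)
mu = 29.0     # mean for west European lead parties
sigma = 4.9   # standard deviation

# z-score: distance from the mean in standard-deviation units.
z = (x - mu) / sigma

# Proportion of the population expected to score higher than Ireland.
upper_tail = 1.0 - NormalDist().cdf(z)

print(f"z = {z:.3f}")
print(f"share with a higher score = {upper_tail:.4f}")
```

Because the upper tail is well below .05, Ireland would count as unusual at the 95% level.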
What if we want to compare means?
• We may have a sample from the population, say
ten Irish post-war top party support levels
– (This obviously isn’t a random sample, but what we
might want to test is whether it may as well be
random)
• Then we use the Standard Error of the Mean
• This bit is NB!!!
The standard deviation and standard error of the mean
Central Limit Theorem
• (Roughly stated) says that as sample sizes get
larger, the sampling distribution of the mean of a
variable X approaches a normal distribution with a
mean equal to the population mean, and a standard
deviation equal to the population standard
deviation divided by the square root of the
sample size
– S. Lynch 2000
Sampling distribution
• When we collect a sample, n=1 or n=1000,
and take the mean, we have one data point
in the sampling distribution of the mean
• When we do it many times we have the
distribution, which becomes normally
distributed as the sample size increases
Central Limit Theorem
[Figures: Sampling Distribution for n=1, n=2, n=10 and n=100; histograms of freq against mean (0 to 1)]
Central Limit Theorem
[Figure: Empirical and Theoretical Standard Deviations of Sampling Distributions for U(0,1) by Sample Size; S.D. against Sample Size (0 to 100), showing empirical s.d. and theoretical s.d.]
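The comparison of empirical and theoretical standard deviations for U(0,1) can be approximated by simulation. This is a rough sketch, not the code behind the slides: the number of draws (5000), the seed and the helper name `sample_means` are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_means(n, draws=5000):
    """Means of `draws` samples, each of size n, drawn from U(0,1)."""
    return [statistics.fmean([random.random() for _ in range(n)])
            for _ in range(draws)]

pop_sd = math.sqrt(1.0 / 12.0)  # standard deviation of U(0,1)

results = {}
for n in (1, 2, 10, 100):
    means = sample_means(n)
    empirical = statistics.stdev(means)      # spread of the simulated means
    theoretical = pop_sd / math.sqrt(n)      # sigma / sqrt(n)
    results[n] = (empirical, theoretical)
    print(f"n={n:>3}  empirical s.d.={empirical:.4f}  "
          f"theoretical s.d.={theoretical:.4f}")
```

The two columns should track each other closely, and both shrink as n grows, which is the standard error of the mean at work.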
Z-test
Z = (x̄ – μ) / (σ / √n)
• In our example:
Z = (38.4 – 29) / (4.9 / √10) ≈ 9.4 / 1.55 ≈ 6.07
• The z-tables give a p value < .001
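The same calculation in code, as a sketch using the slide's figures and the standard-library `NormalDist` in place of the z-table. Carrying the standard error at full precision gives a z of about 6.07.

```python
import math
from statistics import NormalDist

sample_mean = 38.4   # mean of the ten Irish observations (from the slide)
mu = 29.0            # population mean
sigma = 4.9          # population standard deviation
n = 10               # sample size

# Standard error of the mean: sigma / sqrt(n).
standard_error = sigma / math.sqrt(n)

# Z statistic for the sample mean.
z = (sample_mean - mu) / standard_error

# One-sided p value: chance of a sample mean this high if H0 is true.
p_upper = 1.0 - NormalDist().cdf(z)

print(f"standard error = {standard_error:.3f}")
print(f"z = {z:.2f}")
print(f"p < .001: {p_upper < 0.001}")
```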
What does this mean?
• The p value is a probability (α is the level we compare it against)
• What is the probability of having a sample
statistic of at least this magnitude if the null hypothesis is
true?
• So it is quite improbable that, if Ireland came
from the European population, you’d get
data like this by chance. Ireland is probably
different
• We say we reject the null hypothesis that Ireland
= Europe at the .05 level for α.
For samples
• Student's t-test: the t-statistic is worked
out by
t = (x̄ – μ0) / (sX / √n)
– We then look up the t distribution like
we did the Z
– However, there are many t-distributions, one for each number of degrees of freedom (n – 1)
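The one-sample t-statistic can be sketched as a small function. The function name `one_sample_t` and the data are hypothetical, invented purely to show the arithmetic; the slides do not list the individual observations.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for H0: population mean = mu0."""
    n = len(sample)
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)          # sample s.d. (n - 1 in the denominator)
    t = (x_bar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical support levels, NOT real data.
hypothetical_sample = [41.2, 38.5, 44.0, 36.9, 39.8, 42.3]
t_stat, df = one_sample_t(hypothetical_sample, mu0=29.0)
print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

The degrees of freedom returned alongside t tell you which t-distribution to look the statistic up in.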
Example (comparing two groups)
• Suppose we didn’t have time or resources
to collect all the data on first placed parties
• Instead we took a random sample (incl.
Ireland)
• The figures are
– Mean (Ire) = 40.5 (n=6), sd = 6.6
– Mean (rest) = 29.6 (n=70), sd = 9.3
Example
Here we are comparing the (unpaired)
difference of two means, so…
t = (x̄1 – x̄2) / √(sX1²/n1 + sX2²/n2)
Output in Stata
. ttest europe = ireland, unpaired

Two-sample t test with equal variances

Variable |  Obs       Mean   Std. Err.   Std. Dev.
---------+------------------------------------------
  europe |   70    29.5959    1.111507    9.299532
 ireland |    5     40.504    2.941091    6.576479

Ho: mean(europe) - mean(ireland) = diff = 0

  Ha: diff < 0          Ha: diff ~= 0          Ha: diff > 0
  t = -2.5693           t = -2.5693            t = -2.5693
  P < t = 0.0061        P > |t| = 0.012        P > t = 0.9939
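The t in the Stata output can be reproduced from the summary figures alone. This is a minimal sketch, assuming the pooled (equal-variances) standard error that the output header names, so the two sample variances are combined into one estimate rather than used separately. The function name `pooled_t` is illustrative and is not a Stata or library call.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    df = n1 + n2 - 2
    # Pooled variance: the two sample variances weighted by their df.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se_diff, df

# Means, standard deviations and Obs taken from the Stata output.
t_stat, df = pooled_t(29.5959, 9.299532, 70, 40.504, 6.576479, 5)
print(f"t = {t_stat:.4f}, df = {df}")
```

The negative sign simply reflects the order of subtraction (europe minus ireland).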
Types of errors
• Type I and Type II
• Type I is claiming a relationship that in fact
doesn’t exist (convict an innocent man)
• Type II is rejecting a relationship that
actually is the truth (release a guilty man)
• Type I is usually thought of as being worse
than Type II
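A simulation can make the Type I error rate concrete. In this sketch the null hypothesis is true by construction, so every rejection is a Type I error and the rejection rate should sit near α. The population figures are borrowed from the earlier example; the seed, trial count and normal population are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA, N = 29.0, 4.9, 10      # null-hypothesis population and sample size
ALPHA = 0.05
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)   # two-sided cut-off, about 1.96

trials = 4000
false_rejections = 0
for _ in range(trials):
    # Draw a sample from the null population itself.
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU) / (SIGMA / math.sqrt(N))
    if abs(z) > z_crit:
        false_rejections += 1     # rejected a true null: a Type I error

type_i_rate = false_rejections / trials
print(f"share of true nulls rejected: {type_i_rate:.3f}")
```

Lowering ALPHA reduces this rate, at the cost of making Type II errors more likely.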
In Stata
• Open nes2004.dta in Stata