Statistics 2 - Damian Gordon

Quantitative Data Analysis:
Statistics – Part 2
Overview

Part 1
Picturing the Data
Pitfalls of Surveys
Averages
Variance and Standard Deviation

Part 2
The Normal Distribution
Z-Tests
Confidence Intervals
T-Tests
The Normal Distribution

Imagine we asked 100 employees of an
organisation to rate their satisfaction with their job
on a scale of 1 to 10, and we plotted it:
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people]
The Normal Distribution

Is it likely we’d get an even distribution across all 10
scale points?
[Figure: chart of an even spread of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 10)]
The Normal Distribution
Not really!
The Normal Distribution

Let’s imagine it’s a crappy organisation and no one
likes working there, then we’d get this sort of
distribution:
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 40)]
The Normal Distribution

Or let’s imagine it’s a great organisation, then we’d
get this sort of distribution:
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 40)]
The Normal Distribution

But what if it’s a middling sort of organisation that
is just average to work for?
The Normal Distribution

Then we’ll get this:
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 40)]
The Normal Distribution

Which looks like this
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 40)]
The Normal Distribution

As does this:
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 40)]
The Normal Distribution

As does this:
[Figure: chart of responses. x-axis: Scale (1 to 10); y-axis: Number of people (tick at 40)]
The Normal Distribution


Abraham de Moivre, the 18th century statistician
and consultant to gamblers, was often called upon
to make lengthy computations about coin flips. De
Moivre noted that when the number of events (coin
flips) increased, the shape of the binomial
distribution approached a very smooth curve.
In 1809 Carl Gauss developed the formula for the
normal distribution and showed that many natural
phenomena are at least approximately normally
distributed.
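De Moivre's observation about coin flips can be checked numerically. The sketch below is only an illustration, not de Moivre's own method: the helper names `binomial_pmf` and `max_gap_to_normal` are made up for this example, and the matching normal curve is assumed to have mean n*p and standard deviation SqRt(n*p*(1-p)).

```python
import math
from statistics import NormalDist


def binomial_pmf(n: int, k: int, p: float = 0.5) -> float:
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def max_gap_to_normal(n: int, p: float = 0.5) -> float:
    """Largest absolute gap between the binomial probabilities and the
    density of the normal curve with the same mean and standard deviation."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    normal = NormalDist(mean, sd)
    return max(abs(binomial_pmf(n, k, p) - normal.pdf(k)) for k in range(n + 1))


# The gap shrinks as the number of flips grows.
for flips in (10, 100, 1000):
    print(flips, max_gap_to_normal(flips))
```

The printed gap gets smaller as the number of flips increases, which is the smoothing effect described above.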
Abraham de Moivre





Born 26 May 1667
Died 27 November 1754
Born in Champagne, France
Wrote a textbook on
probability theory, "The
Doctrine of Chances: a
method of calculating the
probabilities of events in
play". This book came out in
four editions, 1711 in Latin,
and 1718, 1738 and 1756 in
English.
In the later editions of his
book, de Moivre gives the
first statement of the formula
for the normal distribution
curve.
Carl Friedrich Gauss




Born 30 April 1777
Died 23 February 1855
Born in Lower Saxony, Germany
In 1809 Gauss published the
monograph “Theoria motus
corporum coelestium in
sectionibus conicis solem
ambientium” where among other
things he introduces and
describes several important
statistical concepts, such as the
method of least squares, the
method of maximum likelihood,
and the normal distribution.
The Normal Distribution







Age of students in a class
Body temperature
Pulse rate
Shoe size
IQ score
Diameter of trees
Height?
Density Curves: Properties
The Normal Distribution




The graph has a single peak at the
center, this peak occurs at the mean
The graph is symmetrical about the
mean
The graph never touches the
horizontal axis
The area under the graph is equal to 1
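The four properties can be checked numerically for any one normal curve. This is a rough sketch using Python's standard `statistics.NormalDist`; the mean of 5.5 and standard deviation of 1.5 are hypothetical values chosen to sit on the 1 to 10 satisfaction scale, and the area is approximated by a simple sum of thin rectangles.

```python
import math
from statistics import NormalDist

# Hypothetical curve: mean 5.5, standard deviation 1.5 (assumed values).
MEAN, SD = 5.5, 1.5
curve = NormalDist(mu=MEAN, sigma=SD)

# 1. Single peak at the mean: find the grid point with the highest density.
grid = [i / 100 for i in range(0, 1101)]  # 0.00, 0.01, ..., 11.00
peak_x = max(grid, key=curve.pdf)

# 2. Symmetry about the mean: equal heights at equal distances either side.
left_height = curve.pdf(MEAN - 2.0)
right_height = curve.pdf(MEAN + 2.0)

# 3. Never touches the horizontal axis: still positive 10 SDs from the mean.
far_height = curve.pdf(MEAN + 10 * SD)

# 4. Area under the graph: sum thin rectangles across mean +/- 8 SDs.
step = 0.001
start = MEAN - 8 * SD
area = sum(curve.pdf(start + i * step) * step for i in range(int(16 * SD / step)))

print(peak_x, math.isclose(left_height, right_height), far_height > 0, round(area, 6))
```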
Characterization



A normal distribution is bell-shaped and symmetric.
The distribution is determined by the mean mu, μ,
and the standard deviation sigma, σ.
The mean mu controls the center and sigma controls
the spread.
[Figure: three panels of normal curves, each on a scale of 1 to 10. Panel titles: Same Mean, Different Standard Deviation; Different Mean, Different Standard Deviation; Different Mean, Same Standard Deviation]
The Normal Distribution

If a variable is normally distributed,
then:



within one standard deviation of the mean there
will be approximately 68% of the data
within two standard deviations of the mean there
will be approximately 95% of the data
within three standard deviations of the mean
there will be approximately 99.7% of the data
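These three percentages can be reproduced from the standard normal curve. A minimal sketch, assuming Python's standard `statistics.NormalDist`; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1


def share_within(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

Because the rule is stated in units of standard deviations, the same fractions hold for any mean and standard deviation.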
The Normal Distribution
Why?

One reason the normal distribution is
important is that many psychological and
organisational variables are distributed
approximately normally. Measures of
reading ability, introversion, job satisfaction,
and memory are among the many
psychological variables approximately
normally distributed. Although the
distributions are only approximately normal,
they are usually quite close.
Why?

A second reason the normal distribution is
so important is that it is easy for
mathematical statisticians to work with. This
means that many kinds of statistical tests
can be derived for normal distributions.
Almost all statistical tests discussed in this
text assume normal distributions.
Fortunately, these tests work very well even
if the distribution is only approximately
normally distributed. Some tests work well
even with very wide deviations from
normality.
So what?

Imagine we undertook an experiment
where we measured staff productivity
before and after we introduced a
computer system to help record
solutions to common issues of work


Average productivity before = 6.4
Average productivity after = 9.2
So what?
[Figure: scale from 0 to 10 with Before = 6.4 and After = 9.2 marked, repeated over several frames; the final frame adds σ (standard deviation) markers around the two averages]
One Tail / Two Tail

One-Tailed
H0 : μ1 >= μ2
HA : μ1 < μ2

Two-Tailed
H0 : μ1 = μ2
HA : μ1 ≠ μ2
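The practical difference between the two forms is which tail(s) of the curve count as evidence against H0. The sketch below is an assumption-laden illustration rather than a full test: it supposes a z statistic has already been computed as (mean 1 - mean 2) / standard error, and the function name `p_values` and the value z = -1.8 are hypothetical.

```python
from statistics import NormalDist


def p_values(z: float) -> dict:
    """Tail areas of the standard normal curve for a given z statistic."""
    standard = NormalDist()
    return {
        # One-tailed, HA: mean 1 < mean 2, so only the lower tail counts.
        "one_tailed": standard.cdf(z),
        # Two-tailed, HA: mean 1 differs from mean 2, so both tails count.
        "two_tailed": 2 * (1 - standard.cdf(abs(z))),
    }


result = p_values(-1.8)  # hypothetical z statistic
print(result)
```

For a negative z the two-tailed area is twice the one-tailed area, so a result can pass a one-tailed test and still fail the two-tailed one.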
STANDARD NORMAL
DISTRIBUTION




Normal Distribution is defined as
N(mean, (Std dev)^2)
Standard Normal Distribution is defined as
N(0, (1)^2)
STANDARD NORMAL
DISTRIBUTION

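Using the following formula:

Z = (X - mean) / (Std dev)

will convert a value from any normal distribution
into a value on the standard normal distribution,
so one standard normal table can be used for all of them.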
Exercise

If the average IQ in a given population
is 100, and the standard deviation is
15, what percentage of the population
has an IQ of 145 or higher ?
Answer

P(X >= 145)
P(Z >= ((145 - 100)/15))
P(Z >= 3)
From tables: 99.87% are less than 3

=> 0.13% of population
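The table lookup can be cross-checked in a few lines. A minimal sketch using Python's standard `statistics.NormalDist`; the variable names are chosen for this example only.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

# Standardise: how many standard deviations is 145 above the mean?
z = (145 - 100) / 15

# Area to the right of 145 under the IQ curve.
share_above = 1 - iq.cdf(145)

print(f"z = {z}, P(X >= 145) = {share_above:.2%}")
```

The computed share agrees with the 0.13% read from the tables.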



Trends in Statistical Tests used
in Research Papers
Historically
Results in:
Accept/Reject
Currently
Results in:
p-Value
Results in:
Approx. Mean
Confidence Intervals

A confidence interval is used to express the
uncertainty in a quantity being estimated.
There is uncertainty because inferences are
based on a random sample of finite size
from a population or process of interest. To
judge the statistical procedure we can ask
what would happen if we were to repeat the
same study, over and over, getting different
data (and thus different confidence
intervals) each time.
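The "repeat the same study over and over" idea can be simulated. The sketch below makes several assumptions that are not in the slides: a population with true mean 100 and standard deviation 15, samples of 50, a known population standard deviation, and a fixed random seed so the run is repeatable. The function name `coverage` is made up for this example.

```python
import random
from statistics import NormalDist


def coverage(true_mean=100.0, sd=15.0, n=50, studies=2000, seed=1) -> float:
    """Fraction of repeated studies whose 95% interval contains the true mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
    se = sd / n**0.5                 # standard error, population sd assumed known
    hits = 0
    for _ in range(studies):
        sample_mean = sum(rng.gauss(true_mean, sd) for _ in range(n)) / n
        if sample_mean - z * se <= true_mean <= sample_mean + z * se:
            hits += 1
    return hits / studies


print(f"intervals containing the true mean: {coverage():.1%}")
```

Each simulated study produces a different interval, and roughly 95% of them contain the true mean, which is what the 95% in "95% confidence interval" refers to.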
Confidence Intervals
Jerzy Neyman





Born April 16, 1894
Died August 5, 1981
Born in Bessarabia, Imperial Russia
Statistician who spent most of his professional
career at the University of California, Berkeley.
Developed modern scientific sampling (random samples)
in 1934, the Neyman-Pearson lemma in 1933
and the confidence interval in 1937.
Egon Pearson






Born 11 August 1895
Died 12 June 1980
Born in Hampstead, London
Son of Karl Pearson
Leading British statistician
Developed the Neyman-Pearson lemma in 1933.






Neyman and Pearson's joint work formally started in
the spring of 1927.
From 1928 to 1934, they published several important
papers on the theory of testing statistical
hypotheses.
In developing their theory, Neyman and Pearson
recognized the need to include alternative
hypotheses, and they identified the errors that can
arise when testing hypotheses about unknown
population values using sample observations that
are subject to variation.
They called the error of rejecting a true hypothesis
the first kind of error and the error of accepting a
false hypothesis the second kind of error.
They called a hypothesis that completely specifies a
probability distribution a simple hypothesis. A
hypothesis that is not a simple hypothesis is a
composite hypothesis.
Their joint work led to Neyman developing the idea
of confidence interval estimation, published in 1937.
Confidence Intervals

Neyman, J. (1937)
"Outline of a theory of
statistical estimation
based on the classical
theory of probability"
Philos. Trans. Roy. Soc.
London. Ser. A. , Vol.
236 pp. 333–380.
Confidence Intervals

If we know the true population mean and
sample n individuals, we know that if the
data is normally distributed, the average
(mean) of these n samples has a 95%
chance of falling into the interval:

population mean +/- 1.96 * (standard error)
Confidence Intervals

where the standard error for a 95% CI
may be calculated as follows:
For a proportion (Example 1): SE = SqRt(p(1-p)/n)
For a mean (and others): SE = (Std dev)/SqRt(n)
Example 1

Do the Opposition parties have more of the
popular vote than the Government ?

In a random sample of 721 respondents :



382 Opposition Parties
339 Government
Can we conclude that Opposition parties
have more than 50% of the popular vote?
Example 1 - Solution

Sample proportion = p = 382/721 = 0.53
Sample size = n = 721
Standard Error = (SqRt((p(1-p)/n))) = 0.02

95% Confidence Interval






0.53 +/- 1.96 (0.02)
0.53 +/- 0.04
[0.49, 0.57]
Thus, we cannot conclude that the Opposition Parties
have more of the popular vote, since this interval spans
50%. So, we say: "the data are consistent with the
hypothesis that there is no difference"
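The same calculation can be done without rounding the intermediate values. A minimal sketch; the function name `proportion_ci` is made up for this example, and 1.96 is the usual table value for a 95% interval.

```python
import math


def proportion_ci(successes: int, n: int, z: float = 1.96):
    """Confidence interval for a sample proportion: p +/- z * SqRt(p(1-p)/n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


low, high = proportion_ci(382, 721)
verdict = "spans 50%" if low <= 0.5 <= high else "does not span 50%"
print(f"[{low:.3f}, {high:.3f}] {verdict}")
```

The unrounded standard error is a little under 0.02, so the interval comes out slightly narrower than the hand calculation, but it still spans 50% and the conclusion is unchanged.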
Example 2

Does Obama have more of the popular
vote than Romney ?

In a random sample of 1000 respondents



532 Obama
468 Romney
Can we conclude that Obama has more
than 50% of the popular vote ?
Example 2 – 95% CI

Sample proportion = p = 532/1000 = 0.532
Sample size = n = 1000
Standard Error = (SqRt((p(1-p)/n))) = 0.016

95% Confidence Interval






0.532 +/- 1.96 (0.016)
0.532 +/- 0.03136
[0.5006, 0.56336]
Thus, we can conclude that Obama has more of the
popular vote, since this interval does not span 50%.
So, we say : "the data are consistent with the
hypothesis that there is a difference in a 95% CI"
Example 2 – 99% CI

Sample proportion = p = 532/1000 = 0.532
Sample size = n = 1000
Standard Error = (SqRt((p(1-p)/n))) = 0.016

99% Confidence Interval






0.532 +/- 2.58 (0.016)
0.532 +/- 0.041
[0.491, 0.573]
Thus, we cannot conclude that Obama has more of the
popular vote, since this interval does span 50%. So,
we say : "the data are consistent with the
hypothesis that there is no difference in a 99% CI"
Example 2 – 99.99% CI

Sample proportion = p = 532/1000 = 0.532
Sample size = n = 1000
Standard Error = (SqRt((p(1-p)/n))) = 0.016

99.99% Confidence Interval






0.532 +/- 3.89 (0.016)
0.532 +/- 0.06
[0.472, 0.592]
Thus, we cannot conclude that Obama has more of the
popular vote, since this interval does span 50%. So, we
say : "the data are consistent with the hypothesis
that there is no difference in a 99.99% CI"
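The three versions of Example 2 differ only in the multiplier taken from the normal table. The sketch below derives each multiplier from the confidence level instead of looking it up; it assumes Python's standard `statistics.NormalDist`, and the helper name `z_for` is made up for this example.

```python
import math
from statistics import NormalDist


def z_for(confidence: float) -> float:
    """Two-sided multiplier for a given confidence level (0.95 gives about 1.96)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


n = 1000
p = 532 / n
se = math.sqrt(p * (1 - p) / n)

intervals = {}
for confidence in (0.95, 0.99, 0.9999):
    z = z_for(confidence)
    intervals[confidence] = (p - z * se, p + z * se)
    low, high = intervals[confidence]
    verdict = "spans 50%" if low <= 0.5 <= high else "does not span 50%"
    print(f"{confidence:.2%} CI: z = {z:.2f}, [{low:.3f}, {high:.3f}] {verdict}")
```

Asking for more confidence widens the interval, so the same poll result that excludes 50% at the 95% level no longer excludes it at the 99% or 99.99% level.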
T-Tests
William Sealy Gosset





Born June 13, 1876
Died October 16, 1937
Born in Canterbury, England
On graduating from Oxford in 1899, he joined the
Dublin brewery of Arthur Guinness & Son.
Published significant paper in 1908 concerning
the t-distribution





Gosset acquired his statistical knowledge by
study, and he also spent two terms in 1906–1907
in the biometric laboratory of Karl Pearson.
Gosset applied his knowledge for Guinness both
in the brewery and on the farm: to select
the best yielding varieties of barley, and to
compare the different brewing processes for
changing raw materials into beer.
Gosset and Pearson had a good relationship and
Pearson helped Gosset with the mathematics of
his papers.
Pearson helped with the 1908 paper but he had
little appreciation of its importance.
The papers addressed the brewer's concern with
small samples, while the biometrician typically
had hundreds of observations and saw no
urgency in developing small-sample methods.
T-Tests

Student (1908),
“The Probable
Error of a Mean”
Biometrika, Vol. 6,
No. 1, pp.1-25.
T-Tests

Guinness did not allow its employees to publish results,
but the management decided to allow Gosset to publish
under a pseudonym: Student. Hence we have the
Student's t-test.
T-Tests



powerful parametric
test for calculating the
significance of a small
sample mean
necessary for small
samples because their
distributions are not
normal
one first has to
calculate the "degrees
of freedom"
~ THE GOLDEN RULE ~
Use the t-Test when your
sample size is less than 30
T-Tests







If the underlying population is normal
If the underlying population is not skewed
and reasonably close to normal
(n < 15)
If the underlying population is skewed and
there are no major outliers
(n > 15)
If the underlying population is skewed and
some outliers
(n > 24)
T-Tests

Form of Confidence Interval with t-Value
Mean +/- tValue * SE
(Mean and SE are calculated as before)
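A small-sample interval of this form might be computed as follows. This is a sketch under stated assumptions: the ten scores are hypothetical data, the name `t_confidence_interval` is made up for this example, and the t-values are the usual two-sided 95% entries from a printed t table, keyed by degrees of freedom (sample size minus 1), since Python's standard library has no t-distribution.

```python
import math
from statistics import mean, stdev

# Two-sided 95% t-values from standard tables, keyed by degrees of freedom.
T_95 = {4: 2.776, 9: 2.262, 14: 2.145, 19: 2.093, 29: 2.045}


def t_confidence_interval(sample):
    """95% confidence interval: Mean +/- tValue * SE, for a small sample."""
    n = len(sample)
    degrees_of_freedom = n - 1
    t_value = T_95[degrees_of_freedom]  # only the table sizes above are supported
    se = stdev(sample) / math.sqrt(n)   # standard error of the mean
    centre = mean(sample)
    return centre - t_value * se, centre + t_value * se


scores = [6, 7, 5, 8, 6, 7, 9, 5, 6, 7]  # hypothetical ratings from 10 people
low, high = t_confidence_interval(scores)
print(f"[{low:.2f}, {high:.2f}]")
```

With 9 degrees of freedom the multiplier is 2.262 rather than 1.96, so the interval is wider than the normal-based one would be for the same data.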
Two Sample T-Test:
Unpaired Sample

Consider a questionnaire on computer use
given to final-year undergraduates in 2007
and the same questionnaire given to
undergraduates in 2008. As there is no
direct one-to-one correspondence between
individual students (in fact, there may be
different numbers of students in different
classes), you have to sum up all the
responses of a given year, obtain an
average from that, do the same for the
following year, and compare averages.
Two Sample T-Test:
Paired Sample

If you are doing a questionnaire that is
testing the BEFORE/AFTER effect of a
parameter on the same population,
then we can individually calculate the
difference between each pair of samples and
then average the differences. The
paired test is a much stronger (more
powerful) statistical test.
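The difference between the two approaches shows up clearly when both are run on the same numbers. The sketch below uses hypothetical before/after scores for six people; the function names are made up for this example, and the unpaired version uses Welch's form of the t statistic (one common choice that allows different group sizes), which is an assumption and not something the slides specify.

```python
import math
from statistics import mean, stdev


def unpaired_t(group_a, group_b):
    """t statistic comparing the averages of two independent groups (Welch's form)."""
    se = math.sqrt(stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b))
    return (mean(group_b) - mean(group_a)) / se


def paired_t(before, after):
    """t statistic from the per-person differences of paired measurements."""
    differences = [y - x for x, y in zip(before, after)]
    se = stdev(differences) / math.sqrt(len(differences))
    return mean(differences) / se


before = [5, 7, 6, 8, 4, 6]  # hypothetical scores, same six people
after = [6, 8, 8, 9, 5, 7]

print(f"unpaired t = {unpaired_t(before, after):.2f}")
print(f"paired t   = {paired_t(before, after):.2f}")
```

For this data the unpaired statistic is about 1.4 while the paired one is about 7.0: pairing removes the person-to-person variation, which is why the paired test is the more powerful of the two when the same individuals are measured twice.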
Choosing the right test
Choosing a statistical test
http://www.graphpad.com/www/Book/Choose.htm