Introduction to Statistics
Quantitative Data Analysis:
Statistics – Part 1
"... while man is an insoluble
puzzle, in the aggregate he
becomes a mathematical
certainty. You can, for
example, never foretell what
any one man will do, but you
can say with precision what
an average number will be
up to. Individuals vary, but
percentages remain
constant. So says the
statistician."
Overview
• Part 1
  • Picturing the Data
  • Pitfalls of Surveys
  • Averages
  • Variance and Standard Deviation
• Part 2
  • The Normal Distribution
  • Z-Tests
  • Confidence Intervals
  • T-Tests
~ THE GOLDEN RULE ~
Statistics NEVER replace
the judgment of the expert.
Approach to Statistical Research
1. Formulate a Hypothesis
2. State predictions of the hypothesis
3. Perform experiments or observations
4. Interpret experiments or observations
5. Evaluate results with respect to hypothesis
6. Refine hypothesis and start again
(Basically the same as all other research)
Hypothesis Testing
H0 : Null Hypothesis, status quo
HA : Alternative Hypothesis, research question
So, either :
"The data does not support H0"
or
"We fail to reject H0"
Types of Data

• Continuous
  • height, age, time
• Discrete
  • # of days worked this week, # of leaves on a tree
• Ordinal
  • {Good, O.K., Bad}
• Nominal
  • {Yes/No}, {Teacher/Chemist/Haberdasher}
Picturing The Data
Time-Series Plots
• Time-related data
• e.g. stock prices
Pie Charts
• Nominal/Ordinal
• Only suitable for data whose parts add up to a whole (proportions summing to 1)
• Hard to compare values in the chart
Bar Charts
• Nominal/Ordinal
• Easier to compare values than in a pie chart
• Suitable for a wider range of data
Histograms
• Continuous data
• Divide the data into ranges
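Dividing data into ranges can be sketched in a few lines of Python. The heights and the bin width of 10 below are hypothetical values chosen for illustration; they do not come from the slides.

```python
from collections import Counter

# Hypothetical heights in cm (illustrative values, not from the slides).
heights = [151, 158, 162, 163, 167, 168, 171, 174, 175, 183]
bin_width = 10  # assumed width of each range

# Map each value to the lower edge of its range, e.g. 163 -> 160.
counts = Counter((h // bin_width) * bin_width for h in heights)

# Print a rough text histogram, one row per range.
for lower in sorted(counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * counts[lower]}")
```

A different bin width gives a different picture of the same data, so it is worth trying more than one.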
Dot Plots
• Nominal/Ordinal
• Represents all the data
• Difficult to read
Scatter Plots
• Excellent for examining association between two variables
Box Plots
• Continuous data (optionally grouped by a nominal/ordinal variable)
• Q1, Q3: first quartile and third quartile; the interquartile range (IQR) is Q3 - Q1
• Outliers
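The quantities a box plot is built from can be computed with Python's standard `statistics` module. This is a minimal sketch: the sample values are borrowed from Question 2 later in these slides, and the 1.5 × IQR rule for flagging outliers is Tukey's usual convention, assumed here rather than stated on the slide.

```python
import statistics

# Sample values borrowed from Question 2 later in these slides.
data = sorted([36, 41, 43, 44, 46])

# method="inclusive" is one of several quartile conventions;
# other conventions give slightly different values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's 1.5 * IQR fences (an assumed convention, not stated on the slide).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)   # 41.0 43.0 44.0 3.0
print(outliers)          # [36]
```

With only five values the rule is very sensitive, which is why 36 is flagged here; flagged points still need the expert's judgment.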
John Tukey
• Born June 16, 1915, in New Bedford, Massachusetts
• Died July 26, 2000
• He introduced the box plot in his 1977 book, "Exploratory Data Analysis"
• Also known for the Cooley–Tukey FFT algorithm and jackknife estimation
John Tukey
[Photos: Paul Niquette, Claude Shannon, John von Neumann]
• While working with John von Neumann on early computer designs, Tukey introduced the word "bit" as a contraction of "binary digit". The term "bit" was first used in an article by Claude Shannon in 1948.
• The term "software", which Paul Niquette claims he coined in 1953, was first used in print by Tukey in a 1958 article in American Mathematical Monthly, and thus some people attribute the term to him.
Question 1
In a telephone survey of 68 households, when asked whether they have pets, the responses were:
• 16: No pets
• 28: Dogs
• 32: Cats
Draw the appropriate graphic to illustrate the results!
Question 1 - Solution
Total number surveyed = 68
Number with no pets = 16
=> Total with pets = (68 - 16) = 52
But 28 dogs + 32 cats = 60
=> So some households have both cats and dogs
How many? It must be (60 - 52) = 8 households
• No pets   = 16
• Dogs only = 20
• Cats only = 24
• Both      =  8
------------------------
Total = 68
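The counting argument can be checked with a short Python sketch. It assumes, as the solution does, that dogs and cats are the only pets reported; the variable names are illustrative.

```python
total_households = 68
no_pets = 16
dogs = 28   # households reporting a dog
cats = 32   # households reporting a cat

with_pets = total_households - no_pets
both = dogs + cats - with_pets      # households counted twice in dogs + cats
dogs_only = dogs - both
cats_only = cats - both

categories = {
    "No pets": no_pets,
    "Dogs only": dogs_only,
    "Cats only": cats_only,
    "Both": both,
}

# The four non-overlapping categories must account for every household.
assert sum(categories.values()) == total_households

# Rough text bar chart of the non-overlapping categories.
for label, count in categories.items():
    print(f"{label:<10}{count:>3} {'#' * count}")
```

Only once the categories no longer overlap do they add up to the whole, which is what a pie chart requires.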
Question 1 - Solution
Graphic: Pie Chart or Bar Chart
Bar Chart is easier to read
Pitfalls of Surveys
The Literary Digest Poll

• 1936 US Presidential Election
• Alf Landon (R) vs. Franklin D. Roosevelt (D)
The Literary Digest Poll



• Literary Digest had been conducting successful presidential election polls since 1916
• They had correctly predicted the outcomes of the 1916, 1920, 1924, 1928, and 1932 elections by conducting polls.
• These polls were a lucrative venture for the magazine: readers liked them; newspapers played them up; and each “ballot” included a subscription blank.
The Literary Digest Poll

• In 1936 they sent out 10 million ballots to two groups of people:
  • prospective subscribers, “who were chiefly upper- and middle-income people”
  • a list designed to "correct for bias" from the first list, consisting of names selected from telephone books and motor vehicle registries
The Literary Digest Poll


• Response rate: approximately 24% (2,376,523 responses out of 10 million ballots)
• Result: Landon in a landslide (predicted 57% of the vote for Landon, 40% for Roosevelt)
• Election result: Roosevelt received approximately 60% of the vote
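The figures on this slide can be checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
ballots_sent = 10_000_000
responses = 2_376_523

response_rate = responses / ballots_sent
print(f"Response rate: {response_rate:.1%}")   # Response rate: 23.8%

predicted_roosevelt = 40   # percent, Literary Digest prediction
actual_roosevelt = 60      # percent, approximate election result
error_points = actual_roosevelt - predicted_roosevelt
print(f"Error on Roosevelt's share: {error_points} percentage points")
```

More than two million responses still missed by about 20 percentage points, so sample size alone does not protect against bias.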
The Literary Digest Poll


POSSIBLE CAUSES OF ERROR
• Selection Bias: By taking names and addresses from telephone directories, the survey systematically excluded poor voters.
  • Republicans were markedly overrepresented
  • In 1936, Democrats were less likely to have phones, less likely to drive cars, and less likely to read the Literary Digest
  • The “sampling frame” is the actual population of individuals from which a sample is drawn. Selection bias results when the sampling frame is not representative of the population of interest
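The effect of an unrepresentative sampling frame can be illustrated with a small calculation. Every number in this sketch is hypothetical and chosen only to show the mechanism; none of it is data from the 1936 poll.

```python
# All numbers below are hypothetical, chosen only to illustrate the mechanism.
# Each group: (share of electorate, support for candidate A, chance of being in the frame)
groups = {
    "has phone or car": (0.40, 0.45, 0.90),
    "has neither":      (0.60, 0.70, 0.10),
}

# True support for candidate A in the whole population.
true_support = sum(share * support for share, support, _ in groups.values())

# Expected support among the people who end up in the sampling frame.
in_frame = sum(share * p for share, _, p in groups.values())
frame_support = sum(share * support * p for share, support, p in groups.values()) / in_frame

print(f"True support for A:    {true_support:.1%}")
print(f"Support seen in frame: {frame_support:.1%}")
```

Because the calculation uses expected values rather than a random draw, the gap it shows would not shrink however many ballots were sent to this frame.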
The Literary Digest Poll


POSSIBLE CAUSES OF ERROR
• Non-response Bias: Because only about 24% of the 10 million people returned surveys, non-respondents may have had different preferences from respondents
  • Indeed, respondents favored Landon
  • Greater response rates reduce the odds of biased samples
Definitions and Formula
Terminology




• Population: a set of entities concerning which statistical inferences are to be drawn.
• Sample: a number of independent observations from the same probability distribution.
• Parameter: a numerical characteristic of a population or distribution, such as its mean. The members of a family of probability distributions are distinguished from each other by the values of a finite number of parameters.
• Bias: a factor that causes some members of a population to be less represented in a statistical sample than others.
Outliers (and their treatment)
• An "outlier" is an observation that does not fit the pattern in the rest of the data
  • Check the data
  • Check with the measurer
  • If there is reason to believe it is NOT real, correct it if possible; otherwise leave it out (but note this).
  • If there is reason to believe it is real, leave it in and note it.
The Mean



• The Mean (Arithmetic)
• The mean is defined as the sum of all the elements, divided by the number of elements.
• The statistical mean of a set of observations is the average of the measurements in a set of data
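As a quick illustration, the definition translates directly into Python. The list used here is the one that appears on the mode and median slides; `statistics.mean` is the standard-library equivalent.

```python
import statistics

data = [1, 1, 2, 4, 4]

# Sum of all the elements divided by the number of elements.
mean_by_hand = sum(data) / len(data)
print(mean_by_hand)              # 2.4

# The standard library gives the same answer.
library_mean = statistics.mean(data)
print(library_mean)              # 2.4
```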
The Mode

• The mode is defined as the most frequently occurring element in a set of elements.
  • For example, [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] has a mode of 6.
  • Given the list of data [1, 1, 2, 4, 4], the mode is not unique: the dataset may be said to be bimodal, while a set with more than two modes may be described as multimodal.
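Both examples can be reproduced with Python's standard `statistics` module. This is a small sketch; `multimode` requires Python 3.8 or later.

```python
import statistics

single = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
bimodal = [1, 1, 2, 4, 4]

# One value occurs most often, so the mode is unique.
single_mode = statistics.mode(single)
print(single_mode)        # 6

# multimode returns every value that ties for most frequent.
all_modes = statistics.multimode(bimodal)
print(all_modes)          # [1, 4]
```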
The Median



• The median is defined as the middle element, or the value separating the higher half of a sample from the lower half.
• If there is an even number of elements, it is half the sum of the middle two elements.
• Given the list of data [1, 1, 2, 4, 4] the median is 2.
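The odd and even cases can be checked with `statistics.median`. The second list is the slide's example with its last element dropped, used here only to show the even-length rule.

```python
import statistics

odd_list = [1, 1, 2, 4, 4]
even_list = [1, 1, 2, 4]   # slide example with the last element dropped

# Odd number of elements: the middle element.
odd_median = statistics.median(odd_list)
print(odd_median)      # 2

# Even number of elements: half the sum of the middle two, (1 + 2) / 2.
even_median = statistics.median(even_list)
print(even_median)     # 1.5
```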
The Variance
But there can be a lot of variance in individual elements, e.g. teacher salaries
Average = €22,000
Lowest = €12,000
Difference = 12,000 - 22,000 = -10,000

The Variance
[Figure: data with significant variance]
[Figure: data with little variance]
[Figure: data with no variance]
The Variance

• The differences from the average always cancel out, Sum of (Sample - Average) = 0, so they cannot measure spread on their own; thus we need to define variance.
• The variance of a set of data is the sum of the squares of the differences of all the data values from the mean, divided by the sample size minus one.
Standard Deviation

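• The standard deviation of a set of data is the positive square root of the variance.
With sample values x_1, ..., x_n, mean \bar{x}, and sample size n:
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$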
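The definitions of variance and standard deviation can be followed step by step in Python. The sample below is hypothetical, chosen so that the mean is a whole number and the arithmetic is easy to follow.

```python
import math

# Hypothetical sample, chosen so the mean is a whole number.
data = [2, 4, 4, 6, 9]
n = len(data)
mean = sum(data) / n                      # 5.0

# Raw differences from the mean cancel out.
deviations = [x - mean for x in data]
print(sum(deviations))                    # 0.0

# Squaring removes the signs; divide by sample size minus one.
squared = [d ** 2 for d in deviations]
variance = sum(squared) / (n - 1)
std_dev = math.sqrt(variance)

print(variance)                           # 7.0
print(round(std_dev, 2))                  # 2.65
```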
Karl Pearson






• Born 27 March 1857 in Islington, London, England
• Died 27 April 1936
• Father of Mathematical Statistics
• Protégé of Francis Galton
• Inventor of the P-value, the Pearson correlation coefficient, Chi distance, the Method of moments, and Principal Component Analysis




• Karl Pearson introduced the term "standard deviation" in 1893, "although the idea was by then nearly a century old" (Abbott; Stigler, page 328).
• The term "standard deviation" was introduced in a lecture of 31 January 1893, as a convenient substitute for the cumbersome "root mean square error" and the older expressions "error of mean square" and "mean error."
• The term was first used in a publication in 1894 by Pearson in "Contributions to the Mathematical Theory of Evolution" (Philosophical Transactions of the Royal Society A, 185 (1894), 71-110).
http://jeff560.tripod.com/s.html
Question 2

• Find the mean and variance of the following sample values: 36, 41, 43, 44, 46
Question 2

Mean:
= (36 + 41 + 43 + 44 + 46) / 5
= 210 / 5
= 42
Question 2
Variance
Difference       Square
36 - 42 = -6     36
41 - 42 = -1      1
43 - 42 =  1      1
44 - 42 =  2      4
46 - 42 =  4     16
-----------------------
Sum of squares   58
Variance
= 58 / (5 - 1)
= 58 / 4
= 14.5
Standard Deviation
= SquareRoot(14.5)
≈ 3.8
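The hand calculation can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions also divide by the sample size minus one.

```python
import math
import statistics

sample = [36, 41, 43, 44, 46]

mean = statistics.mean(sample)
variance = statistics.variance(sample)   # sample variance, divides by n - 1
std_dev = statistics.stdev(sample)       # square root of the sample variance

print(mean)                 # 42
print(variance)             # 14.5
print(round(std_dev, 1))    # 3.8
```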
http://www.oerrecommender.org/visits/94142
http://mathforum.org/library/drmath/view/65410.html