The eternal tension in
statistics...
Between what you really
really want (the population)
but can never get to...
So you have to make do (with
the sample)
You can estimate the population,
make educated guesses,
but the bottom line is “you can
never have the population”
An investigator usually wants to
generalize about a class of
individuals/things (the population)
For example:
in forecasting the results of elections,
population = voters
for the Consumer Attitudes Survey:
Population = all potential users of Cell
Phones
• Parameters: Usually there are some
numerical facts about the population
which you want to estimate
• Statistic: You can do that by measuring
the same aspect in the sample
(Descriptive Statistics)
• Depending on the accuracy of
measurement, and representativeness
of your sample, you can make
inferences about the population
(Inferential Statistics)
• One person’s sample is another
person’s population
– IS 271 students are a sample of the
larger student population of UC
Berkeley
– IS 271 students could be the population
for some other study
Understanding Populations and
Samples with brown M&M’s
Yellow 20%
Brown 30%
Orange 10%
Blue 10%
Red 20%
Green 10%
Original Distribution: the distribution of the population
[Figure: histograms of the population and of Samples 1, 2 and 3, followed by a panel of 5 samples]
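The population-versus-sample idea above can be tried out in a few lines. This is only a sketch: the colour shares come from the slide, while the sample size of 50, the seed, and the function name draw_sample are assumptions made for illustration.

```python
import random
from collections import Counter

# Colour shares of the population, taken from the slide.
POPULATION_SHARES = {
    "Yellow": 0.20, "Brown": 0.30, "Orange": 0.10,
    "Blue": 0.10, "Red": 0.20, "Green": 0.10,
}

def draw_sample(n, rng):
    """Draw n candies at random and return each colour's share of the sample."""
    colours = list(POPULATION_SHARES)
    weights = list(POPULATION_SHARES.values())
    counts = Counter(rng.choices(colours, weights=weights, k=n))
    return {colour: counts[colour] / n for colour in colours}

rng = random.Random(271)  # hypothetical seed, only so the run is repeatable
for i in range(1, 4):
    sample = draw_sample(50, rng)  # hypothetical sample size
    print(f"Sample {i}: brown = {sample['Brown']:.0%} (population: 30%)")
```

Each sample gives a slightly different share of brown candies. None of them is the population, but each is an estimate of it.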
It is a remarkable fact that many histograms in
real life tend to follow the Normal Curve.
For such histograms, the mean and SD are
good summary statistics.
The average pins down the center, while the SD
gives the spread.
For histograms that do not follow the normal
curve, the mean and SD are not good
summary statistics.
What happens when the histogram is not normal ...
Properties of the Normal
Probability Curve
• The graph is symmetric about the mean
(the part to the right is a mirror image of
the part to the left)
• The total area under the curve equals
100%
• Curve is always above horizontal axis
• Appears to stop after a certain point (the
curve gets really low)
• Mean ± 1 SD covers about 68% of the area
• Mean ± 2 SD covers about 95%
• Mean ± 3 SD covers about 99.7%
• You can disregard the rest of the curve
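The 68 / 95 / 99.7 figures can be checked directly. The sketch below uses the standard relation between the normal curve and the error function; the function name area_within is an assumption made for illustration.

```python
import math

def area_within(k):
    """Share of the area under a normal curve within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k):.1%}")
```

The slide's figures are these values rounded.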
Distribution of judges ratings for the
Webby Awards (scale of 1 –10)
[Histogram of ratings, 1.0 to 10.0]
Mean = 6.3, Median = 6.3, Std. Dev = 1.98
Skewness = -.43, Kurtosis = -.201
N = 1867
Distribution of word count on web pages
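[Histogram of word count]
Mean = 348.3, Std. Dev = 384.83
3 SD = (384 * 3) = 1152
Mean - 3 SD = 348 - 1152 = -804
A normal curve with this mean and SD would put nearly a
fifth of the pages below zero words, which is impossible.
So the normal curve is a poor fit here.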
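One way to see why the normal curve fails for the word count data is to ask how much of a normal curve with the same mean and SD would lie below zero. A minimal sketch, using the mean and SD from the slide and Python's statistics.NormalDist; the variable names are assumptions.

```python
from statistics import NormalDist

MEAN, SD = 348.3, 384.83  # word count statistics from the slide

# A normal curve with the same mean and SD as the word count data.
model = NormalDist(mu=MEAN, sigma=SD)

# Share of that curve lying below zero words, which real pages cannot have.
share_below_zero = model.cdf(0)
print(f"share of the normal curve below zero: {share_below_zero:.1%}")
```

A word count can never be negative, so a sizable share below zero signals that the mean and SD are poor summaries of this histogram.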
Measures of Normality
• Visual examination
• Skewness: measure of symmetry
[Figure: three example curves: positively skewed, negatively skewed, symmetric]
Kurtosis: Does it cluster in the
middle?
Kurtosis is based on a distribution’s tails.
Distributions with large tails: leptokurtic
Distributions with small tails: platykurtic
Distributions with normal tails: mesokurtic
[Figure: three example curves: large tail, small tail, normal tail]
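Skewness and kurtosis can be computed from the moments of a list of numbers. The sketch below uses the plain moment formulas, with kurtosis reported as excess over the normal curve (so a normal curve scores 0). Statistics packages often apply small-sample corrections, so their output may differ slightly; the function names and the example lists are assumptions.

```python
def central_moment(data, k):
    """Average of (x - mean) ** k over the list."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Positive: long right tail. Negative: long left tail. Zero: symmetric."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Positive: leptokurtic. Negative: platykurtic. Near zero: mesokurtic."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]     # hypothetical data
right_tailed = [1, 1, 1, 2, 10]  # hypothetical data with one large value
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed), excess_kurtosis(right_tailed))
```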
Positively Skewed and
Leptokurtic: Word Count
[Histogram of word count]
Mean = 393.2, Median = 223, Std. Dev = 725.24
Skewness = 13.62, Kurtosis = 321.84
N = 1903
Distribution of word count
(N=1897) top six removed
[Histogram of word count]
Mean = 368.0, Median = 223, Std. Dev = 474.04
Skewness = 3.49, Kurtosis = 16.40
N = 1897
The Importance of Good
Sampling Techniques
The 1936 election: the Literary
Digest poll
• Candidates: Democrat FD Roosevelt
and Republican Alfred Landon
• The Literary Digest: had called the
winner in every election since 1916
• Its prediction: Roosevelt will get 43%
• Sample Size: 2.4 million people!
The election results
Percentage vote for Roosevelt
• The election result: 62%
• The Digest prediction: 43%
• Gallup’s prediction of the Digest prediction
(sample size = 3,000): 44%
• Gallup’s prediction of the election result
(sample size = 50,000): 56%
The Literary Digest went bankrupt soon after.
George Gallup was just setting up his organization.
Why the Digest went wrong:
How they picked their sample
• Selection Bias: A systematic tendency on the
part of the sampling procedure to exclude one
kind of person or another from sample
• Sample Size: When a selection procedure is
biased, making the sample larger does not
help: it only repeats the mistake on a larger scale
• Non Response Bias: Non respondents differ
from respondents
– For a start, they did not respond, whereas the
respondents did!
– Lower income and upper income people tend not to
respond, so the middle class is over represented.
– Remedy: One can give more weight to
people who were available but hard to get.
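The point that a larger sample does not cure selection bias can be illustrated with a small simulation. This is only a sketch: the 62% figure is the election result from the slide, but treating 43% as the support among the people a biased procedure can reach is an assumption made purely for illustration, as are the sample sizes, the seed, and the function name poll.

```python
import random

TRUE_SUPPORT = 0.62   # election result, from the slide
FRAME_SUPPORT = 0.43  # ASSUMPTION: support among people the biased procedure reaches

def poll(n, support, rng):
    """Share of n randomly drawn respondents who back the candidate."""
    return sum(rng.random() < support for _ in range(n)) / n

rng = random.Random(1936)  # hypothetical seed, only so the run is repeatable
for n in (1_000, 100_000):
    biased = poll(n, FRAME_SUPPORT, rng)  # draws only from the biased frame
    fair = poll(n, TRUE_SUPPORT, rng)     # draws from the whole population
    print(f"n = {n:>7}: biased poll {biased:.1%}, fair poll {fair:.1%}")
```

As n grows, the biased poll settles ever more firmly on the wrong number, while the fair poll settles on the true one.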
For Example: Predicting Elections
– Non Voters: Gallup uses a few questions to predict if
people will vote at all. The election forecast is based only
on those likely to vote.
– Undecided: Ask people who they are leaning
towards as of today.
– Ratio Estimation: Look at the sample obtained, and
compare it to the population. If there are too many
educated people, give them less weight.
– Interviewer Bias: Build redundancy into the
questionnaire to check for consistency. Also
reinterview a small sample to check for consistency.
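Ratio estimation can be sketched as reweighting each group by its population share divided by its sample share. Every number below is invented for illustration, and the group names and function names are assumptions. This shows the general idea, not Gallup's actual procedure.

```python
# ASSUMPTION: all shares, counts and opinions below are invented for illustration.
population_share = {"college": 0.30, "no_college": 0.70}
sample_counts = {"college": 500, "no_college": 500}     # college is over represented
sample_support = {"college": 0.40, "no_college": 0.60}  # share backing the candidate

def group_weights(population_share, sample_counts):
    """Weight for each group = population share / sample share."""
    total = sum(sample_counts.values())
    return {g: population_share[g] / (sample_counts[g] / total) for g in sample_counts}

def weighted_estimate(population_share, sample_counts, sample_support):
    """Estimate of overall support after reweighting the sample."""
    w = group_weights(population_share, sample_counts)
    backing = sum(w[g] * sample_counts[g] * sample_support[g] for g in sample_counts)
    return backing / sum(w[g] * sample_counts[g] for g in sample_counts)

total = sum(sample_counts.values())
raw = sum(sample_counts[g] * sample_support[g] for g in sample_counts) / total
adjusted = weighted_estimate(population_share, sample_counts, sample_support)
print(f"raw estimate {raw:.0%}, weighted estimate {adjusted:.0%}")
```

Here the over represented group gets a weight below 1 and the under represented group a weight above 1, which moves the estimate towards what a representative sample would have shown.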
How much is each sample going to deviate
from the population? (how big is the chance
error for each sample likely to be?)
Computation of Standard Error
Standard Error = SD of sample / √(sample size)
List of numbers: 9, 7, 6, 9, 11, 12 (sample size = 6)
Mean = 9, Standard Deviation = 2.28
Standard Error = 2.28 / √6 = .93
Standard Error varies in inverse proportion to the
square root of the sample size. Therefore, as
the sample size grows bigger, standard error
grows smaller.
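The worked example can be reproduced as follows. This sketch assumes the slide's SD is the sample standard deviation (dividing by n - 1), since that is the version that leads to a standard error of .93.

```python
import math
import statistics

numbers = [9, 7, 6, 9, 11, 12]  # list of numbers from the slide

mean = statistics.mean(numbers)
sd = statistics.stdev(numbers)      # sample SD, divides by n - 1
se = sd / math.sqrt(len(numbers))   # standard error = SD / sqrt(sample size)

print(f"mean = {mean}, SD = {sd:.2f}, SE = {se:.2f}")
```

Making the list four times as long while keeping the same spread would halve the standard error, because the sample size enters through its square root.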
If there is a lot of spread in the samples, the SD is
big and it will be hard to predict how accurate the
sample will be. So the standard error will be big as
well.
Standard Deviation (SD) and Standard Error (SE):
SD refers to a list of numbers. How far are most
numbers from the mean?
SE refers to the variability between samples. How far
is each sample’s estimate likely to be from the population value?
Understanding the computation of the standard
error
Standard Error is directly related to how variable
the numbers in the sample are. Therefore it is
directly related to the standard deviation.
But Standard Error is also related to the sample
size. The larger the sample size, the smaller the
chance error. Therefore it is inversely
related to the square root of the sample size.
Why is knowing chance error
important?
• Allows us to estimate the accuracy of our
estimates and if we are justified in using
inferential statistics.
• Allows us to make inferences about the
population, by accounting for chance error
Three components of a measurement
True Value + Systematic Bias + Chance Error
-you want to get at true value
-you want to eliminate systematic bias
-you want to estimate chance error
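The three components can be made concrete with a small simulation. All values below are hypothetical and chosen only for illustration, as are the names. The sketch shows that averaging many measurements shrinks the chance error but leaves the systematic bias untouched.

```python
import random

# ASSUMPTION: hypothetical values, chosen only for illustration.
TRUE_VALUE = 100.0
SYSTEMATIC_BIAS = 3.0  # the same error in every measurement
CHANCE_SD = 2.0        # spread of the chance error

def measure(rng):
    """One measurement = true value + systematic bias + chance error."""
    return TRUE_VALUE + SYSTEMATIC_BIAS + rng.gauss(0, CHANCE_SD)

rng = random.Random(7)  # hypothetical seed, only so the run is repeatable
readings = [measure(rng) for _ in range(10_000)]
average = sum(readings) / len(readings)
print(f"average of {len(readings)} readings: {average:.2f} (true value {TRUE_VALUE})")
```

The average lands close to the true value plus the bias, not close to the true value, which is why bias has to be eliminated in the design rather than averaged away.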
Estimating Sample
Size:
Should the sample size
for Texas be larger than
that for Rhode Island?
Surprisingly: No
Analogy: Suppose you took a drop of liquid for analysis. If
the liquid is well mixed, then it would not matter if
the liquid was from a small or a large bottle, or
whether the sample is 1% or .1% of the population.
The statistical rationale: The accuracy of sampling is
related to the standard deviation of the sample and the
size of the sample, not to the size of the population.
Example: Election of 1992, % of voters who chose Clinton
46% of voters in New Mexico, SD = .50
37% of voters in Texas, SD = .48
Therefore the accuracy of samples of the same size in
Texas and New Mexico will be similar
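The SD figures in the example follow from the usual formula for a yes/no variable, SD = √(p(1 - p)). The sketch below reproduces them and compares standard errors for a sample of 1,000 voters in each state. That sample size is an assumption made for illustration, as are the function names. The size of the state's population does not appear anywhere in the calculation.

```python
import math

def proportion_sd(p):
    """SD of a yes/no variable where a share p of voters says yes."""
    return math.sqrt(p * (1 - p))

def proportion_se(p, n):
    """Standard error of a sample share based on n voters."""
    return proportion_sd(p) / math.sqrt(n)

SAMPLE_SIZE = 1_000  # ASSUMPTION: hypothetical sample size, same in both states
for state, p in (("New Mexico", 0.46), ("Texas", 0.37)):
    sd = proportion_sd(p)
    se = proportion_se(p, SAMPLE_SIZE)
    print(f"{state}: SD = {sd:.2f}, SE with n = {SAMPLE_SIZE}: {se:.3f}")
```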
Types of Samples
• The convenience sample: The most convenient
elementary units are chosen from a
population.
• The judgement sample: Units are chosen
according to judgement made by someone
who is familiar with the relevant
characteristics of the population.
• The random sample: Units are chosen
randomly with a known probability.
• Quota Sampling: Each interviewer
is assigned a fixed quota of
subjects fitting certain
demographic characteristics.
Within the quota, the interviewer
picks subjects by judgement.
– Problems: quotas might not be
representative, and judgement
sampling is prone to bias.
Types of Random Sample
• Simple Random Sample: Every unit of the
population has an equal chance of being
chosen.
• A systematic random sample: One unit is
chosen on a random basis, additional
elementary units are taken from evenly
spaced intervals until the desired number of
units is obtained.
• The stratified random sample:
Obtained by independently selecting a
separate simple random sample from each
population stratum. A population can be
divided into different groups based on
some characteristic or variable like income
or education.
• The cluster sample: Obtained by
selecting clusters from the population on
the basis of simple random sampling. The
sample comprises a census of each random
cluster selected. For example, a cluster may
be something like a village, a school, or a
state.
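The four types of random sample can be sketched in a few lines each. The population of 100 numbered units, the two income strata, the ten villages, and all function names are hypothetical, chosen only to make the differences visible.

```python
import random

def simple_random_sample(units, n, rng):
    """Every unit has an equal chance of being chosen."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Random starting unit, then every step-th unit after it."""
    step = len(units) // n
    start = rng.randrange(step)
    return [units[start + i * step] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """A separate simple random sample from each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random, then take every unit in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

# ASSUMPTION: a hypothetical population of 100 numbered units.
rng = random.Random(42)
units = list(range(100))
strata = {"lower income": units[:60], "higher income": units[60:]}
clusters = {f"village {i}": units[i * 10:(i + 1) * 10] for i in range(10)}

print("simple:    ", simple_random_sample(units, 10, rng))
print("systematic:", systematic_sample(units, 10, rng))
print("stratified:", stratified_sample(strata, 5, rng))
print("cluster:   ", cluster_sample(clusters, 2, rng))
```

The systematic sample is evenly spaced, the stratified sample is guaranteed to include units from every stratum, and the cluster sample contains every unit of the chosen villages and none from the others.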