Download primer - University of St. Thomas

Statistics
Primer
Thomas P. Sturm, Ph.D.
University of St. Thomas
St. Paul, Minnesota
Contents:
Part I - The Nature of Statistics
Part II - Presentation
Part III - Descriptive Statistics
Part IV - Data Collection
Copyright © 1971-1997 Thomas P. Sturm
All rights reserved
MINITAB and Minitab for Windows are registered trademarks of
Minitab, Inc.
Table of Contents
The Nature of Statistics
What is statistics
Types of data
Scales of measurement
Copyright © 1993-97 Thomas P. Sturm
What is Statistics
Statistics is:
The SCIENCE of
a. COLLECTING,
b. Classifying / Presenting / Tabulating / Describing, and
c. INTERPRETING,
NUMERICAL Data
All three areas will be covered:
Collecting - Chapter 3
Describing - Chapters 1 and 2
Interpreting - bulk of the course
Course Goal:
To produce good "statistical consumers"
Collecting Data
Data must be collected with a purpose - to find information about a
designated group of people/places/things/events
POPULATION - the collection of ALL objects that are of interest
- must be carefully defined
- must be able to determine under all circumstances whether
something is in the population or not
e.g. employees - current? fired? retired? part-time?
Problem: It's usually just too expensive (or impossible) to get the
information for all objects in a population (a CENSUS)
SAMPLE - a subset of the population used to find information
about the entire population
- more economical
- with care, can obtain an accurate picture of the population
So, to get information about the population, we take a sample and
find information about the things in the sample
Variables in Statistics
PROPERTY
An attribute that is relevant for all things in the population
(and therefore the sample)
e.g. height, weight, color, result of casting a die, beauty
VARIABLE
Any characteristic that can be measured for all things in the
population
e.g. height (in inches), weight (in pounds), color (a word),
# of spots on a die
OBSERVATION
A VALUE for a variable is assigned through a process of
MEASUREMENT
e.g. use a ruler to MEASURE a VALUE of 6'4" as the
OBSERVED height of a basketball player
POSSIBLE VALUES
values that COULD be obtained
e.g. 0 to 100% on an exam
OBSERVED VALUES
values that are actually obtained in the current instance
e.g. 97%, 92%, 84%, 63% in a class of 4 students
Types of Data
QUALITATIVE
ATTRIBUTE or CATEGORICAL data
useful only to place individuals into categories
(e.g. Earthlings, Martians)
QUANTITATIVE
DISCRETE
a finite set of values
e.g. number of students
CONTINUOUS
an infinite set of values in a bounded range
e.g. height of students
But statistics only deals with NUMERICAL data, (and
MEASUREMENT assigns a numerical value to a
VARIABLE,) so, for QUALITATIVE data, part of the
measurement process is to assign a number to each
attribute value
e.g. SEX - 1=male, 2=female, etc.
Thus, as part of the measurement process, everything gets a
number. But what can you DO with those numbers ???
Scales of Measurement
Nominal Scale (Qualitative data)
e.g. 1=male, 2=female
come from qualitative (attribute) data
can only count how many of each value you have to obtain
FREQUENCY data
cannot sort, add, subtract, multiply, or divide the numbers
Ordinal Scale (Ordinal data)
e.g. 1=never, 2=occasionally, 3=frequently, 4=always
come from a condensation of quantitative data where asking
for specific numbers would not be accurate
can sort in addition to count, 1 < 2
cannot add, subtract, multiply, or divide the numbers
Interval Scale (Metric data)
e.g. temperature in Fahrenheit
come from quantitative values that are measured against
arbitrary starting points
can subtract in addition to sorting and counting, 24 outside,
72 inside, 48 degrees warmer inside
cannot add, multiply, or divide the numbers
Ratio Scale (Metric data)
e.g. number of courses taken, any FREQUENCY data, rates
come from quantitative values that have "natural" zeroes
0 is meaningful, Pat took 6 courses, Chris took 2 courses, Pat
took 3 times as many courses as Chris
can perform all operations
Distribution/Variation
In general, not all of the measurements yield the same value. This
could be because of different measurements of the same
thing or measurement of different members of a sample.
This is called VARIATION.
The values of the data have some sort of a DISTRIBUTION which
characterizes where in the range of POSSIBLE values the
OBSERVED values most frequently fall.
Much of descriptive statistics deals with finding simple ways
(perhaps as simple as a single number) of describing the
distribution.
Nominal and ordinal data allow the least amount of mathematical
manipulation, so the description of nominal and ordinal
data is limited to counting the frequencies of the
observations (and sorting the observations if on an ordinal
scale) and then presenting the counts.
Presentation
Properties of a Table
Bar charts
Pie charts
Graphs
Statistical lying
Tally sheets
Frequency histograms and polygons
Stem and Leaf diagrams
Copyright © 1993-97 Thomas P. Sturm
Properties of a Table
Every statistical table worthy of our consumption should have, at a
minimum, the following:
TITLE - states exactly and as succinctly as possible what the
entries are
CAPTION - distinguishes one body of information from another
STUB - distinguishes each member of a series from another
BODY - contains the actual figures
UNITS of MEASUREMENT - somewhere in the table (caption,
stub, or title) the units must be given (number, dollars,
millions, etc.)
SOURCE - describes where the information was obtained
Without ALL of these, a table loses its credibility - because it is the
kind statistical liars use
Example Statistical Table
Governmental Per Capita Tax Revenue
(Dollars)
Year    State and Local    Federal    Total
1960        200.66          427.81    628.47
1962        223.62          442.69    666.32
1964        249.75          473.03    722.78
1965        266.11          483.49    749.61
Source: Statistical Abstract of the U.S., 1966, 87th ed.
(Washington, D.C., 1966), p. 417.
Presenting Categorical Data
Categorical (nominal) data can only be counted. You cannot
"average" it, subtract it, or divide it. However, you can
present it in a wide variety of ways.
A survey of 120 University of St. Thomas students on the question:
"Given the choice between the two, do you prefer to eat at
Scooters or in the Grill?"
Response      Frequency
Grill             74
Scooters          37
No opinion         9
Total            120
Source: Postoffice box mail survey taken by QMCS 220 students
during spring semester, 1993.
The stub/caption/body of the table could have been presented in a
variety of other ways, e.g.:
- It could have included relative frequencies as well
Response      Frequency    Relative Frequency
Grill             74             .62
Scooters          37             .31
No opinion         9             .07
Total            120            1.00
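The relative frequencies can be reproduced with a few lines of Python. This is only a sketch: the dictionary name `counts` and the print layout are illustrative choices, not part of the original survey write-up.

```python
# Frequencies taken from the Grill / Scooters survey table.
counts = {"Grill": 74, "Scooters": 37, "No opinion": 9}

total = sum(counts.values())

# Relative frequency = class frequency / total number of responses.
relative = {response: n / total for response, n in counts.items()}

for response, n in counts.items():
    print(f"{response:<12}{n:>5}{relative[response]:>8.2f}")
print(f"{'Total':<12}{total:>5}{sum(relative.values()):>8.2f}")
```

Because each relative frequency is rounded for display, the printed values may not add to exactly 1.00 even though the unrounded values do.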
Histograms and Bar Charts
- It could have been done as a (vertical) histogram
[Figure: vertical histogram. Vertical axis: frequency, 0 to 80. Bars: Grill 74, Scooters 37, No Opinion 9.]
- It could have been done with a horizontal bar chart
[Figure: horizontal bar chart. Horizontal axis: frequency, 0 to 80. Bars: No Opinion 9, Scooters 37, Grill 74.]
Pie Charts
- It could have been done with a pie chart
[Figure: pie chart. Grill 61%, Scooters 31%, No Opinion 8%.]
- It could have been done with a picture graph
[Figure: picture graph with one row of symbols each for Grill, Scooters, and No Opinion.]
Graphing a Statistical Lie
- The following is NOT a valid way of presenting the same
information:
[Figure: misleading graph of the Grill, Scooters, and No Opinion responses.]
Becoming a Better Statistical Consumer
1. Consider the source
How many details of the study are given?
- study with full details included
- study with details available
- informed opinion
- opinion poll
Who did the study?
- name identification, independent organization
- audited and published by respected publisher
- special interest organizations
- self-interest groups
How stable is the data?
historical
current
forecast
2. Does it make sense?
Is it an "offhand" percentage?
Is it a "less than" comparison? Less than what?
Does it lack internal consistency?
e.g. percents of 20 items should be multiples of 5
Does it have too much precision, regularity, or even #s?
Is it plausible?
Is the arithmetic correct?
3. Are the conclusions correct?
Are the survey results consistent with the conclusions?
Have the definitions remained consistent over time?
Are there wrong interpretations of right results?
Are there confusing counts and percentages?
Presenting Ratio Data
Consider the following data:
Survey of number of hours taken to learn the skill of sending
and receiving international Morse code at a rate of 13
words per minute:
 80   90   88   73   98   89   52
 69   63   97   78   88   92   83
 78  109   98   64   75   94   83
100   76   81   82   67   85   85
100   70   96   61   75   95   58
105   70   96   81   88  108   64
Generally, we want to see how the OBSERVATIONS are spread
out across the range of POSSIBLE VALUES
The data above in random order is of little use
Tally Sheet
Put the data into groups of adjacent numbers, describing the
situation by enumerating the number of occurrences in
each range of numbers or CLASS
The number of occurrences is called the FREQUENCY or class
frequency
We would like to use "convenient" numbers to divide the data into
groups, and we would like to end up with 6 to 14 intervals
for under 100 observations, 7 to 15 intervals for over 100
observations.
Lower class   Upper class   Class              Class
limit         limit         (Class Interval)   mark    Tally             Frequency
 50            59            50-59              54.5   //                    2
 60            69            60-69              64.5   ///// /               6
 70            79            70-79              74.5   ///// ///             8
 80            89            80-89              84.5   ///// ///// //       12
 90            99            90-99              94.5   ///// ////            9
100           109           100-109            104.5   /////                 5
Note: No overlap between the groups, groups of equal size,
difference between upper class limit of one group and
lower class limit of the next is the "smallest unit" of
measure.
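The tally can be checked mechanically. The following is a minimal sketch, assuming a class width of 10 starting at a multiple of 10 (as in the table); the names `hours`, `CLASS_WIDTH`, and `frequency` are illustrative.

```python
from collections import Counter

# Hours needed to learn Morse code at 13 words per minute (42 observations).
hours = [80, 90, 88, 73, 98, 89, 52, 69, 63, 97, 78, 88, 92, 83,
         78, 109, 98, 64, 75, 94, 83, 100, 76, 81, 82, 67, 85, 85,
         100, 70, 96, 61, 75, 95, 58, 105, 70, 96, 81, 88, 108, 64]

CLASS_WIDTH = 10  # classes 50-59, 60-69, ..., 100-109

# Map each observation to the lower class limit of its class, then count.
frequency = Counter((h // CLASS_WIDTH) * CLASS_WIDTH for h in hours)

for lower in sorted(frequency):
    upper = lower + CLASS_WIDTH - 1      # upper class limit
    mark = (lower + upper) / 2           # class mark (midpoint)
    tally = "/" * frequency[lower]
    print(f"{lower}-{upper}  mark {mark}  {tally}  {frequency[lower]}")
```

Integer division places every observation in exactly one class, which is the "no overlap" property the note asks for.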
Frequency Histogram
[Figure: frequency histogram of the Morse code data. Vertical axis: frequency, 0 to 12. Horizontal axis: class marks from 4.5 to 114.5.]
Note: Adjacent vertical bars are connected (compare with
disconnected bars for categorical data). Values on
horizontal axis refer to midpoint of vertical bar.
Frequency Polygon
[Figure: frequency polygon of the Morse code data. Vertical axis: frequency, 0 to 12. Horizontal axis: class marks from 4.5 to 114.5.]
Note: Must plot data for two classes beyond the range of the data
(in this case 40-49 and 110-119) that have a frequency of
0. Must plot the frequency at the class mark.
Stem and Leaf Diagram
 5   28
 6   134479
 7   00355688
 8   011233558889
 9   024566788
10   00589
or
 5*  2
 5-  8
 6*  1344
 6-  79
 7*  003
 7-  55688
 8*  011233
 8-  558889
 9*  024
 9-  566788
10*  00
10-  589
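The first (one line per stem) diagram can be built as follows. This is a sketch under the assumption that the stem is the tens part and the leaf is the last digit; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Return {stem: leaves as a string}, using the last digit as the leaf."""
    rows = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)   # e.g. 109 -> stem 10, leaf 9
        rows[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in rows.items()}

# The Morse code learning times (hours) used in the tally sheet.
hours = [80, 90, 88, 73, 98, 89, 52, 69, 63, 97, 78, 88, 92, 83,
         78, 109, 98, 64, 75, 94, 83, 100, 76, 81, 82, 67, 85, 85,
         100, 70, 96, 61, 75, 95, 58, 105, 70, 96, 81, 88, 108, 64]

for stem, leaves in stem_and_leaf(hours).items():
    print(f"{stem:>2}  {leaves}")
```

Sorting first means the leaves on each stem come out in increasing order, so the diagram doubles as a sorted listing of the data.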
Another Stem and Leaf Diagram
Consider the following data:
10  15  12  11  10  15  19  14  13  13
13  25  22  17  18  14  20  24  20  22
16  18  19  21  23  18  17  19  19  15
20  11  16  14  13  22  10  18  16  12
Stem and Leaf Diagram to obtain 6 to 14 groups:
1*  00011
1T  223333
1F  444555
1S  66677
1.  88889999
2*  0001
2T  2223
2F  45
Minitab Commands to Learn
(Command Driven Version)
HELP
EXIT
PAPER
NOPAPER
READ
PRINT
NAME
SET
LET
HISTOGRAM
DOTPLOT
STEM-AND-LEAF
DESCRIBE
MEAN
...
Minitab Commands to Learn
(Menu Driven Version)
File
Open
Save
Save as
Worksheet description
Print window
Manipulate
Delete rows
Erase variables
Calculate
Column statistics
Statistics
Basic statistics
Descriptive statistics
Graphs
Correlation
Covariance
Normality Test
EDA (Exploratory Data Analysis)
Stem-and-leaf
Boxplot
Graph
Plot (1 variable vs. another)
Chart (sum of 1 variable vs. another)
Histogram (1 variable with values)
Boxplot (category vs. values)
Pie Chart (2 varieties)
Help
Descriptive
Statistics
Measures of Central Tendency
Measures of Dispersion
Box Plots
Copyright © 1995 - 1997 Thomas P. Sturm
Home Run Data
Babe Ruth (sorted)
22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
Roger Maris (sorted)
8, 13, 14, 16, 23, 26, 28, 33, 39, 61
Histograms:
[Figures: frequency histograms of home runs per year, one for Ruth (horizontal axis 20 to 60) and one for Maris (horizontal axis 10 to 60).]
Stem-and-Leaf Plots
Stem-and-leaf of Ruth    N = 15
Leaf Unit = 1.0
   1    2  2
   2    2  5
   3    3  4
   4    3  5
   6    4  11
  (5)   4  66679
   4    5  44
   2    5  9
   1    6  0
Stem-and-leaf of Maris   N = 10
Leaf Unit = 1.0
   1    0  8
   4    1  346
  (3)   2  368
   3    3  39
   1    4
   1    5
   1    6  1
Side-by-side Stem and leaf plots:
Ruth              Maris
                  0  8
                  1  346
2  25             2  368
3  45             3  39
4  1166679        4
5  449            5
6  0              6  1
Measures of Central Tendency
Mode - most frequently occurring observation from a set of
grouped data.
Ruth: about 45; Maris: about 20
Mean - (arithmetic mean) - the sum of the observations divided by
the number of observations.
Ruth hit 659 home runs in 15 years or 659/15 = 43.93
Maris hit 261 home runs in 10 years or 261/10 = 26.1
Median - the number in the center from a set of sorted data
Ruth:
22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60
Maris:
8, 13, 14, 16, 23, 26, 28, 33, 39, 61
Thus for Ruth the median is 46.00; for Maris the median is
(23 + 26)/2 = 24.5
Midrange - the average of the lowest and the highest observations
Ruth: (22 + 60)/2 = 41; Maris: (8 + 61)/2 = 34.5
5% Trimmed mean - the mean after trimming off the highest 5%
and lowest 5% of the values
Ruth: 577/13 = 44.38; Maris: 192/8 = 24
Resistance to outliers: The midrange is the least resistant. The
mean is also not resistant. The trimmed mean is resistant
to up to 5% outliers at each end. The mode is generally
resistant unless there is a cluster of outliers. The median is
totally resistant.
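The figures above can be reproduced with Python's standard `statistics` module. This is a sketch, not the method used to prepare the primer: `midrange` and `trimmed_mean` are hypothetical helpers, and the rule for how many values to trim (5% of n, rounded half up) is an assumption chosen because it matches the 577/13 and 192/8 calculations. The mode is left out because the primer reads it from grouped data.

```python
import statistics

ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

def midrange(data):
    """Average of the smallest and largest observations."""
    return (min(data) + max(data)) / 2

def trimmed_mean(data, fraction=0.05):
    """Mean after dropping k values from each end of the sorted data.

    Assumption: k is fraction * n rounded half up (0.75 -> 1, 0.5 -> 1).
    """
    ordered = sorted(data)
    k = int(fraction * len(ordered) + 0.5)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

for name, data in (("Ruth", ruth), ("Maris", maris)):
    print(name,
          "mean", round(statistics.mean(data), 2),
          "median", statistics.median(data),
          "midrange", midrange(data),
          "trimmed mean", round(trimmed_mean(data), 2))
```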
Measures of Dispersion
Range - the difference between the largest and smallest observation
Ruth: 60 - 22 = 38; Maris: 61 - 8 = 53
No resistance to outliers
Interquartile range (IQR) - the difference between the third quartile
and the first quartile
Ruth: Q1 = 35, Q3 = 54, Q3 - Q1 = 19
Maris: Q1 = 14, Q3 = 33, Q3 - Q1 = 19
Very high resistance to outliers
Five-Number summary
Minimum, Q1, Median, Q3, Maximum
Ruth: 22, 35, 46, 54, 60; Maris: 8, 14, 24.5, 33, 61
Outliers - Observations more than 1.5 x IQR below Q1 or more
than 1.5 x IQR above Q3
Boxplot
Ends of box are at the quartiles, line within the box marks the
median, two lines (whiskers) extend to smallest and largest
observations.
Modified Boxplot
Ends of box are at the quartiles, line within the box marks the
median, two lines (whiskers) extend to smallest and largest
observations that are not outliers. Outliers are plotted
individually and separately.
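A short sketch of the five-number summary and the 1.5 x IQR outlier rule follows. It assumes the quartiles are the medians of the lower and upper halves of the sorted data (leaving out the middle value when n is odd), which reproduces the Q1 and Q3 values quoted above; other conventions exist and can give slightly different quartiles. The function names are hypothetical, and the sketch needs at least two observations.

```python
import statistics

ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]     # observations below the median position
    upper = ordered[-half:]    # observations above the median position
    return (ordered[0], statistics.median(lower), statistics.median(ordered),
            statistics.median(upper), ordered[-1])

def outliers(data):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

print(five_number_summary(ruth), outliers(ruth))
print(five_number_summary(maris), outliers(maris))
```

A modified boxplot would draw its whiskers to the most extreme observations that `outliers` does not return.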
Boxplots
[Figures: boxplot of Ruth (axis 20 to 60) and boxplot of Maris (axis 10 to 60).]
Variance and Standard Deviation
These measures of dispersion are only valid when the mean is used
as the measure of central tendency.
They are not as resistant to outliers as the median, but have the best
theoretical properties when the distribution is “well
behaved”
Variance: Intuitively, the adjusted average square of the
differences from the mean. This is a theoretical measure.
Ruth: s² = (1/14)[(22-43.93)² + (25-43.93)² + (34-43.93)² +
(35-43.93)² + (41-43.93)² + (41-43.93)² + (46-43.93)² +
(46-43.93)² + (46-43.93)² + (47-43.93)² + (49-43.93)² +
(54-43.93)² + (54-43.93)² + (59-43.93)² + (60-43.93)²] =
(480.9 + 358.3 + 98.60 + 79.74 + 8.585 + 8.585 + 4.285 +
4.285 + 4.285 + 9.425 + 25.70 + 101.4 + 101.4 + 227.1 +
258.2)/14 = 1771/14 = 126.5
Maris: s² = (1/9)[(8-26.1)² + (13-26.1)² + (14-26.1)² +
(16-26.1)² + (23-26.1)² + (26-26.1)² + (28-26.1)² +
(33-26.1)² + (39-26.1)² + (61-26.1)²] = (327.6 + 171.6 +
146.4 + 102.0 + 9.610 + .01000 + 3.610 + 47.61 + 166.4 +
1218)/9 = 2193/9 = 243.7
Standard deviation: The square root of the variance. This is an
interpretive measure.
Ruth: 126.5 = 11.25; Maris: 243.7 = 15.61
Computation of the Variance as a Spreadsheet
Consider the data for Maris. This can be placed in tabular form
for ease of computation (by hand, by spreadsheet, or by
Minitab).
The work is done a column at a time, from left to right. The
results from each column, in general, are used in the
computations in the next column.

  x      x - x̄     (x - x̄)²
  8      -18.1      327.6
 13      -13.1      171.6
 14      -12.1      146.4
 16      -10.1      102.0
 23       -3.1      9.610
 26        -.1      .01000
 28        1.9      3.610
 33        6.9      47.61
 39       12.9      166.4
 61       34.9      1218.
261        0.0      2193.     (column sums)
x̄ = Σx / n = 261 / 10 = 26.1
s² = Σ(x - x̄)² / (n - 1) = 2193 / 9 = 243.7
Shortcut formula for variance
Use the shortcut formula for variance if you are going to program
the calculation or do it by hand
1. Calculate sum, square the sum.
2. Calculate square of each observation, sum the squares.
3. Divide the result in part 1 by the number of observations.
4. Subtract the result in part 3 from the result in part 2
5. Divide the result in part 4 by one less than the number of
observations.
Ruth: 1. Sum is 659. Square of sum is 434281.
2. 22² + 25² + 34² + 35² + 41² + 41² + 46² + 46² +
46² + 47² + 49² + 54² + 54² + 59² + 60² = 484 + 625 +
1156 + 1225 + 1681 + 1681 + 2116 + 2116 + 2116 + 2209
+ 2401 + 2916 + 2916 + 3481 + 3600 = 30723
3. 434281/15 = 28952
4. 30723 - 28952 = 1771
5. 1771/14 = 126.5
Note: Extra accuracy is needed in steps 1 to 3 because it is
anticipated that the two numbers subtracted in part 4 will
be “nearly” equal.
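The five steps can be sketched in Python as follows; the function name `shortcut_variance` and the intermediate variable names are illustrative. With whole-number data, steps 1 and 2 are exact because Python integers do not overflow or round, which addresses the accuracy concern in the note for those steps.

```python
ruth = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49, 54, 54, 59, 60]

def shortcut_variance(data):
    n = len(data)
    square_of_sum = sum(data) ** 2               # step 1
    sum_of_squares = sum(x * x for x in data)    # step 2
    correction = square_of_sum / n               # step 3
    numerator = sum_of_squares - correction      # step 4
    return numerator / (n - 1)                   # step 5

print(sum(ruth) ** 2)                   # step 1 for Ruth
print(sum(x * x for x in ruth))         # step 2 for Ruth
print(round(shortcut_variance(ruth), 1))
```

For non-integer data the subtraction in step 4 can lose precision in floating point, so the definitional (two-pass) calculation is often the safer choice in a program.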
Output from the Minitab DESCRIBE Command
Variable     N     Mean   Median  Tr Mean    StDev  SE Mean
Ruth        15    43.93    46.00    44.38    11.25     2.90
Maris       10    26.10    24.50    24.00    15.61     4.94

Variable      Min      Max       Q1       Q3
Ruth        22.00    60.00    35.00    54.00
Maris        8.00    61.00    13.75    34.50

Variable    Name of the variable (or C# if unnamed)
N           Number of observations
Mean        Arithmetic mean
Median      Median
Tr Mean     5% Trimmed mean
StDev       Standard deviation
SE Mean     Standard error of the mean (not covered until Chapter 4)
Min         Minimum value
Max         Maximum value
Q1          First quartile
Q3          Third quartile
Data Collection
Measurement
Sampling Methods
Survey Design
Designing Experiments
Copyright © 1993-97 Thomas P. Sturm
Experimentation / Measurement
- A method of determining a specific value for a variable
- To have VALIDITY, must ensure that the variable used to
represent the property is relevant or appropriate.
e.g. asking for height in gallons is not appropriate
- To accurately portray the characteristics of the population, use an
INSTRUMENT that possesses the following
characteristics:
UNBIASED
Bias is a systematic tendency to misrepresent (overstate or
understate the true value) the data in some way
e.g. AGE on the survey - biased 1/2 year low
PRECISE
Lack of precision causes observed values obtained through
the measurement process to be somewhat distant or scattered
from their "true" values
e.g. EARNINGS LAST SUMMER - how many
responses would stand up to an IRS audit for accuracy
RELIABLE
Unreliable results are those which would be quite different
if the experiment/observation were made again under
"identical" circumstances
e.g. PICK A NUMBER FROM 1 to 10 - how many
would pick the same number again and again
Sampling
Sampling is the process by which we select the sample that we are
going to measure.
The sampling itself must be done to provide an unbiased, reliable,
and precise estimate of the values of the population it is
intended to represent.
A CENSUS is actually an attempt to "sample" the entire
population, and is generally expensive
- could be impossible (flash bulb testing??)
- could be inaccessible (homeless??)
IF you expend enough effort to get everything, it is the most
accurate, reliable, and unbiased
A CONVENIENCE SAMPLE is a sample of whatever is the
easiest to measure
- students in a class
- what you happen to have on hand
generally the most prone to inaccuracy and unreliability,
and very likely to be biased
A SELF-SELECTED SAMPLE is a sample of people who "choose
themselves" to be in the survey
- phone-ins to 900 numbers
- mail-back surveys without follow-up
generally the most prone to bias, and very likely to be
inaccurate and unreliable
Sampling Methods
The following sampling methods, when used with care within their
limits of applicability, can produce unbiased, reliable, and
precise results
SIMPLE RANDOM SAMPLE (SRS) - every member of the
population has exactly the same probability of being
selected
- can be hard to make the probabilities exactly equal
- can miss an accurate description of "rare" subsets
- could still be expensive
STRATIFIED SAMPLE - divide the population into "strata" and
then perform an SRS on each stratum
- e.g. healthy adults vs. those with a rare disease
- need to know relative sizes of the strata
SYSTEMATIC SAMPLE - start at a random point, and then select
every kth item
e.g. for a sample of 1/10th of the population at an event
that issued numbered tickets, pick at random a digit from 0
to 9, and then include everyone whose ticket number
ended with that digit
- could be just as expensive as SRS
CLUSTER SAMPLE - pick, at random, areas or regions or groups
of the population, then perform a census within each group
- least expensive of the above methods
- must have enough areas to avoid unreliability
- must carefully check results between groups for bias
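The four methods can be sketched with Python's standard `random` module. Everything named here (the function names, the ticket population, the seed) is a hypothetical illustration, and real surveys need more care than this, for example proportional allocation across strata.

```python
import random

def simple_random_sample(population, n, rng):
    """SRS: every member has the same chance of selection."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """strata maps a stratum name to its list of members; SRS within each."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def systematic_sample(population, k, rng):
    """Random start among the first k items, then every kth item."""
    start = rng.randrange(k)
    return population[start::k]

def cluster_sample(clusters, n_clusters, rng):
    """Pick clusters at random, then take a census within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(42)                 # arbitrary seed, for repeatability
tickets = list(range(1000))             # numbered tickets at an event
tenth = systematic_sample(tickets, 10, rng)
print(len(tenth), sorted({t % 10 for t in tenth}))
```

With consecutively numbered tickets and k = 10, the systematic sample is exactly the ticket example above: everyone whose number ends in the randomly chosen digit.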
Measurement Errors
- Reporting errors (in 1950 survey, average age of women over 40
was under 40 years old)
- Recording errors (random transcription errors)
- Unit of measurement errors (some in dollars, some in cents; some
per unit, some per six-pack, some per case of 24)
Suggestion: pick a convenient unit of measurement, perhaps
through the use of a consistently applied coding technique
- Processing errors (performing mathematical operations
inappropriate for the scale of measurement of the data)
- Non-response errors (no response from selected groups)
- Errors in doing the sampling
- Errors in adjusting data from stratified samples
- Must accurately classify each response into the appropriate
stratum
- Must know the proportion of people actually in each stratum
- Must properly "scale back" the responses from the
"overrepresented" strata to derive population statistics
Survey Design
When designing a survey, you must look ahead to the
administration of the survey, the collection of results, the
tabulation of results, the analysis of results, and the
interpretation of results.
To do this successfully, you must ask some basic questions:
- Why am I doing the survey? What specific facts do I hope to
learn more about? What variables might be used to
measure those facts and what variables might influence
those facts?
- Who am I going to survey? What is my population? Am I
surveying people, or doing experiments with physical
objects? Can I realistically obtain the kind of sample I
want from that population at reasonable cost?
- What questions will I ask? Some questions need to address the
specific facts I hope to learn more about, while other
questions need to "consider the source." These latter
questions are called "demographic" questions. For
example, if I want to learn more about pop consumption
on campus, in addition to asking questions about how
much pop is consumed, when it is consumed, where it is
purchased, where it is consumed, diet or regular, etc., I
might also want to ask demographic questions such as age,
class year, sex, weight, day student or boarder, etc.
- The remaining questions are the form of the survey, how many
surveys to administer, and how to ask the questions.
Form of the Survey
Direct measurement in the laboratory:
+ Most accurate and reliable
+ Least subject to unknown influences
- Can be expensive, many times impractical
Direct personal interviews:
+ Consistent measurement if interviewers are well trained
- Time consuming
- Hard to get a random sample
Telephone interviews:
+ Somewhat consistent measurement with skilled
interviewers
+ Less expensive than lab setting or face-to-face interviews
- Lower response rate due to hang-ups and no answers even
after many callbacks
- Incomplete surveys due to hang-ups
Mail surveys:
+ Quick
+ Least expensive
- Lowest response rate (10% to 20%)
- Great deal of opportunity to misinterpret questions
- No idea if person is informed about the subject you are
studying
How Many Surveys to Administer
Determining the number of surveys to administer depends upon the
following factors:

What form of survey is being done?
- you need to send out about 10 times as many mail surveys
as you would need lab participants

For nominal demographic data, how many categories does
the data divide into (maximum over all questions)?
- you need 5 times as many completed surveys if you have a
demographic question that has 10 possible nominal
responses than if you have a survey all of whose
demographic questions on a nominal scale are logically
divided into two categories (e.g. male/female, boarder/day
student, graduate/undergraduate, yes/no, etc.)

How accurate do the results need to be?
- the more data, the more accurately the measurement can be
done

How much statistics do you know and how much
professional statistical assistance can you afford?
- the SMALLER the sample, the MORE work it is to
accurately draw conclusions from the data
- ideally, you want 30 completed surveys PER
DEMOGRAPHIC GROUP for the demographic question
measured on the nominal scale with the most possible
response values.
Example Calculation of Number of Surveys to
Administer
Example 1:
We are doing a laboratory experiment in which all of the
demographic questions on a nominal scale are yes/no.
We will need to ensure 30 yes responses and 30 no
responses to each demographic question. However, since
we are controlling the demographics, we could get by with
as few as 60 well-crafted experiments.
Example 2:
We are doing a long mail survey in which one of the
demographic questions is on a nominal scale and has 10
possible responses. We need 300 completed surveys at a
minimum to get 30 for each value on the nominal scale.
However, we have no control over the response, so we
could need up to twice that many (600) completed surveys.
Mail survey response is in the 10% to 20% range, but our
survey is long, so we should expect to be on the low end of that
range. Since we could also use the additional responses if
they come in, we assume a 10% response rate. This
means we should send out about 6000 surveys.
In both of these examples, our ultimate results are likely to have
comparable (low) accuracy, but the statistical analysis
required (because of 30 in each group) will be manageable
without advanced statistical methods.
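The arithmetic in the two examples can be captured in one small function. This is a sketch: the function name and its parameters (`safety_multiplier` for the "up to twice that many" allowance, `response_percent` for the expected response rate) are illustrative assumptions, not terms from the primer.

```python
import math

def surveys_to_send(categories, per_group=30, safety_multiplier=1,
                    response_percent=100):
    """Surveys to administer so each nominal category gets enough responses.

    categories        -- most response values on any nominal demographic question
    per_group         -- completed surveys wanted per category
    safety_multiplier -- extra factor when the demographic mix is not controlled
    response_percent  -- expected response rate, as a whole-number percentage
    """
    completed = categories * per_group * safety_multiplier
    return math.ceil(completed * 100 / response_percent)

print(surveys_to_send(2))                                               # Example 1
print(surveys_to_send(10, safety_multiplier=2, response_percent=10))    # Example 2
```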
How to Ask (Phrase) Survey Questions
Many times the measurement process does more to determine the
scale of measurement than the nature of the property being
measured.
Always strive for ratio data or as close to it as possible.
The method of asking a question, establishing a "ruler," or
selecting units of measurement can have a dramatic effect
on the scale of measurement (and the measurement errors)
of the resulting values.
Examples:
Temperature:
Fahrenheit - interval
Kelvin - ratio
Age:
Young/Old check boxes - nominal (unreliable??)
Traditional College Age/Older check boxes - ordinal
Birthdate - can be converted to ratio without bias
In general, try to control the responses to things that can easily be
converted to numerical values on as high a scale of
measurement as possible, and try to provide as much
"calibration" as possible so that the values are being
measured as consistently as possible between subjects.
Miscellaneous Survey Mechanics
Different respondents may interpret the question differently. This
variation needs to be eliminated with totally unambiguous
language. This is best tested by piloting a survey and
then asking respondents how they interpreted the question.
All respondents must understand the question. The variation in
reading ability needs to be factored out by targeting the
reading level to below grade 8 level difficulty. This can
be measured by computer software.
Respondents may “give up” in the middle of a survey. Make sure
the important questions are asked first, and the less
important questions (less important demographics, for
example) are at the end.
Designing Experiments
We may be able to measure a phenomenon by subjecting different
experimental units (usually called subjects in this context,
especially when they are human) to different stimuli
(usually called treatments in this context).
This frequently takes a form similar to finding relationships in
categorical data, namely, we attempt to explain the value
of a response variable by noting differences in an
explanatory variable.
The difference in designing experiments is that as designers we
have control over the values of the explanatory variables,
which are called factors in this context.
The experiment is performed by combining specific values (usually
called levels in this context) for each of the explanatory
variables, and measuring the resulting response.
We need for each factor a “control group” that measures the
“normal” outcome when “no treatment” is given.
We need to eliminate confounding variables due to chance. This
is generally done by placing subjects into experimental
groups at random.
Where humans are involved (in any way) we need to eliminate
subjective bias by not allowing subject or researcher to
know what treatment is being received. This is known as
a double-blind experiment.
Calculating the Number of Participants
The crucial factor in experimental design is to avoid combinatorial
explosion. The most common method of designing an
experiment is called block design. In a block design, we
place a fixed number of people in every category
combination. Ideally, we would like about 30 people in
each category combination.
- If we have 1 factor with 2 levels, we need 2 x 30 = 60
participants.
- If we have 2 factors with 2 levels, we need 2 x 2 x 30 = 120
participants.
- If we have 2 factors, the first with 2 levels and the second with 5
levels, we need 2 x 5 x 30 = 300 participants.
- If we have 3 factors, each with 4 levels, we need 4 x 4 x 4 x 30 =
1920 participants.
- If we have 3 factors, the first with 5 levels, the second with 7
levels, and the third with 11 levels, we need 5 x 7 x 11 x
30 = 11550 participants.
- If we have 4 factors, each with 3 levels, we would need 3 x 3 x 3
x 3 x 30 = 2430 participants.
- If we have 7 factors, the first of which has 3 levels, and each of
the other 6 factors has 20 levels, we would need 3 x 20 x
20 x 20 x 20 x 20 x 20 x 30 = 5,760,000,000 participants,
or more people than there are on earth!
It is therefore not difficult to understand why most experimental
designs involve only 1, 2, or at most 3 factors.
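The participant counts above are the product of the number of levels of each factor, times the people per category combination. A minimal sketch (the function name `participants_needed` is hypothetical; `math.prod` requires Python 3.8 or later):

```python
import math

def participants_needed(levels_per_factor, per_cell=30):
    """Participants for a design with every combination of factor levels."""
    return math.prod(levels_per_factor) * per_cell

designs = {
    "1 factor, 2 levels": [2],
    "2 factors, 2 levels each": [2, 2],
    "2 factors, 2 and 5 levels": [2, 5],
    "3 factors, 4 levels each": [4, 4, 4],
    "3 factors, 5, 7, and 11 levels": [5, 7, 11],
    "4 factors, 3 levels each": [3, 3, 3, 3],
    "7 factors, 3 levels then six with 20": [3] + [20] * 6,
}

for description, levels in designs.items():
    print(f"{description}: {participants_needed(levels):,}")
```

Each added factor multiplies the total, which is the combinatorial explosion the text warns about.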