Download Week2_2015471KB Jan 19 2015 01:10:45 PM

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

History of statistics wikipedia , lookup

Taylor's law wikipedia , lookup

Sampling (statistics) wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
STAT 421
Week 2
1
Summary of concepts

We have focused only on population
units




Target population
Sampling frame
Selecting a probability sample
We have not discussed any
characteristics of a population element
2
Characteristics of the target
population

We are really interested in making
statements that summarize
characteristics of the target population,
i.e. population parameters



The average school loan debt owed by
currently enrolled ISU students
The total surface area of county parks in
the US
The fraction of Des Moines households that
fall below the poverty line
3
Data value for a target
population element

We calculate these population
summaries from all of the data values
associated with each population
element, i.e., from
yi = the value of characteristic for element i
for i = 1, 2, …, N
4
Data value for a target
population element


A data value is
yi = the value of characteristic for element i
Examples



Element = county in US
yi =
Element = student enrolled at ISU
yi =
Element = Des Moines household
yi =
5
Population distribution of y

y is often referred to as the …
Variable of interest
There are N data values for y
 There is one value of y for each of the N
elements in the target population
There may be fewer than N unique values
 So y has a DISCRETE distribution
How do we summarize discrete distributions?
 Graphical Summary (Histogram)
 Numerical Summary (Population Parameters)



6
Population distribution of y

Histogram



Horizontal axis = all UNIQUE values of y
Vertical axis = frequency of elements with
a specific value of y
To make a histogram, we start with a
table with unique values of y and the
frequency with which they occur
7
Population distribution of y

Histogram for yi = number of courses a
2003 Stat 421 student i was registered for
N = 24
Number of Number of
courses students
y
c(y)
1
0
2
2
3
7
4
6
5
3
6
6
Histogram
8
7
Number of students c(y)

6
5
4
3
2
1
0
1
2
3
4
Number of courses y
5
6
8
Population distribution of y

Discrete probability distribution of y




Horizontal axis = all UNIQUE values of y
Vertical axis = RELATIVE frequency of elements
with a specific value of y , called P(y)
Like a histogram with the frequencies
(vertical axis) divided by N
Usually start by making a table of


Unique values of the variable, y
Relative frequency of the unique values of y, P(y)
9
Population distribution of y

Number of courses example
Probability Distribution
Relative frequency
8
P(y)
0.08
0.29
0.25
0.13
0.25
7
Number of students
Number of
courses
c(y) / N
y
1
0/24
2
2/24
3
7/24
4
6/24
5
3/24
6
6/24
6
5
4
3
2
1
0
1
2
3
4
5
6
Number of courses y
10
Population distribution of y

Numerical Summary: Population
Parameters
Can also describe distribution of y through summary
descriptions called parameters (more later)


Total for y = (100 courses for all students)
Mean of y = (100/24=4.2 courses per
student)
11
Population distribution of y


The population distribution of y is what we
are trying to describe when we draw a
sample, collect data, and calculate an
estimate for a summary parameter
The population distribution of y is FIXED



No matter what sample design we choose
Regardless of the sample we draw from a given
design
We do not assume a parametric distribution

Forget normal distributions assumed in other
classes (for now)
12
Survey design

Survey design involves selecting
methods for all phases of the survey
process




Objectives
Sample design
Data collection approach
Analysis approach
13
Response process
The process of collecting data
from sampled units, e.g., via a
questionnaire or observation form
Response process


Assume we have selected a probability
sample from a frame
The next step is to collect data from each
sampled element


This will lead to values for yi for each element in
the sample
We will act like we only collect one
characteristic from each sampled element,
but usually, we are collecting dozens or even
100s of different characteristics from each
element
15
Problems in response process

We rarely get

complete data from all sampled units


We may fail to obtain data through some part
of the response or data collection process (e.g.,
nonresponse)
or data that are free of error

Data obtained may not be accurate (e.g., recall
bias)
16
Problems in response process

Outcomes for a sample selected from a frame


Can not locate/contact a sampled unit (e.g.,
household) – unreachable
Locate/contact a unit, but can’t get any data



Sampled person refuses to participate - nonresponse
Sampled person is unable to participate (illness)incapable
Collect data on some, but not all characteristics


Respondent doesn’t answer all questions
Data collector forgets to record a variable
17
Response process and
eligibility


Recall that the frame (and thus sample) may
contain units that do not belong in the target
population
In this case, we need to “screen” the unit to
determine if the unit is eligible to be included
in the survey

Eligible means that the unit belongs to the target
population
18
Summary of sampling and
response process problems

Sampling frames (and hence, the
selected sample) may



Include ineligible units
Exclude eligible units
The response process may

Get incomplete data due to



Unreachable
Nonresponse
Incapable
19
Summary of sampling and
response process problems


Sample process: includes only those
units that were in the sampling frame
Response process: includes only those
that would have responded



Were available during interview period,
Were willing to be interviewed, and
Were physically/mentally capable of
providing responses
20
Sampled population

The sampled population is the
collection of all possible units that
would be the outcome of the sampling
and response process


Could have been chosen in a sample, and
Would have provided data during the
response process if sampled
21
Target pop vs. sampling frame
vs. sampled pop (Fig 1.1, p. 4)
Ideally, these 3 representations of the population
are completely overlapping and
nonrespondents and ineligibles do not occur
22
Survey error
Sampling error
Nonsampling error
Survey error model
Total Survey Error =
in an Estimate
Sampling
Error
Due to selecting
a sample instead
of surveying the
entire population
+
Nonsampling
Error
Due to mistakes or
systematic deficiencies
in sampling,
response process,
data processing
Biemer & Lyberg, 2003, Introduction to Survey Quality
24
Sampling error

Sampling error is the difference between
an estimate of a population parameter
(calculated with data from a sample) and the
true population parameter being estimated



True population mean from entire distribution
Estimated mean calculated from the sample
In a sample survey, we are collecting data
from a subset of the population, i.e., we do
not observe the whole population

Estimate for any one sample is unlikely to
perfectly match the population parameter
25
Example





Population mean of N = 28 students for
y = number of textbooks purchased by 2003 Stat 421
students: 4.21 books per student
Randomly select a sample of n = 4 students
Data on number of text books purchased:
2, 6, 3, 5
Estimated population mean: 4.00 books per
student
Two other samples yielded estimates of 3.50 and
4.75 books per student
26
Nonsampling error

Nonsampling error includes all errors in
data collection, processing and estimation
except sampling error
(1) Selection Errors (mismatch between the target
population and the sampled population)

Frame error


Mismatch between target population and sampling frame
Nonresponse error

Inability to obtain data from a sampled unit
27
Nonsampling error
(2)Specification error

Discrepancy between concept of interest and how question
is phrased (get incorrect data because of problem in
question wording)
(3)Measurement error

Errors in data during interview or measurement process
(respondent provided false/inaccurate info, interviewer
made a recording mistake)
(4)Processing error

Errors in a computer program to process data or calculate
estimates that generate an error in the value of the estimate
28
Reducing total survey error


Sampling error
 Choose a sample design that produces precise estimates
Nonsampling error



Choose survey methods that encourage complete and
accurate responses (nonresponse)
Choose a frame close to target population (frame)
Be very careful with questionnaire development
(specification)

Use trained and monitored interviewers (measurement)

Use quality control for computer programs (processing)
29
Census vs. sample

Census: collect data from all N members of
a population



Sampling error vanishes
But nonsampling error for a census is often much
larger than total survey error for a sample
Sample: devote more effort to collecting high
quality data from fewer sampled units


Control sampling error via good sample design
Implement more expensive, but more accurate
data collection methods
30
Ch 1, problem 2







A student wants to estimate the percentage of mutual funds whose
shares went up in price last week. She selects every tenth fund listing
in the mutual fund pages of the newspaper. She calculates the
percentage of those in which the share price increased.
Target population
Population unit
Sampling frame
Sampling unit
Define yi , the value of characteristic for element i
Possible sources of selection errors
Possible sources of measurement errors
31
Ch 1, problem 1
The article “What Readers Say About Marijuana” reports that
“more than 75% of the readers who took part in an informal
PARADE telephone poll say marijuana should be as legal as
alcoholic beverages” (Parade, 31 July 1994, 16).
The telephone poll was announced on page 5 of the June 12
issue. Readers were instructed to “call 1-900-773-1200, at
75 cents a call, if you would like to answer the following
questions. Use touch-tone phones only. To participate, call
between 8 a.m. EDT [Eastern Daylight Time] on Saturday,
June 11, and midnight EDT on Wednesday, June 15.”
32
Ch 1, problem 1







Target population
Population unit
Sampling frame
Sampling unit
Define yi , the value of characteristic for element i
Possible sources of selection errors
Possible sources of measurement errors
33
Ch 2: Probability Sampling
and SRS

Establish basic notation and concepts

Population distribution of y


Sampling distribution of an estimator under a design
(this is not the sampled population!!)


This is the object of inference
Use this to evaluate quality of the estimate and make inference
Apply these concepts and learn about estimation
through SRS




Selecting a SRS sample
Estimating population parameters (means, totals,
proportions)
Estimating standard errors and confidence intervals
Determining the sample size
34
Assume ideal setting
(until further notice)


Only sampling error, no nonsampling error
Sampled population = target population

Sampling frame is a perfect representation of
target pop



Data are collected on all sampled units


Sample unit = element
List of all elements, only those that are in target pop
No frame or nonresponse errors
Measurement process is perfect

All responses and measurements are accurate
35
Class example


Suppose we want to make inferences
about 2003 Stat 421 students
Interested in three population
parameters



The average course load of students
The proportion of students who have a cell
phone
The total number of text books purchased
by students this semester
36
Return to population
distribution
Indices for elements

Each element has a unique label or index



U = index set (set of labels) for all elements
in the population


Usually label an element by i = 1, 2, …, N
Alternatively, a label could be name, or SSN
U = {1, 2, …, N }
Sampling frame is a list of labels or indices for
each element in the population

Select indices in the sampling process
38
Characteristic of interest



y is the variable or characteristic of
interest
yi = characteristic of interest for unit i
Set of y values in population


y1 , y2 , …, yN
Class example (3 y ’s)



id
1
2
3
…
27
28
Number of courses
Number of textbooks purchased
Whether or not have cell phone
courses
4
4
2
…
6
5
texts
5
3
2
…
6
6
cell
1
1
1
…
0
0
39
Population distributions for number of
courses, number of books, whether/not
have cell phone
40
Population distribution
parameters

Can also characterize a population
distribution with population
parameters




Mean of y (proportion if y is binary)
Total for y
Variance of y (need this for sample size
determination and expressing precision of
estimates)
Sometimes standard deviation, quantiles
41
Symbols for population
distribution parameters


y U = mean or expected value of y
p = proportion of population having a
particular characteristic





Mean of a binary (0, 1) variable
t = population total of y
S 2 = variance of y
S = standard deviation of y
 = generic parameter for population
distribution of y
42
Parameter: population mean

Examples



Average number of miles driven per week by
adults in US
Average number of errors per client account
Population mean of y (or expected value)
N
E [Y ]  y U 


yi

i
1
N
Measure of central tendency (middle of distn)
Units for the mean is y-units per element
43
Parameter: population mean

What is the population mean number of
textbooks purchased by students?
N
E[Y ]  yU 
 yi
i 1
N

44
Parameter: population
proportion

Proportion (p) of population having a
particular characteristic

Mean of binary (0, 1) variable

1 , if unit i has characteristic
yi  
0 , if unit i doesn' t have characteristic
N

p
yi

i
1
N
45
Parameter: population
proportion


What proportion of students have a cell
phone?
Data: 18 students have a cell phone
N
p
 yi
i 1
N

46
Parameter: population total

Examples

Number of households in a region



Number of deer in Iowa



yi =number of households in area i
N = number of areas in the region
yi =number of deer observed in area i
N = number of observation areas in Iowa
Population total of Y
N
t   y i  Ny U
i 1

Total number of y-units in the population
47
Parameter: population total


What is the total number of books
purchased by students in this class?
Data:
N
t   yi  NyU 
i 1
48
Relationships for population
mean, proportion, total
N
yU 
 yi
i 1
N
t

or p
N
N
t   yi  NyU or Np
i 1
49
Parameter: population
variance

Population variance
N
V [Y ]  S 2 

2
(
y

y
)
 i U
i 1
N 1
Measure of spread or variability in
population’s response values

S is the standard deviation of y

NOT the standard error of an estimate, but is used in the
formula for the standard error and confidence interval of
an estimate
50
Parameter: population
variance

id
1
2
3
…
27
28
What is the population variance for the
number of courses enrolled in per
student? For having a cell phone?
courses
4
4
2
…
6
5
texts
5
3
2
…
6
6
cell
1
1
1
…
0
0
N
V [Y ]  S 2 
2
(
y

y
)
 i U
i 1
N 1
51
Summary of notation for the
population distribution




Basic pop unit: element (i )
Number of units or size of pop: N
Values of the random variable: yi
Parameters: characterize the population distribution
 Mean y U
 Proportion (mean of binary variable) p
 Total t
2 and standard deviation S
 Variance S

Sometimes we will use
population parameter
 to denote a generic
52
Summary of the population
distribution

Population distribution of characteristic of y
is not known, but is the object of inference in
conducting a survey


Select a probability sample, collect data, and
estimate unknown population distribution
parameters using data collected from sample
Population distribution and its parameters are
fixed (constant)

Values never change with design, sample, or
estimator
53