IIMC Long Duration Executive Education
Executive Programme in Business Management
Statistics for Managerial
Decisions
Sampling Theory
Prof. Saibal Chattopadhyay
IIM Calcutta
A Brief Review
• Uncertainty and Randomness: Theory of Probability
• Decision Making Under Uncertainty: Utility Theory
• Random Variables & Probability Distributions:
Binomial, Poisson, Normal, Exponential
• Joint Distribution of Two Random Variables: Marginal Distributions, Mean, Variance,
Covariance, Correlation Coefficient,
Independence of random variables
• Regression Approach to the analysis of
bivariate data – Curve fitting and Least Squares
Principle
Sampling Theory
• Census Vs. Sampling
• Judgment Sampling Vs. Probability
Sampling
Different Probability Sampling Procedures
• Simple Random Sampling – With
Replacement (SRSWR) & Without
Replacement (SRSWOR)
• Stratified Random Sampling
• Systematic Sampling
Preliminary Concepts
• Finite Population:
N units having values Y1, Y2, …, YN
• Parameter:
A function of the population values
Examples:
• Population Mean = $\mu = \frac{1}{N}\sum_{i=1}^{N} Y_i$
• Population SD = $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (Y_i - \mu)^2}$
• Population Proportion = P
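The three parameters above can be computed directly for a small population. The sketch below uses a hypothetical population of five values and an assumed attribute ("value greater than 4") for the proportion; neither comes from the slides.

```python
from math import sqrt

# Hypothetical finite population of N = 5 values (illustrative only).
population = [2, 4, 4, 6, 9]
N = len(population)

# Population mean: sum of the values divided by N.
mu = sum(population) / N

# Population SD: note the divisor is N, not N - 1.
sigma = sqrt(sum((y - mu) ** 2 for y in population) / N)

# Population proportion of units having an assumed attribute (value > 4).
P = sum(1 for y in population if y > 4) / N

print(f"mu = {mu}, sigma = {sigma:.4f}, P = {P}")
```

Each quantity depends only on the N population values, which is what makes it a parameter rather than a statistic.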
Simple Random Sampling With
Replacement (SRSWR)
• n units to draw from N units
• Unit drawn is returned before next draw
• All possible choices are equally likely
• $N^n$ possible samples of size n each
• Each sample has probability $1/N^n$
• Same unit may repeat in the same sample
• Values of sampled units are random variables !
SRSWR
Denote the sample values as y1, y2, …, yn.
Consider y1 (the first sample value)
This could be any one of the N values of
the population
y1 takes each of the values Y1, Y2, …, YN
with probability 1/N.
Thus
P(y1 = Y1) = P(y1 = Y2 ) = …. = P( y1 = YN )
= 1/N.
SRSWR
What about y2?
Sampling done with replacement;
Composition of the population unchanged;
Second sample value y2 is identically
distributed as y1
True for all subsequent sample values
• Sample Values are identically distributed
P(yi = Y1) = P(yi = Y2 ) = …. = P( yi= YN ) = 1/N,
for all i = 1, 2, …n.
SRSWR
Are the sample values independent?
P( yi = Y1 and yj = Y2) = $1/N^2$ (for i ≠ j)
P( yi = Y1) = 1/N & P( yj = Y2) = 1/N
⇒ The joint probability equals the product of the two marginal probabilities, so yi and yj are independent
True for all pairs of values
• Sample Values are identically distributed
• Independent and identically distributed
(IID) random variables
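The IID property under SRSWR can be checked by brute force on a tiny case. The sketch below enumerates every ordered sample for a hypothetical population with N = 5 and n = 2, working with unit indices so that repeated values do not blur the events; the helper name prob is an illustrative choice.

```python
from fractions import Fraction
from itertools import product

N, n = 5, 2  # hypothetical population size and sample size

# Every ordered sample of unit indices; with replacement there are N**n of them,
# each equally likely.
samples = list(product(range(N), repeat=n))
p_sample = Fraction(1, N ** n)

def prob(event):
    """Exact probability of an event, summed over the equally likely samples."""
    return sum(p_sample for s in samples if event(s))

p_first = prob(lambda s: s[0] == 0)                # P(y1 = Y1)
p_second = prob(lambda s: s[1] == 1)               # P(y2 = Y2)
p_joint = prob(lambda s: s[0] == 0 and s[1] == 1)  # P(y1 = Y1 and y2 = Y2)

print(len(samples), p_first, p_second, p_joint)
print("independent:", p_joint == p_first * p_second)
```

Both marginal probabilities come out as 1/N and the joint probability as their product, which is the independence argument on the slide.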
SRS Without Replacement (SRSWOR)
• n units to draw at random from N units
• Unit once drawn is not returned before
drawing the next unit
• All possible choices are equally likely
• $\binom{N}{n}$ possible samples of size n each
• Each sample has probability $1/\binom{N}{n}$
• Units in a sample are all distinct
• Values of sampled units are random
variables !
SRSWOR
Are the sample units still identically
distributed ?
For y1 the distribution is same as SRSWR
What about y2 ?
P(y2= Y1 | y1 = Yi) = 0 if i = 1;
= 1/(N-1) otherwise
P(y2 = Y1) = (1/N).0 + (N-1).(1/N).(1/(N-1))
= 1/N, same as in SRSWR !
• Yes; units are identically distributed
SRSWOR
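• Are the sample units still independent?
P( y1 = Y1, y2 = Y2) = 1/[N(N-1)],
but
P(y1 = Y1) = 1/N = P(y2 = Y2), so the product of the marginals is $1/N^2$, which differs from 1/[N(N-1)]
⇒ y1 and y2 are not independent
True for all pairs of sample values
• No - Sample units are not independent
What about their dependence?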
What about their dependence?
SRSWOR
Are the sampled units uncorrelated?
• No; Covariance between any two of them
is $-\sigma^2/(N-1)$.
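The SRSWOR results (identical marginal distributions, negative covariance) can be verified exactly on a small case. The sketch below assumes a hypothetical population of five values and enumerates all ordered pairs of distinct units for the first two draws.

```python
from fractions import Fraction
from itertools import permutations

population = [2, 4, 4, 6, 9]  # hypothetical values Y1..YN
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((y - mu) ** 2 for y in population) / N  # population variance, divisor N

# Ordered pairs (first draw, second draw) of distinct unit indices:
# N(N-1) equally likely outcomes.
draws = list(permutations(range(N), 2))
p = Fraction(1, len(draws))

# Marginal probability that the second draw is unit 1 (index 0).
p_y2_is_unit1 = sum(p for i, j in draws if j == 0)

# Exact covariance between the first and second sample values.
cov = sum(p * (population[i] - mu) * (population[j] - mu) for i, j in draws)

print(p_y2_is_unit1, cov, -sigma2 / (N - 1))
```

The marginal probability equals 1/N and the covariance equals the negative of the population variance divided by N - 1, so the draws are identically distributed but negatively correlated.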
What is a Statistic?
• A function of the sample values;
Examples
• Sample Mean
• Sample SD
• Sample Proportion
SRSWOR
• A Statistic is a Random Variable
• Probability Distribution of a Statistic – Called Sampling Distribution
• Mean of a Statistic – Called Expectation
• SD of a Statistic – Called Standard Error (SE)
• Role of SE – compares efficacy of different sampling procedures
• Smaller the SE, better is the sampling
Sampling Distribution of Sample Mean in
Simple Random Sampling
• Finite Population of size N
• Population mean = $\mu$ and SD = $\sigma$
• Random Sample of size n drawn (WR/WOR)
• Statistic is Sample Mean
• Expectation = $\mu$ (both SRSWR and SRSWOR)
• SE = $\sigma/\sqrt{n}$ for SRSWR
• SE = $(\sigma/\sqrt{n}) \cdot \sqrt{\mathrm{FPC}}$ for SRSWOR
• FPC = Finite Population Correction = $(N-n)/(N-1)$
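These expectation and SE formulas can be confirmed by listing every possible sample. The sketch below does this exactly (using fractions) for a hypothetical population with N = 5 and n = 2; the function name sampling_moments is illustrative.

```python
from fractions import Fraction
from itertools import permutations, product
from math import sqrt

population = [2, 4, 4, 6, 9]  # hypothetical values
N, n = len(population), 2
mu = Fraction(sum(population), N)
sigma2 = sum((y - mu) ** 2 for y in population) / N

def sampling_moments(samples):
    """Exact expectation and variance of the sample mean over equally likely samples."""
    means = [Fraction(sum(population[i] for i in s), n) for s in samples]
    expectation = sum(means) / len(means)
    variance = sum((m - expectation) ** 2 for m in means) / len(means)
    return expectation, variance

e_wr, v_wr = sampling_moments(list(product(range(N), repeat=n)))   # SRSWR
e_wor, v_wor = sampling_moments(list(permutations(range(N), n)))   # SRSWOR
fpc = Fraction(N - n, N - 1)

print("Expectation:", e_wr, e_wor, "population mean:", mu)
print("SE (WR): ", sqrt(v_wr), "formula:", sqrt(sigma2 / n))
print("SE (WOR):", sqrt(v_wor), "formula:", sqrt(sigma2 / n * fpc))
```

The enumeration reproduces the population mean as the expectation under both schemes, and the squared standard errors match the formulas with and without the finite population correction.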
Comparing SRSWR and SRSWOR
• For n =1, FPC = 1, so SRSWR and
SRSWOR are equivalent
• For n > 1, FPC < 1, so SRSWOR is better
than SRSWR
• Limiting Behaviour: As N becomes large
with n fixed, both sampling methods are
asymptotically equivalent
---- Intuitively Obvious !
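The limiting behaviour can be seen numerically. The sketch below fixes n = 10 (an arbitrary choice) and prints the ratio of the SRSWOR standard error to the SRSWR standard error, which is the square root of the FPC, for increasing N.

```python
from math import sqrt

n = 10  # fixed sample size (illustrative)

# Ratio SE(SRSWOR) / SE(SRSWR) = sqrt((N - n) / (N - 1)) for growing N.
ratios = {N: sqrt((N - n) / (N - 1)) for N in (20, 100, 1000, 100000)}

for N, r in ratios.items():
    print(f"N = {N:>6}: ratio = {r:.5f}")
```

The ratio stays below 1 and climbs towards 1 as N grows, so the advantage of sampling without replacement vanishes for very large populations.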
Can we use SRS always?
• SRS is too fair !
• Ignores typical composition of a population
Example: Suppose the Population is
characterized by sex – males and females
• N1 males and N2 females in the population
• N1 ‘too large’ compared to N2, say at least
80% are males
• Will an SRS be representative here?
Drawback of SRS
• Possibly not; most likely the sample will have too few females; may be none at all !
• Not a representative sample, at least for a social survey
• Need representation of all sections of the society
• How can we ensure that?
Divide the population into several parts!
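To put a number on "may be none at all", the chance that an SRSWOR contains no females is a ratio of two counts of samples. The figures below (N = 100, 80 males, n = 5) are hypothetical.

```python
from math import comb

N, N1, n = 100, 80, 5  # hypothetical: 100 units, 80 males, sample of 5

# Number of all-male samples divided by the number of all possible samples.
p_no_females = comb(N1, n) / comb(N, n)

print(f"P(no females in the sample) = {p_no_females:.3f}")
```

With these assumed numbers the probability is roughly one in three, so a small simple random sample can easily miss the minority group entirely.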
Stratified Random Sampling
Population has N units: N1 of first type (males), N2
= N – N1 of a second type (females)
Total Sample Size = n
• Divide n into two parts, n1 and n2
• Draw n1 units from the N1 units of the first type
• Independently draw n2 units from the N2 units of the second type
• Use SRS for drawing the units from the subpopulations (strata)
• Combine the two sub-samples to get a Stratified
Random Sample of size n = n1 + n2
How to Choose n1 and n2 ?
Proportional Allocation:
• Choose n1 and n2 proportional to the subpopulation sizes N1 and N2
• n1 = (n/N).N1 & n2 = (n/N).N2
Optimum Allocation:
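• Choose n1 and n2 proportional to $N_1\sigma_1$ and $N_2\sigma_2$, i.e. to the subpopulation SDs $\sigma_1$ and $\sigma_2$ weighted by the stratum sizes (Neyman allocation); for strata of equal size this is allocation proportional to the SDs alone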
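A minimal sketch of stratified sampling with the two allocation rules follows. The strata (80 males, 20 females), the SDs passed to the second function, and all function names are hypothetical. The optimum rule is written in the standard Neyman form, with each SD weighted by its stratum size, and simple rounding is used, which in general may not add up to exactly n.

```python
import random

def proportional_allocation(n, sizes):
    """n_h = (n / N) * N_h, rounded to whole units."""
    N = sum(sizes)
    return [round(n * Nh / N) for Nh in sizes]

def neyman_allocation(n, sizes, sds):
    """n_h proportional to N_h * sigma_h (standard Neyman form), rounded."""
    weights = [Nh * sd for Nh, sd in zip(sizes, sds)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]

def stratified_sample(strata, allocation, rng):
    """Draw an independent SRSWOR of size n_h from each stratum and combine them."""
    sample = []
    for units, nh in zip(strata, allocation):
        sample.extend(rng.sample(units, nh))
    return sample

rng = random.Random(0)
males = [("M", i) for i in range(80)]
females = [("F", i) for i in range(20)]

alloc = proportional_allocation(10, [len(males), len(females)])
sample = stratified_sample([males, females], alloc, rng)
optimum = neyman_allocation(10, [80, 20], [1.0, 4.0])  # assumed stratum SDs

print("proportional:", alloc, "optimum:", optimum)
print(sample)
```

Proportional allocation guarantees both groups appear in the sample in their population shares, while the optimum rule shifts units towards the more variable stratum.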
Systematic Sampling
• Units are arranged in a sequence
• N = n.k; numbered 1 – N; sample size = n
• Divide the population into n groups of k
consecutive units each
• Select one unit at random from the first group
with units 1 – k
• Select every k-th unit thereafter
• k possible samples; probability of each=1/k
• Gives a sample uniformly spread over the
population
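The procedure above can be sketched in a few lines. The function name and the optional start argument are illustrative, and the sketch assumes N is an exact multiple of n, as on the slide.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return the unit numbers (1..N) of a systematic sample of size n."""
    if N % n != 0:
        raise ValueError("this sketch assumes N = n * k exactly")
    k = N // n
    if start is None:
        start = rng.randint(1, k)  # random unit from the first group 1..k
    return [start + j * k for j in range(n)]  # every k-th unit thereafter

N, n = 20, 5
sample = systematic_sample(N, n, rng=random.Random(1))

# The k possible samples, one per starting unit; together they cover every unit once.
all_samples = [systematic_sample(N, n, start=r) for r in range(1, N // n + 1)]

print(sample)
print(all_samples)
```

Only the starting unit is random, so there are exactly k possible samples and each is selected with probability 1/k.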
Central Limit Theorem
• Sampling from a normal population
• Mean = $\mu$ and SD = $\sigma$
• SRS of size = n (With Replacement)
• Statistic = Sample Mean
• Expectation = $\mu$; SE = $\sigma/\sqrt{n}$
• $Z = (\text{Sample Mean} - \mu)/(\sigma/\sqrt{n})$ is N(0, 1)
What happens if sampling is done from a non-normal distribution?
Distribution of sample mean is no longer normal
though formulae for Expectation & SE are still
true
Can we say anything more?
Yes, provided the sample size n is ‘large’ !
How large is ‘large’ ? As a rule of thumb, $n \ge 30$ will do !!
What happens if n is ‘large’ ?
• Distribution of sample mean is still normal, but
only approximately
• Approximation is better and better as n becomes
larger and larger
• True for any underlying distribution with finite
variance from which sampling is done
⇒ Central Limit Theorem
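A small simulation illustrates the theorem for a clearly non-normal population. The sketch below samples from an exponential distribution with mean 1 and SD 1 (an assumed example), uses n = 30, and checks how closely the standardised sample mean behaves like N(0, 1). The seed and the number of replications are arbitrary.

```python
import random
from math import sqrt

rng = random.Random(2024)
n, reps = 30, 4000
mu, sigma = 1.0, 1.0  # exponential distribution with rate 1: mean 1, SD 1

z_values = []
for _ in range(reps):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_mean = sum(sample) / n
    z_values.append((sample_mean - mu) / (sigma / sqrt(n)))

mean_z = sum(z_values) / reps
sd_z = sqrt(sum((z - mean_z) ** 2 for z in z_values) / reps)
coverage = sum(1 for z in z_values if abs(z) <= 1.96) / reps

print(f"mean of Z = {mean_z:.3f}, SD of Z = {sd_z:.3f}")
print(f"share of |Z| <= 1.96: {coverage:.3f} (N(0, 1) gives about 0.95)")
```

The standardised means have mean near 0, SD near 1, and about 95% of them fall within 1.96 of zero, even though the individual observations are strongly skewed.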
Multistage Sampling Methodologies
• Generally used to counter presence of nuisance
parameter (unknown)
• Used in situations where the optimal sample size
required is not known a-priori
• Sampling done in two or more stages
• First Stage: Select a ‘small’ pilot sample of size m
• Use this pilot sample to get an estimate E of the
unknown optimal sample size
• STOP if E does not exceed m
• Second Stage: Select a second sample of size
E – m otherwise
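The two-stage idea can be sketched as follows. The slides do not say how E is computed, so the rule used here is an assumption: the goal is taken to be estimating a mean to within a margin d, the nuisance parameter is the unknown SD, and E is the smallest n with z.s/√n ≤ d, where s is the pilot SD and z = 1.96. A Stein-type procedure would use a t quantile instead of 1.96. All names are illustrative.

```python
import math
import random
import statistics

def two_stage_sample(draw, m, d, z=1.96):
    """Two-stage sketch: pilot of size m, then top up to the estimated size E."""
    pilot = [draw() for _ in range(m)]   # first stage
    s = statistics.stdev(pilot)          # pilot estimate of the unknown SD
    E = math.ceil((z * s / d) ** 2)      # assumed rule for the required sample size
    if E <= m:
        return pilot, E                  # STOP: the pilot sample already suffices
    second = [draw() for _ in range(E - m)]  # second stage of size E - m
    return pilot + second, E

rng = random.Random(42)
sample, E = two_stage_sample(lambda: rng.gauss(50, 10), m=15, d=2.0)
print("estimated size E =", E, "| final sample size =", len(sample))
```

The final sample size is the larger of m and E, so it is itself random because it depends on the pilot data.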
Some Standard Sampling Distributions
1. Chi-Square Distribution
• n IID N(0, 1) variables: Z1, Z2, …, Zn
• Y = Sum of Squares of Z1, Z2, …, Zn
= $Z_1^2 + Z_2^2 + \dots + Z_n^2$
• Y is Chi-Square with n degrees of freedom
(d.f)
• Mean = n; SD = $\sqrt{2n}$
• $(Y - n)/\sqrt{2n}$ is approximately Standard Normal for large n
• Distribution is positively skewed; probability
table available
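The construction of the Chi-Square variable can be simulated directly. The sketch below uses n = 5 degrees of freedom, with an arbitrary seed and number of replications, and compares the simulated mean and variance with n and 2n.

```python
import random

rng = random.Random(7)
n, reps = 5, 20000  # degrees of freedom and number of simulated values

# Each Y is the sum of squares of n independent N(0, 1) draws.
ys = [sum(rng.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]

mean_y = sum(ys) / reps
var_y = sum((y - mean_y) ** 2 for y in ys) / reps

print(f"simulated mean = {mean_y:.2f} (theory: {n})")
print(f"simulated variance = {var_y:.2f} (theory: {2 * n})")
```

The simulated values are all non-negative with a long right tail, which is the positive skewness noted above.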
Some Standard Sampling Distributions
2. t – distribution
• Z is N(0, 1)
• Y is Chi-Square with d.f = n
• Z and Y are independently distributed
• Sampling Distribution of $t = Z/\sqrt{Y/n}$ is
called the t-distribution with d.f = n
• Similar to N(0, 1); Approaches N(0,1) as
the d.f. n becomes large ($n \ge 30$); Probability
tables for n < 30 available
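The definition of t can be simulated in the same way. The sketch below uses n = 10 degrees of freedom (an arbitrary choice) and compares the simulated variance with the known value n/(n - 2), and the share of values beyond ±1.96 with the 5% that N(0, 1) would give.

```python
import random
from math import sqrt

rng = random.Random(11)
n, reps = 10, 20000  # degrees of freedom and number of simulated values

def draw_t():
    """One t value: Z / sqrt(Y / n) with Z ~ N(0, 1) and Y ~ Chi-Square(n), independent."""
    z = rng.gauss(0, 1)
    y = sum(rng.gauss(0, 1) ** 2 for _ in range(n))
    return z / sqrt(y / n)

ts = [draw_t() for _ in range(reps)]
mean_t = sum(ts) / reps
var_t = sum((t - mean_t) ** 2 for t in ts) / reps
tail = sum(1 for t in ts if abs(t) > 1.96) / reps

print(f"simulated variance = {var_t:.3f} (theory: {n / (n - 2):.3f})")
print(f"share beyond +/-1.96 = {tail:.3f} (N(0, 1) gives about 0.05)")
```

The heavier tails relative to N(0, 1) are why separate probability tables are used for small degrees of freedom.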
Some Standard Sampling Distributions
3. F – distribution
• Y1 is Chi-Square with d.f = n1
• Y2 is Chi-Square with d.f = n2
• Y1 and Y2 are independently distributed
• Sampling distribution of
F = (Y1/n1)/(Y2/n2)
is F distribution with d.f = (n1, n2)
⇒ Useful for Hypothesis-Testing problems
when we have samples available from a
normal population (exact or approximate)
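The F ratio can be simulated from two independent Chi-Square variables. The sketch below uses d.f = (4, 10), an arbitrary choice, and compares the simulated mean with the known value n2/(n2 - 2), which holds for n2 > 2.

```python
import random

rng = random.Random(3)

def chi_square(df):
    """Sum of squares of df independent N(0, 1) draws."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(df))

n1, n2, reps = 4, 10, 20000

# F = (Y1 / n1) / (Y2 / n2) with Y1 and Y2 drawn independently.
fs = [(chi_square(n1) / n1) / (chi_square(n2) / n2) for _ in range(reps)]
mean_f = sum(fs) / reps

print(f"simulated mean of F = {mean_f:.3f} (theory: {n2 / (n2 - 2):.3f})")
```

Every simulated value is positive, since F is a ratio of two positive quantities.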
References
Text Book for the Course
• Statistical Methods in Business and Social
Sciences: Shenoy, G.V. & Pant, M.
(Macmillan India Limited)
Suggested Reading
• Applications of Sequential Methodologies:
Mukhopadhyay, Nitis, Datta, Sujay &
Chattopadhyay, Saibal. (Marcel Dekker,
New York, 2004).