Ch 2: probability sampling, SRS
- Overview of probability sampling
- Establish basic notation and concepts
  - Population distribution of Y: object of inference
  - Sampling distribution of an estimator under a design: assessing the quality of the estimate used to make inference
- Apply these to SRS
  - Selecting a SRS sample
  - Estimating population parameters (means, totals, proportions)
  - Estimating standard errors and confidence intervals
  - Determining the sample size

Assume ideal setting
- Sampled population = target population
  - Sampling frame is complete and does not contain any OUs beyond the target pop
  - No unit nonresponse
- Measurement process is perfect
  - All measurements are accurate
  - No missing data (no item nonresponse)
- That is, nonsampling error is absent

Survey error model
- Total Survey Error (assessed via bias and variance) = Sampling Error + Nonsampling Error
- Sampling error
  - Due to the sampling process (i.e., we observe only part of population)
- Nonsampling error
  - Measurement error
  - Nonresponse error
  - Frame error

Probability sample
- DEFN: A sample in which each unit in the population has a known, nonzero probability of being included in the sample
- Known probability: we can quantify the probability of a SU being included in the sample
  - Assign during design, use in estimation
- Nonzero probability: every SU has a positive chance of being included in the sample
  - Proper survey estimates represent entire target population (under our ideal setting)

Probability sampling relies on random selection methods
- Random sampling is NOT a haphazard method of selection
  - Involves very specific rules that include an element of chance as to which unit is selected
  - Only the outcome of the probability sampling process (i.e., the resulting sample) is random
- More complicated than non-random samples, but provides important advantages
  - Avoid bias that can be induced by selector
  - Required to calculate valid statistical estimates (e.g., mean) and measures of the quality of the estimates (e.g., standard error of mean)

Representative sample
- Goal is to have a “representative sample”
- Probability sampling is used to achieve this by giving each OU in target population an explicit chance to be included in the sample
- Sample reflects variability in the population
  - Applies to the sample, but does not apply to the OU/SU (don’t expect each observation to be a “typical” pop unit)
- Can create legitimate sample designs that deliberately skew the sample to include adequate numbers of important parts of the variation
  - Common example: oversampling minorities, women
  - MUST use estimation procedures that take into account the sample design to make inferences about the target population (e.g., sample weights)

Basic sampling designs
- Simple selection methods
  - Simple random sampling (Ch 2 & 3)
  - Systematic sampling (2.6, 5.6)
    - Random start, take every k-th SU
  - Probability proportional to size (6.2.3)
    - Select the sample using, e.g., a random number table
    - “Larger” SUs have a higher chance of being included in sample
- Selection methods with explicit structure
  - Stratified sampling (Ch 4)
    - Divide population into groups (strata)
    - Take sample in every stratum
  - Cluster sampling (Ch 5 & 6)
    - OUs aggregated into larger units called clusters
    - SU is a cluster

Examples
- Select a sample of n faculty from the 1500 UNL faculty on campus
- Goal: estimate total (or average) number of hours faculty spend per week teaching courses
- Simple random sampling (SRS)
  - Number faculty from 1 to 1500
  - Select a set of n random numbers (integers) between 1 and 1500
  - Faculty with ids that match the random numbers are included in the sample

Examples - 2
- Systematic sampling (SYS)
  - Choose a random number between 1 and 1500/n
  - Select the faculty member with that id, and then take every k-th faculty member in the list, where the sampling interval is k = 1500/n
- SRS / SYS
  - Each faculty member has an equal chance of being included in sample
  - Under SRS, each sample of n faculty is equally likely (this does not hold for SYS, which can only produce the samples defined by its possible random starts)

Examples - 3
- Probability proportional to size (PPS)
  - With a PPS design, we assign a selection probability to each faculty member that is proportional to the number of courses taught by that faculty member that semester
  - “Size” measure = # of courses taught by faculty member
  - Faculty who teach more courses are more likely to be included in the sample, but those who teach fewer still have a positive chance of being included
  - Motivation: faculty who spend more hours on courses are more critical to getting a good estimate of total hours spent
- Data from faculty with higher inclusion probabilities will be “down weighted” relative to those with lower probabilities during the estimation process
  - Typically accomplished using weights for each observation in the dataset

Examples - 4
- Stratified random sampling (STS)
  - Organize list of faculty by college
    - Stratum = college
  - Allocate n (divide sample size) among colleges so that we select $n_h$ faculty in the h-th college
    - Sum of $n_h$ over strata equals n
  - Use SRS, e.g., to select sample in each of the college strata
    - Could use SYS or PPS rather than SRS
    - Could have different selection methods in each stratum

Examples - 5
- Cluster sampling (CS)
  - Aggregate faculty into departments
    - OU = faculty member, SU = dept
  - Select a sample of departments, e.g., using SRS
    - Very common to use PPS for selecting clusters
    - “Size” measure = number of OUs in the cluster SU
- Many variants for cluster sampling
  - After selecting clusters, may want to select a sample of OUs in the cluster rather than taking data on every OU
  - E.g., select 15 depts in the first stage of sampling, then select 10 faculty in each dept in a second stage of sampling
  - This is called 2-stage sampling

Examples - 6
- Complex sample designs (Ch 7)
  - Combine basic selection methods (SRS, SYS, PPS) with different methods of organizing the population for sampling (strata, clusters)
  - Typically have more than one stage of sampling (multi-stage design)
    - Often cannot create a frame of all OUs in the population
    - Need to select larger units first and then construct a frame
  - Stratification and systematic sampling are often used to encourage spread across the population
    - This improves chances of obtaining a representative sample
  - Costs are often reduced by selecting clusters of OUs, although cluster sampling may lead to less precision in estimates

Notation for target population
- The total number of OUs in the population (also called the universe) is denoted by N
  - Note UPPER CASE
  - Ideally for SRS, sampling frame is list of N OUs in the pop
  - EX: there are N = 4 households in our class
- Index set (labels) for all OUs in the population (or universe) is called U
  - U = {1, 2, …, N}
  - A different index set could be our names, or our SSNs
- Each person has a value for the characteristic of interest or random variable Y, the number of people in the household
  - The value of Y for household i is denoted by $y_i$
  - Values in the population are $y_1, y_2, \ldots, y_N$

Notation for sample
- Sample size is denoted by n
  - Note lower case
  - n is always less than or equal to N (n = N is a census)
- Index set (labels) for OUs in the sample is denoted by S
  - To select a sample, we are selecting n indices (labels) from the universe U, consisting of N indices for the population
  - U is our sampling frame in this simple setting
  - Labels in S may not be sequential because we are selecting a subset of U

Class example
- Suppose n = 2 households are selected from a population of N = 4 households in the class
- Randomly select sample using SRS and get 2 and 3
  - U = {1, 2, 3, 4}
  - S = ___
- The data collected on OUs in the sample are values for Y = number of people in the household
  - Data: ___

Summary of probability sampling framework
- Assumptions (for now)
  - Target population = sampling universe = sampling frame
  - Observation unit = sampling unit
- Population
  - N = finite number of OUs in the population
  - U = {1, 2, …, N} is the index set for the OUs in the population
- Sample
  - n = sample size (n is less than or equal to N)
  - S = index set for n elements selected from population of N units (S is a subset of U)

Conceptual basis for probability sampling
- Conceptual framework for selecting samples
  - Enumerate all possible samples of size n from the population of size N
  - Each sample has a known probability of being selected
    - P(S) = probability of selecting sample S
  - Use this probability scheme to randomly choose the sample
- Using the probability scheme for the samples, can determine the inclusion probability for each SU
  - $\pi_i$ = probability that a sample is selected that includes unit i

Simple example
- Population of 4 students in study group, take a random sample of 2 students
- Setting
  - U = {1, 2, 3, 4}
  - N = 4
  - n = 2
- All possible samples of size n = 2 from N = 4 elements
- Note: n < N and $S \subset U$

Simple example - 2
- All possible samples
  - S1 = {1, 2}, S2 = {1, 3}, S3 = {1, 4}, S4 = {2, 3}, S5 = {2, 4}, S6 = {3, 4}
- Design is determined by assigning a selection probability to each possible sample
  - P(S1) = 1/3, P(S2) = 1/6, P(S3) = 1/2
  - P(S4) = 0, P(S5) = 0, P(S6) = 0

Simple example - 3
- Inclusion probability definition?
- What is the probability that student 1 is included in the sample?
- Inclusion probability for student 2, 3, 4?
  - $\pi_1$ = ___, $\pi_2$ = ___, $\pi_3$ = ___, $\pi_4$ = ___
- Is this a probability sample?
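The inclusion-probability question above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it assumes the design P(S1) = 1/3, P(S2) = 1/6, P(S3) = 1/2 and zero for the remaining samples, and sums P(S) over every sample that contains unit i. The function name `inclusion_probabilities` is an illustrative choice.

```python
from fractions import Fraction

# Design from the simple example: each possible sample of size 2 and its P(S).
design = {
    (1, 2): Fraction(1, 3),
    (1, 3): Fraction(1, 6),
    (1, 4): Fraction(1, 2),
    (2, 3): Fraction(0),
    (2, 4): Fraction(0),
    (3, 4): Fraction(0),
}


def inclusion_probabilities(design, units):
    """Return {unit: pi_i}, where pi_i sums P(S) over the samples containing the unit."""
    return {
        i: sum((p for sample, p in design.items() if i in sample), Fraction(0))
        for i in units
    }


pi = inclusion_probabilities(design, units=[1, 2, 3, 4])
for unit, prob in pi.items():
    print(f"pi_{unit} = {prob}")

# The inclusion probabilities of a fixed-size design sum to the sample size n.
print("sum of pi_i =", sum(pi.values()))
```

Under this design every unit has a known, nonzero inclusion probability, which is what the definition of a probability sample requires, even though three of the six possible samples can never be drawn.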
Population distribution
- Response variables represent values associated with a characteristic of interest for the i-th OU
  - Y is the random variable for the characteristic of interest (CAP Y)
  - $y_i$ = value of characteristic for OU i (small y)
- The population distribution is the distribution of Y for the target population
  - Y is a discrete random variable with a finite number of possible values (<= N values)
  - Use discrete probability distribution to represent the distribution of Y

Population distribution - 2
- A discrete probability distribution is denoted by a series of pairs corresponding to
  - Value of the random variable Y, denoted by y
  - Relative frequency of the value y for the random variable Y in the population, denoted by P(Y = y)
  - Pair is { y , P(Y = y) }
- Constructing a probability distribution
  - List all unique values y of random variable Y
  - Record the relative frequency of y in the population, P(Y = y)

Class example - 2
- Back to # of people in household for each class member
  - What are the unique values in the pop?
  - What is the frequency of each value?
  - What is the relative frequency of each value?
  - Construct a histogram depicting the variation in values

Summarizing the population distribution
- Use population parameters to summarize population distribution
  - Mean or expected value of y (parameter: $\bar{y}_U$)
  - Proportion of population having a particular characteristic = mean of a binary (0, 1) variable (parameter: $p$)
  - For finite populations, population total of y is often of interest (parameter: $t$)
  - Variance of y (parameter: $S^2$)

Mean of Y for population
- Expected value, or population mean, of Y
  $$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{t}{N}$$
- Mean is in y-units per OU-unit
- Measure of central tendency (middle of distn)
- Related to population total (t) and proportion (p)
- Examples
  - Average number of miles driven per week by adults in US
  - Average number of phone lines per household

Class example - 3
- What is the mean household size for people in this classroom?
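The two construction steps for a discrete population distribution (list the unique values, record their relative frequencies) can be sketched as follows. The household sizes in `y` are hypothetical values chosen for illustration, since the class data are not recorded in these notes, and `population_distribution` is an assumed helper name.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical household sizes for N = 4 class members (illustrative only).
y = [3, 5, 1, 3]


def population_distribution(values):
    """Return {y: P(Y = y)}: each unique value with its relative frequency."""
    counts = Counter(values)
    N = len(values)
    return {value: Fraction(count, N) for value, count in sorted(counts.items())}


dist = population_distribution(y)
for value, prob in dist.items():
    print(f"y = {value}, P(Y = y) = {prob}")

# Population mean computed two ways: directly from the N values, and as the
# expected value, i.e. the sum of y * P(Y = y) over the unique values.
mean_direct = Fraction(sum(y), len(y))
mean_expected = sum(value * prob for value, prob in dist.items())
print("population mean:", mean_direct, "=", mean_expected)
```

Both routes give the same number because the expected value simply groups equal $y_i$ values before averaging.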
Total of Y in population
- Population total of Y
  $$t = \sum_{i=1}^{N} y_i = N\bar{y}_U$$
- Total number of y-units in the population
- Examples
  - Number of households in market area with DSL
    - $y_i = 1$ if household i has DSL, $y_i = 0$ if not
    - N = number of households in market area
  - Number of deer in Iowa
    - $y_i$ = number of deer observed in area i
    - N = number of observation areas in Iowa

Class example - 4
- What is the total number of people living in households of people in the classroom?

Proportion
- Proportion (p) of population having a particular characteristic
- Mean of binary variable
  $$y_i = \begin{cases} 1, & \text{if OU } i \text{ has the characteristic} \\ 0, & \text{if OU } i \text{ doesn't have the characteristic} \end{cases}$$
  $$p = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{t}{N}$$

Class example - 5
- What proportion of people in the classroom have a cell phone?

Population variance of Y
- Population variance of Y
  $$V[Y] = S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2$$
- Measure of spread or variability in population’s response values
- Analogous to $\sigma^2$ in other stat classes
- Not the standard error of an estimate
- Note this is CAP $S^2$

Coefficient of variation for Y
- Variation relative to mean (unitless)
  $$CV = \frac{S}{\bar{y}_U}$$

Class example - 6
- What is the population variance for number of people in households of people in the classroom?
- What is the CV?
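As a worked illustration of the population parameters defined above, the sketch below computes $t$, $\bar{y}_U$, $S^2$ (with the $N - 1$ divisor used in these notes), CV, and a proportion. The data are hypothetical stand-ins for the class values, and the function name is an assumption made for this example.

```python
import math

# Hypothetical population of N = 4 household sizes (illustrative only).
y = [3, 5, 1, 3]
# Hypothetical 0/1 indicator, e.g. 1 = has a cell phone (illustrative only).
has_characteristic = [1, 0, 1, 1]


def population_parameters(values):
    """Return N, total t, mean, variance S2 (divisor N - 1), and CV = S / mean."""
    N = len(values)
    t = sum(values)
    mean = t / N
    S2 = sum((v - mean) ** 2 for v in values) / (N - 1)
    cv = math.sqrt(S2) / mean
    return {"N": N, "t": t, "mean": mean, "S2": S2, "CV": cv}


params = population_parameters(y)
print(params)

# A proportion is just the mean of a 0/1 variable.
p = population_parameters(has_characteristic)["mean"]
print("p =", p)
```

Note that `S2` here is a fixed population quantity, not an estimate: nothing in the calculation involves a sample.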
Summary of population distribution of Y
- Basic pop unit: OU (i)
- Number of units or size of pop: N
- Random variable: Y
- Parameters: characterize the target population
  - Mean $\bar{y}_U$
  - Total $t$
  - Proportion (mean) $p$
  - Variance $S^2$
  - Coefficient of variation $CV = S / \bar{y}_U$
- STATIC: it is the object of inference and never changes with design or estimator

What’s next
- Population distribution of Y is object of inference
- Use SRS to select a sample and estimate the parameters of the population distribution
  - How to select a sample
  - Estimators for population parameters of Y under SRS
    - Sample mean estimates population mean
    - N x sample mean estimates population total
    - Sample variance estimates population variance
  - Assessing the quality of an estimator of a population parameter under SRS
    - Sampling distribution
    - Bias, standard error, confidence intervals for the estimator

Simple random sample (SRS)
- DEFN: A SRS is a sample in which
  - every possible subset of n SUs has an equal chance of being selected as the sample
  - every sampling unit has equal chance of being included in the sample
- Example of an “equal probability” sample
- Does not imply that a sample in which each SU has the same inclusion probability is a SRS
  - Other non-SRS designs can generate equal probability samples

Simple random sampling (SRS)
- Two types
  - SRSWR (SRS with replacement): return SU after each step in the selection process
  - SRSWOR (SRS without replacement): do not return SU after it has been selected
- Selection probability
  - Probability that a unit is selected in a single draw
  - Constant throughout SRSWR process
  - Changes with each draw in the SRSWOR process
  - NOT an inclusion probability, which considers the probability of drawing a sample that includes unit i

SRSWR (SRS with replacement)
- Selection procedure
  - Select one OU with probability 1/N from N OUs
    - This is the selection probability for each draw
  - Return selected OU to universe
  - Repeat n times
- Procedure is like drawing n independent samples of size 1
- Can draw a sampling unit twice: duplicate units
  - Unappealing for finite populations: no additional info in having a duplicate unit
  - Useful in theoretical development for large populations

Focus: SRSWOR (SRS without replacement)
- Selection procedure
  - Select one OU from universe of size N with probability 1/N
    - DON’T return selected unit to universe
  - Select 2nd OU from remaining units in universe with probability 1/(N - 1)
    - DON’T return selected unit to universe
  - Repeat until n sampling units have been selected
- Selection probabilities change with each draw
  - 1/N, then 1/(N - 1), then 1/(N - 2), …, 1/(N - n + 1)

SRSWOR (SRS without replacement)
- Probability of selecting a sampling unit in a single draw depends on number of SUs already selected (conditional probability)
  - On the c-th step of the process, c - 1 SUs have already been selected for a sample of size n
  - Probability of selecting any of the remaining N - c + 1 SUs in the next draw is
    $$\frac{1}{N - c + 1}$$
- Inclusion probability for SU i (unconditional probability)
  $$\pi_i = \frac{n}{N}$$
  (see p. 44 in text)

SRSWOR (SRS without replacement)
- Number of possible SRSWOR samples of size n from universe of size N
  $$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}, \quad \text{where } x! = x(x-1)(x-2)\cdots 2 \cdot 1$$
- Probability of selecting a sample S
  $$P(S) = \frac{1}{\binom{N}{n}}$$
  (Probability is the same for all samples)

Selecting a SRS using SRSWOR
- Create a sampling frame
  - List of sampling units in the universe or population
  - Assigns an index to each sampling unit
- Determine a selection procedure that performs SRSWOR
  - Procedure must generate n unique sampling units such that each SU has an equal chance of being included in the sample
  - Random number generator or table is common basis
  - Need rules to identify when the selected unit is included in the sample or tossed
- Select random numbers and determine sampled units

Using random numbers to select a SRSWOR sample
- Determine a rule to assign random numbers to the sampling universe index set U
  - Rule must give each unit an equal chance of being included in the sample
- Select the set of random numbers, e.g., using computer or printed random number table
- Apply the rule to each random number to determine the sampled OU
- Check to see if this OU has already been selected
  - If already selected, ignore it
- Keep going until you have n SUs in the sample

Census of Agriculture example
- Select 300 counties from 3078 counties in the US
  - Sampling frame = ?
  - N = ___, n = ___
- Generate random numbers between 0 and 1 on the computer
  - Need n or more random numbers depending on rule
- Multiply each random number by N = 3078 and round up to the nearest integer
  - Random number = .61663
  - Multiply random # by N: 3078 x .61663 = 1897.98714
  - Round up to 1898
  - Take 1898th county in the frame

Estimating population mean under SRS
- Target population mean
  $$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$$
- Estimator of $\bar{y}_U$ for SRS sample of size n is the sample mean
  $$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$
- Note
  - “Estimator” refers to the formula
  - “Estimate” refers to the value obtained from using the formula with data

Class example - 7
- Estimate the average household size for our classroom

Estimating population total
- Target population total
  $$t = N\bar{y}_U = \sum_{i=1}^{N} y_i$$
- Estimator of $t$ for SRS sample of size n
  $$\hat{t} = N\bar{y} = \frac{N}{n}\sum_{i=1}^{n} y_i$$

Class example - 8
- Estimate the total number of people living in the households of people in this classroom

Estimating population proportion
- Target population proportion
  - Y takes on values 0 or 1, where 1 means the unit has the characteristic of interest
  $$p = \bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$$
- Estimator of $p$ for SRS sample of size n
  $$\hat{p} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

Class example - 9
- Estimate the proportion of people with cell phones in this classroom

Estimating population variance
- Target population variance
  $$V[Y] = S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2$$
- Estimator of $S^2$ for SRS sample of size n is the sample variance (note lower case s)
  $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

Class example - 10
- Estimate the variance of number of people in households of people in this classroom

Estimating population standard deviation and CV
- Standard deviation of Y, S?
- Estimator of standard deviation of Y?
- CV of population distribution?
- Estimator of CV?

What would happen if we took another sample?
- S = ___
- Data = ___
- Estimates
  - Mean
  - Total
  - Proportion
  - Standard deviation
  - CV

Sampling distribution
- Need to assess the quality of our estimates
  - Is $\bar{y}$ a good estimator of $\bar{y}_U$?
  - Is $\hat{p}$ a good estimator of $p$?
  - Is $s^2$ a good estimator of $S^2$?
- Use the sampling distribution to assess the quality of the estimator
  - Distribution of estimator over all possible samples
  - EX: distribution of $\bar{y}$ over all possible SRS samples of size n from a population of size N

Sampling distribution
- Simulation

Measures of quality
- Denote
  - Population parameter as $\theta$ [think pop mean $\bar{y}_U$]
  - Estimator of $\theta$ as $\hat{\theta}$ [think sample mean $\bar{y}$]
- Mean of the sampling distribution is the expected value of the estimator, $E\{\hat{\theta}\}$
  - An estimator is unbiased if $E\{\hat{\theta}\} = \theta$
- Variance of the sampling distribution, $V\{\hat{\theta}\}$
  - Precision: want variance of estimator to be small
- Coefficient of variation
  $$CV\{\hat{\theta}\} = \frac{\sqrt{V\{\hat{\theta}\}}}{E\{\hat{\theta}\}}$$
  - Relative precision: want CV to be small

Sampling distribution of estimator
- Basic pop unit: sample selected using a specific design, S
- Number of units or size of pop: number of possible samples
- Random variable: estimator of parameter, $\hat{\theta}$
- Parameters: characterize the quality of the estimator
  - Need probability of selecting sample!
  - Mean (assesses bias of the estimator), $E\{\hat{\theta}\}$
  - Variance, SE, CV (assesses precision of estimator)
- DEPENDS on population parameter, estimator of population parameter, sample design

Population distribution vs. sampling distribution
- Population distribution
  - Basic unit: OU (i)
  - Total number of units: N
  - Random variable: characteristic of interest, Y
  - Parameters: characterize the target population
    - Mean $\bar{y}_U$, proportion $p$ (central tendency)
    - Total $t$
    - Variance $S^2$, std dev $S$, CV (spread of distn)
  - STATIC: once you identify Y, the pop distribution is the object of inference and never changes with design or estimator
- Sampling distribution
  - Basic unit: sample selected using a specific design, S
  - Total number of units: number of possible samples
  - Random variable: estimator of parameter, $\hat{\theta}$
  - Parameters: characterize the quality of the estimator
    - Mean $E\{\hat{\theta}\}$ (used to assess bias of the estimator)
    - Variance $V\{\hat{\theta}\}$, SE, CV (precision of estimator)
  - DEPENDS on population parameter, estimator of population parameter, sample design

Conceptual framework for a sampling distribution - 1
- List out all possible samples of size n from the population of size N
  - A sample is the BASIC UNIT for the population of all possible samples
- We determine the probability of selecting the sample
  - Unequal probability sample (now)
  - Simple random sample
- NOTE: sampling distribution depends on the design selected

Simple example from earlier lecture (not SRS!)
- All possible samples
  - S1 = {1, 2}, S2 = {1, 3}, S3 = {1, 4}, S4 = {2, 3}, S5 = {2, 4}, S6 = {3, 4}
- Design is determined by assigning a selection probability to each possible sample
  - P(S1) = 1/3, P(S2) = 1/6, P(S3) = 1/2
  - P(S4) = 0, P(S5) = 0, P(S6) = 0

Conceptual framework for a sampling distribution - 2
- List out all possible samples (as in part 1)
- Using the n data values associated with each sample, calculate the value of the estimator for each sample
  - The estimator is the random variable of our distribution
  - Example: sample mean $\bar{y}$ is calculated for each of the possible samples
- NOTE: the sampling distribution depends on the estimator selected

Simple example from earlier lecture - 2
- Population values for Y

  i     | 1 | 2 | 3 | 4
  $y_i$ | 3 | 5 | 1 | 3

- All possible samples of size n = 2
  - S1 = {1, 2}, S2 = {1, 3}, S3 = {1, 4}, S4 = {2, 3}, S5 = {2, 4}, S6 = {3, 4}
- Values of $\bar{y}$ corresponding to each sample
  - $\bar{y}_1 = (3 + 5)/2 = 4.0$
  - $\bar{y}_2 = (3 + 1)/2 = 2.0$
  - $\bar{y}_3 = (3 + 3)/2 = 3.0$
  - $\bar{y}_4 = (5 + 1)/2 = 3.0$
  - $\bar{y}_5 = (5 + 3)/2 = 4.0$
  - $\bar{y}_6 = (1 + 3)/2 = 2.0$

Conceptual framework for a sampling distribution - 3
- List out all possible samples (as in part 1)
- Using the data values, calculate the estimator for each sample (as in part 2)
- Sampling distribution is described by pairs of values for the estimator from the sample and the relative frequency of obtaining that value
  - We are using the steps we used before for creating a discrete distribution

Representing the sampling distribution
- Probability distribution: pairs of $\{c, P(\bar{y} = c)\}$
  - $\bar{y}$ is a random variable, c is a value of $\bar{y}$
  $$P(\bar{y} = c) = \sum_{S:\, \bar{y}_S = c} P(S)$$
  where "$S: \bar{y}_S = c$" means "all samples S such that $\bar{y}_S = c$"

Simple example from previous lecture - 3
- Number of possible samples
  $$\binom{N}{n} = \binom{4}{2} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = \frac{24}{4} = 6$$
- Probability of selecting sample
  - $P(S_1) = 1/3$, $\bar{y}_1 = 4.0$
  - $P(S_2) = 1/6$, $\bar{y}_2 = 2.0$
  - $P(S_3) = 1/2$, $\bar{y}_3 = 3.0$
  - $P(S_4) = 0$, $\bar{y}_4 = 3.0$
  - $P(S_5) = 0$, $\bar{y}_5 = 4.0$
  - $P(S_6) = 0$, $\bar{y}_6 = 2.0$
- Probability distribution: unique values of $\bar{y}$ and relative frequency

  c                | 2.0 | 3.0 | 4.0
  $P(\bar{y} = c)$ | ___ | ___ | ___

Conceptual framework for a sampling distribution - 4
- List out all possible samples (as in part 1)
- Using the data values, calculate the estimator for each sample (as in part 2)
- Sampling distribution (as in part 3)
- Parameters summarize sampling distribution
  - Mean of sampling distribution
  - Variance, std dev (SE) of sampling distribution
  - CV of sampling distribution

Ex: mean and variance of sampling distribution for $\bar{y}$ - 4
- Mean of sampling distribution
  - Same concept of expected value used with population distribution
  $$E\{\bar{y}\} = \sum_{c} c\,P(\bar{y} = c) = (2.0)\tfrac{1}{6} + (3.0)\tfrac{1}{2} + (4.0)\tfrac{1}{3} = \frac{2 + 9 + 8}{6} = \frac{19}{6} = 3.1\overline{6} \approx 3.17$$
- Variance of sampling distribution
  - Use more general formula for variance
  - Later, we’ll use reductions that are easier to calculate
  $$V\{\bar{y}\} = E\{(\bar{y} - E[\bar{y}])^2\} = \sum_{c} (c - E\{\bar{y}\})^2 P(\bar{y} = c)$$
  $$= (2.0 - 3.1\overline{6})^2\,\tfrac{1}{6} + (3.0 - 3.1\overline{6})^2\,\tfrac{1}{2} + (4.0 - 3.1\overline{6})^2\,\tfrac{1}{3} = 0.47222$$

What if we took a SRS of size n from N units?
- List out all possible samples
  - Number of possible samples: $\binom{N}{n} = \frac{N!}{(N-n)!\,n!}$
- Determine the probability of a sample
  - $P(S) = 1 / \binom{N}{n}$, constant for all samples
- Calculate estimator for each sample
  - Examples: $\bar{y}$ or $\hat{t}$ or $\hat{p}$
- Create a discrete probability distribution
- Calculate summary parameters
  - For $\bar{y}$: $E\{\bar{y}\}$ and $V\{\bar{y}\}$
  - For $\hat{t}$: $E\{\hat{t}\}$ and $V\{\hat{t}\}$

Back to example with SRS
- Number of possible samples
  $$\binom{N}{n} = \binom{4}{2} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = \frac{24}{4} = 6$$
- Probability of selecting sample
  - $P(S_1) = 1/6$, $\bar{y}_1 = 4.0$
  - $P(S_2) = 1/6$, $\bar{y}_2 = 2.0$
  - $P(S_3) = 1/6$, $\bar{y}_3 = 3.0$
  - $P(S_4) = 1/6$, $\bar{y}_4 = 3.0$
  - $P(S_5) = 1/6$, $\bar{y}_5 = 4.0$
  - $P(S_6) = 1/6$, $\bar{y}_6 = 2.0$
- Probability distribution: unique values of $\bar{y}$ and relative frequency

  c                | 2.0 | 3.0 | 4.0
  $P(\bar{y} = c)$ | ___ | ___ | ___

Example: mean of sampling distribution for $\bar{y}$ under SRS
- Mean of sampling distribution
  $$E\{\bar{y}\} = \sum_{c} c\,P(\bar{y} = c) = (2.0)\tfrac{1}{3} + (3.0)\tfrac{1}{3} + (4.0)\tfrac{1}{3} = \frac{9}{3} = 3.0$$
- Mean of population distribution
  $$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{4}(3 + 5 + 1 + 3) = \frac{12}{4} = 3.0$$

Bias of an estimator
- Estimation bias of $\hat{\theta}$
  $$\text{Bias}[\hat{\theta}] = E\{\hat{\theta}\} - \theta$$
- Note that this is the mean of the estimator (from sampling distribution) minus the population parameter (from population distribution)
- If $\text{Bias}[\hat{\theta}] = 0$ then $\hat{\theta}$ is said to be an unbiased estimator of $\theta$

Variance of sample mean under SRS
- Don’t have to use the general formula
- Variance of sample mean (derived using stat theory)
  $$V[\bar{y}] = \frac{S^2}{n}\left(1 - \frac{n}{N}\right), \quad \text{where } S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2 \text{ is the population variance}$$
- Similar to infinite population formula $\sigma^2 / n$
- Has an extra factor called the finite population correction factor (FPC)

Example
- Variance of sampling distribution for $\bar{y}$
  $$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2 = \frac{1}{3}\left[(1-3)^2 + 2(3-3)^2 + (5-3)^2\right] = \frac{8}{3}$$
  $$V\{\bar{y}\} = \left(1 - \frac{n}{N}\right)\frac{S^2}{n} = \left(1 - \frac{2}{4}\right)\frac{8/3}{2} = \frac{2}{3} \approx 0.6667$$
  - This agrees with the general formula applied to the six equally likely sample means 4, 2, 3, 3, 4, 2
- Other measures of dispersion for sampling distribution
  $$SE\{\bar{y}\} = \sqrt{V\{\bar{y}\}} = \sqrt{0.6667} = 0.8165$$
  $$CV\{\bar{y}\} = \frac{\sqrt{V\{\bar{y}\}}}{E\{\bar{y}\}} = \frac{0.8165}{3} = 0.2722$$

Finite population correction factor (FPC)
- Variance of the sample mean
  $$V[\bar{y}] = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$$
- Finite population correction factor
  $$FPC = 1 - \frac{n}{N}$$
- Sampling fraction is the proportion of the population sampled, or n/N
- Larger sample → larger fraction of population → smaller FPC → smaller variance of sample mean

Impact of FPC on estimated variance of parameter estimate
- Often FPC is very close to 1
  - Sample of 3000 households from total of 1,200,000 households
  $$\text{Sampling fraction} = \frac{n}{N} = \frac{3000}{1{,}200{,}000} = 0.0025$$
  $$FPC = 1 - \frac{n}{N} = 1 - \frac{3000}{1{,}200{,}000} = 1 - 0.0025 = 0.9975$$
- In cases where sampling fraction is very small and FPC is very close to 1, FPC has no practical effect on the SE or estimated variance of the param estimate
- Sampling fraction n/N is not a good measure of whether your estimate will be precise
- The sample size n is the most important part of the variance or SE formulas, given variance $s^2$

Estimating population variance under SRS
- Do not know variance of population distribution, $S^2$
- Unbiased estimator for $S^2$
  $$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$$
- Estimator for $V[\bar{y}]$
  $$\hat{V}[\bar{y}] = \frac{s^2}{n}\left(1 - \frac{n}{N}\right)$$
- Note that $\widehat{SE}(\bar{y}) = \sqrt{\hat{V}[\bar{y}]}$ is the standard error of the sample mean

Ag example
- Interested in average number of acres per county devoted to farms
- Sample 300 counties from list of 3078
- Collect data and get following summary statistics
  - $\bar{y} = 297{,}897$ farm acres per county in 1992
  - $s^2 = 344{,}551.9$
- What are estimated mean and standard error?
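The estimated variance and standard error of the sample mean under SRSWOR follow directly from $\hat{V}[\bar{y}] = (1 - n/N)\,s^2/n$. The sketch below is an illustration with assumed function names; it takes the sample variance $s^2$ as an input rather than hard-coding the Ag example figure, because the extracted slide text does not make clear whether the printed 344,551.9 is a variance or a standard deviation. If it is a standard deviation, the squared value has to be passed instead.

```python
import math


def srs_mean_variance(s2, n, N):
    """Estimated variance of the sample mean under SRSWOR: (1 - n/N) * s2 / n."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return (1 - n / N) * s2 / n


def srs_mean_se(s2, n, N):
    """Estimated standard error of the sample mean: square root of the variance."""
    return math.sqrt(srs_mean_variance(s2, n, N))


# Design figures from the Ag example: n = 300 counties sampled from N = 3078.
n, N = 300, 3078
fpc = 1 - n / N
print(f"FPC = {fpc:.4f}")

# s2_from_data is a placeholder: supply the sample VARIANCE from the data.
# The value below is the figure exactly as printed in the notes.
s2_from_data = 344_551.9
print("SE of ybar for the supplied s2:", srs_mean_se(s2_from_data, n, N))
```

With about 10% of the counties sampled, the FPC is roughly 0.90, so here the correction does shrink the variance noticeably.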
Rounding rules
- Always keep all of the digits while you are doing calculations
- Round only when you get ready to report the result at the end of the calculation
- Round the estimated SE to 2 significant digits
  - 107,789 is rounded to 110,000
  - 0.0325329 is rounded to 0.033
- Round estimate to precision of the SE
  - If SE is 110,000, round estimate to nearest 10,000 (xx0,000)
  - If SE is 0.033, round estimate to nearest 1/1000 (x.xxx)
- Estimated variances are usually reported to 5 significant digits

Sampling distribution for $\bar{y}$ using SRS of size n from N
- $\bar{y}$ is an unbiased estimator of $\bar{y}_U$
  - Mean of sampling distribution is always equal to population mean under SRS
  $$E\{\bar{y}\} = \bar{y}_U$$
- Variance of $\bar{y}$ is
  $$V[\bar{y}] = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$$
- Estimate the variance of $\bar{y}$ using sample variance $s^2$
  $$\hat{V}[\bar{y}] = \frac{s^2}{n}\left(1 - \frac{n}{N}\right)$$

Sampling distribution of $\hat{t}$ under SRS
- Mean of $\hat{t}$ for population total $t$ under SRS
  $$E\{\hat{t}\} = E\{N\bar{y}\} = N\,E\{\bar{y}\} = N\bar{y}_U = t$$
- Expectation of a linear function of a random variable
  - If a, b are constants and Y, $\hat{\theta}$ are random variables, then
  $$E\{aY + b\} = aE\{Y\} + b, \qquad E\{a\hat{\theta} + b\} = aE\{\hat{\theta}\} + b$$
- Is $\hat{t}$ an unbiased estimator of $t$?

Sampling distribution of $\hat{t}$ under SRS - 2
- Variance of estimator of total under SRS
  $$V[\hat{t}] = V[N\bar{y}] = N^2 V[\bar{y}] = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n}$$
- Variance of a linear function of a random variable
  - If a, b are constants and Y, $\hat{\theta}$ are random variables, then
  $$V\{aY + b\} = a^2 V\{Y\}, \qquad V\{a\hat{\theta} + b\} = a^2 V\{\hat{\theta}\}$$

Sampling distribution of $\hat{t}$ under SRS - 3
- Estimator for variance of $\hat{t}$ under SRS
  $$\hat{V}[\hat{t}] = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}$$

Ag example - 2
- Estimated total acres devoted to farms in the US in 1992?
- Estimated variance of estimated total?
- Other measures of dispersion for sampling distribution?
  - Estimated SE

Sampling distribution of $\hat{p}$ under SRS
- Mean of estimator $\hat{p}$ for population proportion $p$ under SRS
  - $E\{\hat{p}\}$ = ___
- Is $\hat{p}$ unbiased for $p$?
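The claims about $\hat{t} = N\bar{y}$ can be verified by brute force on a small population. This sketch reuses the four population values from the simple example (3, 5, 1, 3), enumerates all six equally likely SRS samples of size 2, and compares the mean and variance of the resulting sampling distribution with $t$ and with $N^2(1 - n/N)S^2/n$. Exact fractions are used so the comparison is not blurred by floating-point rounding; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 5, 1, 3]  # population values from the simple example, units 1..4
N, n = len(y), 2

# Population parameters.
t = sum(y)
mean_U = Fraction(t, N)
S2 = sum((Fraction(v) - mean_U) ** 2 for v in y) / (N - 1)

# Under SRS every one of the C(N, n) possible samples has the same probability.
samples = list(combinations(range(N), n))
p_sample = Fraction(1, len(samples))

# Value of t_hat = N * ybar for each possible sample.
t_hats = [Fraction(N * sum(y[i] for i in s), n) for s in samples]

# Mean and variance of the sampling distribution of t_hat (general formulas).
expected = sum(p_sample * th for th in t_hats)
variance = sum(p_sample * (th - expected) ** 2 for th in t_hats)

# Closed-form variance from SRS theory.
formula = Fraction(N) ** 2 * (1 - Fraction(n, N)) * S2 / n

print("E{t_hat} =", expected, "  t =", t)
print("V{t_hat} by enumeration =", variance, "  by formula =", formula)
```

The same enumeration pattern works for $\bar{y}$ or $\hat{p}$ by changing the line that computes the estimator for each sample.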
Sampling distribution of $\hat{p}$ under SRS - 2
- Variance of sample proportion $\hat{p}$ (derived using stat theory)
  $$V[\hat{p}] = \left(\frac{N - n}{N - 1}\right)\frac{p(1 - p)}{n}$$
  - Very similar to infinite population formula $p(1 - p)/n$
  - Extra factor arises from finite pop and is NOT the same as the FPC
- Estimator does have the FPC in the formula
  $$\hat{V}[\hat{p}] = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1 - \hat{p})}{n - 1}$$

Ag example - 3
- Suppose we are interested in the proportion of counties with fewer than 200,000 acres devoted to farms in 1992
- Data from our sample of 300 indicate that 153 counties have less than 200,000 acres devoted to farms
- Estimated population proportion?
- Estimated SE of estimated proportion?

Quality of estimates (Fig 2.2, p. 29)
- Estimator under a given design is unbiased
  - On average over a large number of samples, the mean of the estimates “hits” the target population parameter (centered on the bull’s eye)
- Estimator under a given design is precise
  - Over a large number of samples, estimates will tend to be close to one another, indicating that the variance of the sampling distribution for the estimator is small
  - Clump pattern, but may not be centered on bull’s eye (precise but biased)
- Estimator under a given design is accurate
  - Estimator comes close to hitting target and is precise
  - Assess this with the mean squared error (MSE)

Mean squared error of an estimator $\hat{\theta}$
- Mean squared error (MSE) of $\hat{\theta}$
  $$MSE[\hat{\theta}] = E\{(\hat{\theta} - \theta)^2\} = V[\hat{\theta}] + \left(\text{Bias}[\hat{\theta}]\right)^2$$
- Combines measures of bias and precision to provide an index of the accuracy of an estimator under a given design
- Sometimes we are willing to accept a little bias to get a more precise estimator, if the MSE is improved
- If $\text{Bias}[\hat{\theta}] = 0$ then $MSE[\hat{\theta}] = V[\hat{\theta}]$

MSE of SRS estimators
- All of these estimators are unbiased under SRS (Bias = 0)
- So under SRS
  - $MSE[\bar{y}] = V\{\bar{y}\}$
  - $MSE[\hat{p}] = V\{\hat{p}\}$
  - $MSE[\hat{t}] = V\{\hat{t}\}$

Confidence intervals
- Estimate variance, SE, CV, MSE of estimator under a design to provide indication of quality of estimate
- Another approach
  - Estimate a confidence interval to express precision of estimate

Book example 2.7, p. 35-6
- True parameter value: t = 40
- CI of interest: $[\hat{t} - 4\,\widehat{se}(\hat{t}),\ \hat{t} + 4\,\widehat{se}(\hat{t})]$
- List 70 possible samples of size n = 4
  - Each sample has a probability of selection P(S)
- For each sample, record value of a variable u that indicates whether the CI from sample S includes t = 40
  $$u(S) = \begin{cases} 1, & \text{if } 40 \in [\hat{t} - 4\,\widehat{se}(\hat{t}),\ \hat{t} + 4\,\widehat{se}(\hat{t})] \\ 0, & \text{if } 40 \notin [\hat{t} - 4\,\widehat{se}(\hat{t}),\ \hat{t} + 4\,\widehat{se}(\hat{t})] \end{cases}$$
- Confidence coefficient
  $$1 - \alpha = \sum_{k=1}^{70} P(S_k)\,u_k = 0.77$$

Ex - 2: Assume SRSWOR
- If 60 of the 70 SRSWOR samples resulted in CIs that included the true total, what is the confidence coefficient?
- What is alpha?

What is a 95% confidence interval (CI) under SRS?
- Heuristic definition
  - Take repeated samples of size n from population of size N
  - Collect data on Y
  - Calculate an estimate of a population parameter using data from n observations
  - Calculate 95% CI for parameter estimate using data from n observations
  - Expect 95% of the CIs to contain the true value of the parameter

Interpreting CIs in general
- More generally (for any design), a $(1-\alpha)100\%$ CI has the interpretation
  - There is a $(1-\alpha)100\%$ chance of selecting a sample for which the CI will include the true population parameter
- Note
  - The upper and lower limits of the CI are random variables, calculated from the sample data
  - The true parameter value is either included or not included in a single CI
  - Confidence coefficient of a CI has a relative frequency interpretation across samples

Confidence interval definition
- Standard estimator for a $(1-\alpha)100\%$ confidence interval (CI):
  $$\hat{\theta} \pm z_{\alpha/2}\,\widehat{se}(\hat{\theta})$$
  or equivalently
  $$[\hat{\theta} - z_{\alpha/2}\,\widehat{se}(\hat{\theta}),\ \hat{\theta} + z_{\alpha/2}\,\widehat{se}(\hat{\theta})]$$

Standard normal distribution
- Z ~ N(0, 1)
  - Z is the random variable
  - Mean E{Z} = 0 and variance V{Z} = 1
- Two-sided $(1-\alpha)100\%$ confidence interval
  - Use critical value $z_{\alpha/2}$, where $P(Z > z_{\alpha/2}) = \alpha/2$

Infinite vs. finite populations
- In other stat classes …
  - Assume SRS with replacement from infinite pop
  - Justify CI by applying the Central Limit Theorem (CLT)
- In sample surveys, we have a finite number of possible samples
  - Can calculate exact confidence coefficient $1-\alpha$ for a stated interval (see previous example)
  - In practice, it is not possible to list all possible samples, so we have a special CLT that relies on a “superpopulation” framework

Superpopulation framework
- Asymptotic framework for SRSWOR in finite populations
  - Population is part of a larger superpopulation
  - There is a series of increasingly larger superpopulations
  - Use superpopulation concept to derive a Central Limit Theorem for SRSWOR
- Bottom line
  - We will use the standard CI estimator with a different theoretical justification

When is CLT justified?
- Confidence coefficient is approximate
- Quality of approximation depends on n and the distribution of the underlying random variable, Y
- “n is large enough for CLT” is less clear for finite populations
  - n = 30 rule in other stat classes does NOT apply
- Rules of thumb
  - If distribution of Y is close to normal, n = 50
  - Need larger n if distribution of Y deviates from normal, e.g., skewed
  - Y categorical: if p is proportion with characteristic of interest, $np \ge 5$ and $n(1-p) \ge 5$

Determining sample size: a general approach
- Specify tolerable error (level of precision, level of confidence)
- Identify appropriate equation relating tolerable error (e, $\alpha$) to sample size (n)
- Estimate unknown parameters in equation
- Solve for n
- Evaluate (and return to first step)
  - Can you afford sample size?
  - What expectations can be altered?
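Both the standard CI estimator $\hat{\theta} \pm z_{\alpha/2}\,\widehat{se}(\hat{\theta})$ and the tolerable error used in the sample-size steps rest on the same quantity: the half-width $z_{\alpha/2}\,\widehat{se}(\hat{\theta})$. A minimal sketch follows, assuming the normal approximation is acceptable for the design and sample size at hand; the function name and the numbers in the usage lines are made up for illustration.

```python
from statistics import NormalDist


def normal_ci(estimate, se, alpha=0.05):
    """Return (lower, upper, half_width) for estimate +/- z_{alpha/2} * se."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    if se < 0:
        raise ValueError("se must be non-negative")
    # Critical value z_{alpha/2}: P(Z > z) = alpha / 2 for Z ~ N(0, 1).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * se
    return estimate - half_width, estimate + half_width, half_width


# Illustrative numbers only: an estimate of 10.0 with estimated SE 1.0.
lower, upper, e = normal_ci(10.0, 1.0, alpha=0.05)
print(f"95% CI: [{lower:.3f}, {upper:.3f}], half-width e = {e:.3f}")
```

The coverage of such an interval is only approximate in a finite population, as the CLT discussion above points out.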
Specify tolerable error
- Two parameters
  - $e$: margin of error or half-width of CI
  - $\alpha$: $[1-\alpha]100\%$ is confidence level
- Absolute expression (half-width of CI): estimate within e of true pop parameter
  $$P\left(|\hat{\theta} - \theta| \le e\right) = 1 - \alpha$$
- Relative expression: $\hat{\theta}$ within 100e% of $\theta$
  $$P\left(\left|\frac{\hat{\theta} - \theta}{\theta}\right| \le e\right) = 1 - \alpha$$

Equation linking e, $\alpha$, and n
- Most common equation is half-width of CI
  $$e = z_{\alpha/2}\,SE[\hat{\theta}]$$
- Example: sample mean under SRSWOR
  $$e = z_{\alpha/2}\sqrt{\left(1 - \frac{n}{N}\right)\frac{S^2}{n}}$$
  $$n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}} = \frac{n_0}{1 + \dfrac{n_0}{N}} \quad \text{for } n_0 = \frac{z_{\alpha/2}^2 S^2}{e^2}$$
- Note
  - For $p$, use $S^2 \approx p(1-p)$
  - For $\alpha = 0.05$, use $z_{\alpha/2} \approx 2$
  - $n_0$ is sample size under SRSWR (ignoring FPC)

Estimate unknowns: population variance of y, $S^2$
- Use estimator for variance, $s^2$
  - Pilot study
  - Previous study
    - Careful about comparability
- Use CV from previous study
  - Careful about comparability
- Guess variance under normality
  - estimate of S = range for 95% of values / 4
  - estimate of S = range for 99% of values / 6

Estimating unknowns: population proportion, p
- Use estimates from pilot or previous study
- If know nothing of true proportion
  - Use p = 0.5
  - Max possible variance for estimated proportion under SRS, so this is conservative
  - Commonly used

Practicalities for determining n
- Sampling fraction rarely important
  - Most populations are large enough that sampling fraction n/N is small for practical values of n
- Subpopulations should influence sample size
- 95% CI for a proportion ($\alpha = 0.05$, p = 0.5)
  - Implies $e \approx 1/\sqrt{n}$
  - n = 400 for $e \approx 0.05$ (whole sample)
  - n = 100 for $e \approx 0.10$ (subpopulation)
  - n = 50 for $e \approx 0.14$ (subpopulation)
  - n = 500 for $e \approx 0.04$ (little gain over 400)

SRS: pros and cons
- Cons
  - SRS is rarely the “best” design
  - May not have list of all OUs → need different design
  - May have additional info on pop to create a more efficient design (improve precision)
- Pros / uses
  - Standard stat procedures can be used with little or no bias
  - Mainly interested in regression rather than estimating pop params (ignore sample design, but could still get a better sample)
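The sample-size equation for the mean under SRSWOR, $n = n_0 / (1 + n_0/N)$ with $n_0 = z_{\alpha/2}^2 S^2 / e^2$, can be sketched as below. The function name and its argument names are assumptions made for this illustration; the exact normal critical value is used rather than the $z \approx 2$ shortcut, so results come out slightly below the rounded figures quoted in the notes (for example 385 rather than 400 for e = 0.05 with p = 0.5).

```python
import math
from statistics import NormalDist


def srs_sample_size(e, S2, N=None, alpha=0.05):
    """Sample size so that the CI half-width for a mean is about e under SRS.

    e     : tolerable margin of error (half-width of the CI)
    S2    : guess at the population variance (use p * (1 - p) for a proportion)
    N     : population size; None ignores the FPC (the SRSWR size n0)
    alpha : 1 - alpha is the confidence level
    """
    if e <= 0 or S2 < 0:
        raise ValueError("need e > 0 and S2 >= 0")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = z ** 2 * S2 / e ** 2
    n = n0 if N is None else n0 / (1 + n0 / N)
    return math.ceil(n)  # round up so the precision target is met


# Conservative planning value for a proportion: p = 0.5, so S2 = 0.25.
n_no_fpc = srs_sample_size(e=0.05, S2=0.25)
n_with_fpc = srs_sample_size(e=0.05, S2=0.25, N=3078)
print("n0 (ignoring FPC):", n_no_fpc)
print("n for N = 3078:", n_with_fpc)
```

Comparing the two calls shows when the sampling fraction matters: the FPC only reduces the required n appreciably when $n_0$ is a sizable share of N.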