White Paper

Towards the Use of Bayesian Credibility Intervals in Online Survey Results

Alan Roshwalb, Neale El-Dash, and Clifford Young

Motivation

Survey researchers and survey statisticians have been dealing with compromise since the very beginnings of the field. From the start we have been trading off quality versus cost and results versus intuition. We survey researchers and survey statisticians always ask ourselves what is appropriate for the quality and the cost. Should we use a door-to-door multi-stage area sample or a telephone sample? How big a sample should we use? How many callbacks do we attempt? Do we use a single frame or multiple frames? Can we achieve our goals using a mail study? Will the data be better if the survey is self-administered or interviewer-administered? Is an incentive appropriate, and how much? Each question asks us to assess the gains, the pitfalls, the potential bias, the loss or gain of information, and the costs. Each decision can lead us towards or away from the perfect probability sample.

Most of us review the data as it comes out of the field. We often ask ourselves: does it look right? Does it feel right? Who can forget the overwhelming support for Pat Buchanan in Palm Beach County, FL in the 2000 Presidential election? It didn’t look right, it didn’t feel right, and in the end, there were so many inconsistencies in the design of the electoral ballot form that it wasn’t right.1

Introduction

At its root, survey research collects information about a target population through a sample drawn from that population. The purpose of sampling is to reduce the cost and/or the amount of work that it would take to survey the entire target population. Since only a portion of the target population is measured, survey results need to be generalized, allowing statements to be made about the entire population.
Included in a generalization is a measure of uncertainty with respect to the estimate. If the estimate has a high degree of uncertainty, then there is a wide range of values that could be the true population value. If the estimate has little uncertainty attached to it, then the range of plausible values is narrower, and hence the estimate is more precise.

The most common measure of uncertainty is the “confidence interval” (or “margin of error”) of the estimate. It describes the level of confidence that the true population parameter falls within an interval around the sample estimate. It is often used as a measure of a survey estimate’s reliability. As a result, surveys are traditionally published using two statistics: the survey estimate and the confidence interval. An example of such practice is to say “For results based on these sample sizes, the maximum margin of error is ±3 percentage points with 95% confidence”. The correct interpretation of this statement is: if the poll were repeated 100 times, then in about 95 of those polls the 95% Confidence Interval, i.e., the range within 3 points of that poll’s sample estimate, would contain the true percentage of people giving a particular answer to the question.

This paper tries to consolidate these competing decisions, our understanding and our intuition into a reasonable framework. Here we propose a departure from “classical” approaches to a Bayesian approach. Our goal is to provide a framework that increases our understanding, uses our intuition and accumulated knowledge, and lets us interpret data that we would very often dismiss.

The 95% Confidence Interval as a measure of reliability relies on the theory of statistical sampling, where p(s) are the probabilities of inclusion for all elements of the population. We refer to this as the “classical” approach. Expectations of statistical estimates, including the parameter estimate and its variance, are taken over these probabilities.
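The “±3 percentage points with 95% confidence” statement quoted above can be reproduced with a short calculation. The sketch below is only an illustration: it assumes simple random sampling, the usual normal approximation, and the worst case of a 50% proportion. The function names and the sample sizes are hypothetical and do not come from the paper.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the classical 95% confidence interval for a proportion.

    Assumes simple random sampling and a normal approximation;
    p=0.5 gives the maximum (worst-case) margin of error.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

def sample_size_for(moe, p=0.5, z=1.96):
    """Smallest sample size whose margin of error does not exceed `moe`."""
    return math.ceil((z ** 2) * p * (1.0 - p) / moe ** 2)

# Illustrative use: how large a sample gives a maximum margin of 3 points?
n_needed = sample_size_for(0.03)
print("sample size for +/-3 points:", n_needed)
print("margin of error at that size:", round(margin_of_error(n_needed), 4))
```

Note that this calculation is only meaningful when the inclusion probabilities p(s) are known, which is the point the classical approach depends on.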
When data are missing, either through non-response or through incomplete frame coverage, the classical approach requires the researcher to make assumptions so that estimates are valid or even calculable. These assumptions include random nonresponse in the event of non-response, or a minimal effect of non-coverage when the sample frame covers less than 100% of the population.

1 In the 2000 Presidential Election, third-party candidate Patrick Buchanan received an unexpectedly large number of votes from Palm Beach County, FL, a heavily Jewish county. Pat Buchanan had a history of positions considered to be controversial in the Jewish community. On review, it was felt that the Palm Beach County butterfly ballot was poorly designed and could confuse voters.

Probabilistic Sampling and Classical Statistics

Theory from finite population sampling and survey statistics allows for a confidence interval (±3% in this example) to be calculated properly only if the sample is selected using probabilistic sampling with known and positive values for p(s). This means that one has to know, with certainty, the probability of each unit in the population being included in the sample, i.e., what percentage of times each unit would be included in a sample if an infinite number of samples were selected. This kind of reasoning is rooted in the field of Frequentist or Classical Statistics. Within the Frequentist framework, probabilities measure the relative frequency of the occurrence of an event.

As practitioners know, the idea of a ‘true’ probability sample is utopian and does not exist in reality; we simply seek to get as close as possible. Gold standard probability samples have always had non-probabilistic elements:

• Door-to-door studies: insufficient maps, missing dwelling units, access-restricted neighborhoods or apartment buildings.
• RDD telephone studies: cell phone households, residential telephone numbers in zero-listed blocks in RDD telephone samples, or households consistently using call blocking or screening2.
• Internet studies: non-connected households, etc.

Internet studies are perhaps more overtly non-probabilistic than door-to-door or telephone samples because a panel, in effect the sampling frame, may or may not have been constructed to include every member of the population. Internet frames rely on one of several means of constructing their frame. Some use river sampling, i.e., advertising on the Internet and asking people to join the “panel,” which is their sample frame. Some use lists of known Internet users and actively recruit them into their panels, others allow people to apply to their panel without active recruitment, and others use non-Internet lists to recruit. In most of these cases, the probabilities of being included in the sample frame are unknown, very difficult to ascertain, or simply zero, as for non-Internet users.

What is troubling is that Internet use is not uniform within the population, which limits our ability to calculate the likelihood of reaching a person by online contact in the same way we do (with some caveats) on the telephone. The reasons vary. One is that Internet connections and usage patterns are highly variable across households, such as:

• Some people can be regular users of the Internet only at home,
• Some people can be regular users of the Internet only at work,
• Some people can be users of the Internet at places other than work or home (e.g., at the library, via mobile device, etc.), and
• Some people never use the Internet or do not have access.

Door-to-door and telephone surveys often profile the population to account for elements of the population missing from the sample. Less is known about the ‘profiles’ of individuals who complete online surveys versus those who do not, and about the likelihood of completing an online survey among those who do. Our industry does not yet have a standardized approach to these metrics.

The survey research community has categorized the nature of online surveys. Most are based on self-recruited sample frames, either through Internet river sampling or volunteering, and the respondents form a panel of potential respondents for current and future studies. These studies are now referred to as panel-based, ‘opt-in’ samples. They are often characterized as non-probability samples, convenience samples or samples based on other motivations to be part of the study. This is a ‘layer’ of non-probability, since the sample frame is constructed as a subset of the population, and there is no practical intention of including the entire population, or a substantial portion of it, in the sample frame.

Strictly speaking, from a classical perspective, one cannot calculate the confidence interval for Internet opt-in based surveys because it is not possible to assign a probability of selection to each population member, and many have a probability of selection of zero. Simplistically put, most online surveys are considered non-probability samples.

So shall we give up? Many of the problems with online samples are shared with the other approaches. Shall we damn online samples simply because they are deprived of measures of scientific precision? We do not believe so. Even though online samples from opt-in panels are not probabilistic, they do still provide useful information and even have a good track record for accuracy in surveying the population (e.g., election polling and market research surveys). However, in one way or another, users need to be able to assess the relative robustness and uncertainty of a survey estimate. This cannot be done within the Frequentist framework.

2 Most of these issues are discussed in “Advances in Telephone Survey Methodology”, James M. Lepkowski, Clyde Tucker, J.
Michael Brick, Edith D. De Leeuw, Lilli Japec, Paul J. Lavrakas, Michael W. Link, Roberta L. Sangster, Wiley Series in Survey Methodology.

Non-Probabilistic Sampling and Bayesian Statistics

Moving forward requires us to step outside of the Frequentist framework and involve the field of Statistics known as Bayesian Statistics. The theoretical basis for Bayesian Statistics is that the uncertainty of an estimate and, therefore, its precision, can be based on what we know about the world, and what we know about the world can be based on opinions, expertise, other data sources, and updates to this knowledge based on the data we collect ourselves3. Bayesian statistics relies on models. These models, either through Bayes’ Theorem or through super-population models, allow us to generalize from a sample to the population parameters of interest4. The sample may come from a probability or a non-probability sample.

In the context of online surveys, saying that the sampling design is ignorable implies that there is no relationship between the variables we want to measure, denoted here by Y, and the probability that a person completes the online survey. This is a strong assumption: it would be hard to verify and it may not be true. A weaker, more realistic assumption is that the sampling mechanism is ignorable conditional on X, the variables we control for. This means that once we control for all the issues described above, we believe that the variables Y which we want to measure do not influence the probability that a person completes the survey. This can be interpreted as assuming that the estimates are unbiased under a super-population model conditioned on the relationship between Y and X. Regardless of the design, randomness is introduced into the relationship through the model.
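To make the idea of ignorability conditional on X more concrete, the following sketch works through expected values for a made-up population. Everything in it is an assumption for illustration: the two usage groups standing in for X, their population shares, the group-level proportions for Y, and the completion rates are hypothetical numbers, and the simple reweighting shown is one way of controlling for X, not a description of the adjustment models used in practice.

```python
# Hypothetical population split by X (Internet usage group). All numbers are illustrative.
population_share = {"heavy_user": 0.6, "light_user": 0.4}   # distribution of X
theta_by_group = {"heavy_user": 0.40, "light_user": 0.65}   # P(Y = 1 | X)
completion_rate = {"heavy_user": 0.10, "light_user": 0.02}  # P(complete | X); no dependence on Y

def population_proportion(share, theta):
    """True population proportion: group proportions weighted by population shares."""
    return sum(share[g] * theta[g] for g in share)

def expected_sample_share(share, rate):
    """Expected composition of the opt-in sample when completion depends only on X."""
    total = sum(share[g] * rate[g] for g in share)
    return {g: share[g] * rate[g] / total for g in share}

def naive_estimate(share, theta, rate):
    """Expected unweighted sample proportion, ignoring X."""
    sample = expected_sample_share(share, rate)
    return sum(sample[g] * theta[g] for g in sample)

def adjusted_estimate(share, theta, rate):
    """Expected proportion after reweighting each group back to its population share.

    Within a group the expected sample proportion equals theta[g],
    because completion does not depend on Y once X is fixed.
    """
    sample = expected_sample_share(share, rate)
    weights = {g: share[g] / sample[g] for g in sample}
    numerator = sum(weights[g] * sample[g] * theta[g] for g in sample)
    denominator = sum(weights[g] * sample[g] for g in sample)
    return numerator / denominator

print("population :", population_proportion(population_share, theta_by_group))
print("naive      :", round(naive_estimate(population_share, theta_by_group, completion_rate), 4))
print("adjusted   :", round(adjusted_estimate(population_share, theta_by_group, completion_rate), 4))
```

In this toy setting the unweighted estimate over-represents heavy users and is biased, while conditioning on X recovers the population value. If completion also depended on Y within a group, the adjustment would no longer be sufficient.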
The model under the Bayesian framework ultimately controls for those variables that are systematically different, correcting for unbalanced samples due to non-response, non-coverage or other non-random features of the design.

Ipsos is developing steps to ensure that the design issues in our online samples are conditionally ignorable5. These steps are discussed in the Ipsos paper “Our Brave New World: Blended Online Samples and the Performance of Non-Probability Approaches”6. To summarize, conditional ignorability can be achieved by combining multiple opt-in online panel and non-panel sources, which we refer to as blended online samples. In essence, by combining multiple sample sources, the “holes” in any one sample source can theoretically be filled by any one of the others.

All of the necessary adjustments and models are incorporated in X. Implicit in X are discrepancies and inadequacies from the design, and the biases from mode effects or measurement errors.

Under ignorability, an important concept in Bayesian statistics, one does not need to know the actual probabilities of selection. Formally, the sample design is considered ignorable if, under the joint prior distribution, the parameter and the response mechanism are independent of each other, or if the response mechanism is taken into account in the model so as to allow sensible inferences about the population parameter. In the context of this paper, the population parameter is θ.

3 For a discussion of probabilistic sampling from a Bayesian perspective see “Randomization in a Bayesian perspective”, J. B. Kadane and T. Seidenfeld. Journal of Statistical Planning and Inference, 25:329-345, 1990.
4 For a famous example of the use of Bayesian statistics to obtain population estimates see “Subjective Bayesian models in sampling finite populations”, W. A. Ericson. Journal of the Royal Statistical Society - Series B (Methodological), 31(2):195-233, 1969.
5 For a discussion of ignorable sample designs within the context of non-random sampling see “On the validity of inferences from non-random sample”, T. M. F. Smith. Journal of the Royal Statistical Society. Series A (General), 146(4):394-403, 1983.
6 Young et al. (2012) “Our Brave New World: Blended Online Samples and the Performance of Non-Probability Approaches”, Ipsos White Paper.

Bayesian Credibility Intervals

Credibility Intervals are often stated in a very similar way to Confidence Intervals, but the interpretation is quite different, and in many ways more intuitive. For example, the statement, “For results based on these sample sizes, the credible intervals are ±3 percentage points with 95% credibility” should be interpreted as: given the knowledge base of a practitioner and the results of the survey data he or she has collected, the probability that the true percentage of people giving a particular answer in the whole population is within ±3 percentage points of the survey estimate is 95%.

The discussion in this section uses the Bayesian framework and assumes that the sample design is conditionally ignorable. Bayesian probability intervals are also known as Credibility Intervals. They measure the degree of certainty one has in the value of a parameter based on one’s experience, understanding and knowledge of the population, tempered by the data that have been observed. Every probability statement, in this context, is based on what is known at the time the probability is stated. One’s experience, understanding and knowledge are termed the “knowledge base” for the purposes of this paper.
Examples of information that may be included in one’s knowledge base within the context of survey sampling include all other current published surveys, past surveys by the same pollster, historical political outcome data, a practitioner’s political expertise, predictive models and other sources. One’s knowledge base is not static, since it is continuously being updated with new information.

In the Bayesian framework, one’s knowledge base is one’s Prior Distribution, mathematically denoted by the function π(θ), where θ can be viewed as the population quantity in which we are interested7. For the remainder of this paper, we will look at a specific family of prior distributions. For the purposes of this document, θ is a proportion which assumes values between 0 and 1. This may reflect the proportion supporting a particular voter initiative or candidate. The family of prior distributions we are considering is the beta distribution. In effect, π(θ) ~ Beta(a, b) is a useful representation of our prior knowledge about the proportion θ. The quantities a and b are called hyper-parameters, and are used to express or model one’s prior opinion about θ. In other words, a judicious choice of a and b can state one’s belief that the parameter is nearer to 25% (a=1 and b=3), near 50% (a=1 and b=1) or nearer to 75% (a=3 and b=1)8. The choice of a and b also defines the shape of the probability curve, with a=1 and b=1 denoting a uniform or flat distribution. In effect, this is equivalent to believing that the true value of θ has the same chance of being any value between 0 and 1.

The hyper-parameters a and b are not limited to known constants. They too can be modeled as random quantities. This adds flexibility to the model, and it allows for data-based approaches to be considered, such as Empirical and Hierarchical Bayes.

For most survey, marketing and polling research estimates, credibility intervals are based on the Beta-Binomial posterior distribution. This first assumes that Y has a binomial distribution conditioned on the parameter θ, i.e., Y|θ ~ Bin(n, θ), where n is the size of our sample. In this setting, Y counts the number of “yes”, or “1”, responses observed in the sample, so that the sample mean (ȳ = Y/n) is a natural estimate of the true population proportion θ. This portion of the model is often called the likelihood function, and it is a standard concept in both the Bayesian and the Classical framework.

The posterior distribution in Bayesian statistics takes the likelihood function and combines it with our prior distribution. Using our prior Beta distribution, we obtain our Beta-Binomial posterior distribution. The Beta-Binomial posterior distribution is also a beta distribution: π(θ|y) ~ Beta(y + a, n - y + b). Its hyper-parameters are those of the prior distribution, i.e., one’s knowledge base, updated using the latest survey information9. In other words, the posterior distribution represents our opinion about the plausible values for θ, adjusted after observing the sample data.

7 For a detailed discussion about subjective probabilities see “Theory of Probability” by Bruno de Finetti (translation), 2 volumes, New York: Wiley, 1974-5.
8 The expected value of a Beta distribution is a/(a+b).
9 For a discussion of how to calculate one’s posterior distribution see “Bayesian Data Analysis - Second Edition”, A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin, Chapman and Hall/CRC, 2003.

Our credibility interval for θ is based on this posterior distribution. As mentioned above, these intervals represent our belief about the most plausible values for θ given our updated knowledge base. There are different ways to calculate these intervals based on π(θ|y). One approach is to create an estimator analogous to what is done within the Classical framework.
In this case, the credible interval for any observed sample is based on a prior distribution that does not include information from our knowledge base. This case occurs when we assume that the parameters of the beta distribution are a=1, b=1 and y=n/2. Essentially, these choices provide a uniform prior distribution where the value of θ is equally likely anywhere on the range between 0 and 1. In effect, our knowledge base is equally sure or unsure that the true value is near zero, 25%, 50%, 75% or 100%. Using a simple approximation of the posterior by the normal distribution, the 95% credible interval is given by, approximately: ȳ ± 1/√n.

This result points to one interesting consequence. If the sample size is large enough and the prior is uninformed, the credibility interval almost coincides with the confidence interval10. For this comparison we need to assume that random sampling was used. The coincidence occurs only when the information included in the sample dominates one’s prior, i.e., the information that one has at hand from one’s knowledge base.

Parameters for a more informed prior may be arrived at using empirical methods, but let’s say we were to choose a Beta prior with a=64, b=64 and y=n/2. This is akin to saying that, at worst, we believe the population parameter is near 50%, and that our belief is concentrated enough that 2 standard deviations under the prior distribution is ±8.8%. The posterior distribution under this scenario has posterior variance θ(1 - θ)/(n + 129). When θ=1/2, the Credible Interval is approximately ȳ ± 1/√(n + 129). Under this scenario, the parameter estimate ȳ has a 95% Credible Interval “tighter” than a 95% Confidence Interval. Here we have gained by using our knowledge base, i.e., the belief that the true value is near 50%. Bayesian analysis allows us to construct the 95% credible interval even in the case of a non-probability sample under the ignorability assumption.
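The two scenarios above (a uniform Beta(1, 1) prior and an informed Beta(64, 64) prior) can be compared numerically. The sketch below is a minimal illustration under stated assumptions: it uses the conjugate update Beta(y + a, n - y + b) and the normal approximation to the posterior, it picks n = 1000 and y = 500 purely as an example, and its function names are hypothetical. It is not presented as the authors' own computation.

```python
import math

def beta_posterior(y, n, a=1.0, b=1.0):
    """Conjugate update: Beta(a, b) prior plus y successes in n trials."""
    return a + y, b + (n - y)

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mean, math.sqrt(variance)

def credible_interval(y, n, a=1.0, b=1.0, z=1.96):
    """Approximate 95% credible interval via a normal approximation to the posterior."""
    post_a, post_b = beta_posterior(y, n, a, b)
    mean, sd = beta_mean_sd(post_a, post_b)
    return mean - z * sd, mean + z * sd

def confidence_interval(y, n, z=1.96):
    """Classical 95% confidence interval for a proportion (assumes random sampling)."""
    p = y / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

def half_width(interval):
    low, high = interval
    return (high - low) / 2.0

# Illustrative sample: n and y are assumptions, chosen so that y = n/2.
n, y = 1000, 500
print("classical CI half-width    :", round(half_width(confidence_interval(y, n)), 4))
print("uniform prior, Beta(1, 1)  :", round(half_width(credible_interval(y, n, 1, 1)), 4))
print("informed prior, Beta(64,64):", round(half_width(credible_interval(y, n, 64, 64)), 4))
print("2 prior sd for Beta(64, 64):", round(2 * beta_mean_sd(64, 64)[1], 4))
```

With the uniform prior the credible interval is almost the same width as the classical interval, while the informed prior shortens it, at the price of relying on the belief that θ is near 50%. An exact interval would use Beta quantiles rather than the normal approximation.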
10 The 95% Confidence Interval would be ȳ ± 1.96·√(ȳ(1 - ȳ)/n), which is approximately ȳ ± 0.98/√n when ȳ = 1/2.

Non-Ignorable Design

Ignorable designs assume that the mechanism that causes elements to be missing from the sample operates at random, often referred to as missing at random or missing completely at random. Basically, the likelihood function conditioned on the parameter θ is independent of the missing data mechanism. Since the process for updating the prior distribution for θ uses this conditional relationship between the data and the parameter θ, a design that is non-ignorable has data generated by a parameter other than θ, say θ1. In other words, the process may naively update θ without accounting for the difference between θ and θ1. Under our framework, this is model misspecification.

Non-ignorable missing data requires additional analysis to decompose θ across the population segments into a set θ = {θa, θb, …, θz}. The prior under this model usually assumes a multinomial distribution. This is called a non-ignorable adjustment cell model11. Under this model, adjustment cells are constructed and updates occur across cells. It will require analyses of existing data sources to identify cells where the θ parameters may be different. The purpose of the approach is to create cells where ignorability exists, and then we can rely on this model to construct our credibility interval.

Here it is important to note that models to adjust within-cell discrepancies can be employed either at the pre-survey design stage or at the post-survey weighting adjustment stage. For the most part, we have focused our attention on the latter solution (see “Our Brave New World: Blended Online Samples and the Performance of Non-Probability Approaches” for further detail on such an approach).

Discussion

How different are the Bayesian approach and its corrections from classical concerns and approaches? Simplistically, not very. Ipsos is using the Bayesian approach to frame the problem in a way that allows us to use the wealth of information available outside of the current survey to calibrate survey results and to provide a statistical foundation for inference even when a probabilistic sample is unavailable.

The missing data framework found in the work of Little and Rubin provides a means of dealing with non-probability samples. When examining conditions for ignorable designs, model misspecification is very similar to the biases that arise from poor construction of a sample frame. If we consider opt-in panels as a sample frame construction issue, the question is whether a random sample from the panel is unbiased and, if not, what the level of bias is. We have discussed how the Bayesian approach incorporates potential bias from the sample frame construction within the estimation (θ versus θ1), allows us to incorporate data even from non-probability samples (Bayesian updating under conditionally ignorable designs), and provides a probabilistic assessment of uncertainty under these circumstances (the posterior distribution and its credible intervals).

To paraphrase Elliott and Little12, one criticism of “their” work is that it is explicitly Bayesian. Hence it incorporates subjective elements through the choices of model and prior. Every method in modern survey sampling requires subjective assumptions, even when data are left unadjusted or when methods are rejected. “The Bayesian framework makes these assumptions explicit and open to debate, rather than implicit in the estimation algorithm.” We hope this paper accomplishes this: that it opens the discussion and the use of data from blended online samples together with the large quantities of information available outside of the study.

11 See Little and Rubin, Statistical Analysis with Missing Data, New York, NY: Wiley, (1987).
12 Elliott, M.R.
and Little, R.J.A., “A Bayesian Approach to Combining Information from a Census, a Coverage Measurement Survey and Demographic Analysis,” Journal of the American Statistical Association, (2000), 95, 450, 351-362.

Contact

For more information, please contact:

Alan Roshwalb, Senior Vice President, US, Ipsos Public Affairs, 202.420.2029
Clifford Young, President, US, Ipsos Public Affairs, 202.420.2016

About Ipsos Public Affairs

Ipsos Public Affairs is a non-partisan, objective, survey-based research practice made up of seasoned professionals. We conduct strategic research initiatives for a diverse number of American and international organizations, based not only on public opinion research, but also on elite stakeholder, corporate, and media opinion research.

Ipsos has media partnerships with the most prestigious news organizations around the world. In Canada, the U.S., UK, and internationally, Ipsos Public Affairs is the media polling supplier to Reuters News, the world’s leading source of intelligent information for businesses and professionals.

Ipsos Public Affairs is a member of the Ipsos Group, a leading global survey-based market research company. We provide boutique-style customer service and work closely with our clients, while also undertaking global research.

To learn more, visit: www.ipsos-na.com

Copyright© 2016 Ipsos Public Affairs. All rights reserved. 16-03-09