THE RELIABILITY OF THE ANES
FEELING THERMOMETERS: AN
OPTIMISTIC ASSESSMENT
Robert N. Lupton
Michigan State University
[email protected]
William G. Jacoby
Michigan State University
[email protected]
Prepared for presentation at the 2016 Annual Meetings of the Southern
Political Science Association. San Juan, Puerto Rico, January 7, 2016.
ABSTRACT
The feeling thermometer is a rating-scale survey question intended to measure affective responses
toward a stimulus object, such as a political candidate or party. Items using this format have
been included on the American National Election Study’s interview schedule since 1964. Recently,
however, they have been criticized for being “notoriously unreliable.” To our knowledge, the reliability of the feeling thermometer items has never been tested directly. We employ methods
for three-wave panel data to test the reliability of the thermometer items included in the 1980 ANES
Major Panel Study and the ANES 2000-2002-2004 Panel Study. We show that most thermometer
items are extremely reliable, both in absolute terms and relative to other commonly-used items
from the ANES surveys. To the extent that there are thermometer ratings with lower reliability
levels, they are due more to the stimuli than to a problematic item format. For these reasons,
we believe the feeling thermometers can continue to be used as measures of affective responses to
candidates, parties, groups, and other political phenomena.
Feeling thermometers are ubiquitous in the study of American political behavior. But, despite
their wide usage, the feeling thermometer survey items have been criticized for possessing questionable measurement characteristics. While this criticism has taken several forms over the years, a
recent and particularly salient charge is that the feeling thermometers are “notoriously unreliable”
(Broockman, Kalla, Aronow 2015). Interestingly, however, the reliability of the individual feeling
thermometer items apparently has never been tested directly. This seems like a major omission
for survey questions that are used so widely.
The current study attempts to fill this void. Using a methodology for measuring the reliability of
a single item with three-wave panel data, we find that the feeling thermometers are quite reliable.
And, variations in reliability seem to be explained by the nature of the stimulus objects being
evaluated, rather than the format of the thermometer items. In the interest of brevity, the analysis
reported below only uses data from the American National Election Studies’ (ANES) 1980 Major
Panel Study and the ANES 2000-2002-2004 Panel Study. But, it is important to emphasize that
the results generalize: Virtually identical findings are produced using the other three-wave panel
studies that the ANES deployed over the years. While there may be other measurement concerns
about the feeling thermometers, item reliability is not a hindrance to using these survey questions
in order to elicit affective responses to stimulus objects.
BACKGROUND
The term “feeling thermometer” refers to a question format used to obtain rating scale responses
from survey respondents toward specific stimulus objects. These items were first used in the 1964
ANES Pre-Post Survey to obtain evaluations of various sociodemographic and political groups.
Starting in 1968, feeling thermometers have been used for presidential candidates, other political
figures (e.g., vice-presidential candidates), and governmental institutions (e.g., the U. S. Supreme
Court). While the exact wording has varied somewhat, depending upon the year and the interview
format (face-to-face or over the telephone), a typical version of the question is as follows:
Please look at page 2 of the booklet. I’d like to get your feelings toward some of our
political leaders and other people who are in the news these days. I’ll read the name
of a person and I’d like you to rate that person using something we call the feeling
thermometer. Ratings between 50 degrees and 100 degrees mean that you feel favorable
and warm toward the person. Ratings between 0 degrees and 50 degrees mean that you
don’t feel favorable toward the person and that you don’t care too much for that person.
You would rate the person at the 50 degree mark if you don’t feel particularly warm or
cold toward the person. If we come to a person whose name you don’t recognize, you
don’t need to rate that person. Just tell me and we’ll move on to the next one.
The booklet page mentioned in the question contains a stylized rendering of a thermometer,
with nine labelled points ranging from 100° (labelled “Very warm or favorable feeling”) through
50° (labelled “No feeling at all”) to 0° (labelled “Very cold or unfavorable feeling”). In principle,
respondents can answer a thermometer question with any number between zero and one hundred;
thus, the format could be regarded as a 101-point scale. In reality, most respondents give integer
responses that end in zero (i.e., 50, 60, etc.), leading many researchers to treat the thermometers
as nine- or eleven-point scales.
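Because nearly all responses fall on multiples of five or ten, analysts who treat the thermometer as an eleven-point scale typically recode the raw scores. A minimal sketch of such a recoding (the function name and rounding rule are our own illustration, not an ANES convention):

```python
def to_eleven_point(score):
    """Collapse a 0-100 thermometer rating to the nearest multiple of ten,
    yielding one of the eleven values 0, 10, ..., 100 (hypothetical helper)."""
    if not 0 <= score <= 100:
        raise ValueError("thermometer ratings run from 0 to 100")
    return min(100, int(score / 10 + 0.5) * 10)
```

A rating of 67, for example, is assigned to the 70 bin, while the neutral rating of 50 is unchanged.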
The feeling thermometers have proven to be extremely popular, and they (or close variants) have
been deployed by several different survey research organizations beyond the ANES (e.g., Alwin 1997;
Delavande and Manski 2010; Adida, Davenport, McClendon 2015). Feeling thermometers have
been used for many different purposes, including measuring attitudes toward single candidates and
political figures (e.g., Reynolds 1974; Feldman 1988; Adida et al. 2015); modeling vote intentions
for single candidates (Delavande and Manski 2010) or between a pair of candidates (Markus and
Converse 1979; Page and Jones 1979; Jacoby 1999); developing multidimensional representations
of ideological orientations (Conover and Feldman 1981) and party identification (Weisberg 1980),
operationalizing spatial theories of voting (e.g., Rabinowitz 1978; Rabinowitz and Macdonald 1989),
modeling public perceptions of presidential candidates (Weisberg and Rusk 1970; Jacoby 1986;
Jacoby and Armstrong 2014), and modeling perceptions of, and affect toward, sociodemographic
groups (Zinni, Mattei, Rhodebeck 1997; Kalkan, Layman, Uslaner 2009). Thus, analyses of feeling
thermometer responses have played a major role in studies of the mass public’s orientations toward
politics and the social world more generally.
Potential Measurement Problems
Given their widespread usage, it may be somewhat surprising to note that the measurement
characteristics of feeling thermometer ratings remain open to some serious questions. For example,
most analysts treat the thermometers as if they provide interval-level measurement— although few
say so explicitly. Nevertheless, several researchers have suggested that the function relating “true”
feelings about stimulus objects to thermometer ratings of those objects is monotonic rather than
linear in form (e.g., Rabinowitz 1976; Winter and Berinsky 1999; Jacoby 1999), thereby implying
that the proper level of measurement is ordinal. Another potential problem is that different people
may use the thermometer scale in different ways. For example, some may use the full range of
possible values from zero to 100 to express their feelings toward a battery of stimuli, while others
confine their ratings to a narrower interval along the 101-point scale. Or individuals may make
finer distinctions about their feelings toward objects they like than they do toward objects they
dislike. In either case, this is the problem of interpersonal incomparability of scores, or differential
item functioning. Several studies have used indirect evidence to suggest that this may be occurring
among thermometer ratings (e.g., Brady 1985; Wilcox, Sigelman, Cook 1989; Winter and Berinsky
1999).
Still another potential problem involves the effects of random measurement error on the thermometer scores. To the extent that this occurs, it lowers the reliability of the measurement. And,
recent events suggest that this might be the case: Broockman, Kalla, and Aronow (2015) state
that “Feeling thermometers are notoriously unreliable survey items. That is, in a technical sense,
subject’s responses to feeling thermometers typically contain a fairly large amount of random measurement error” (page 14). Now, the paper in which this statement appears is unpublished and
currently only available online. But it still has had a major impact on the social science research
community: It was precisely this supposed unreliability that led Broockman and his colleagues
to question the integrity of the data analysis in the article, “When Contact Changes Minds: An
Experiment on Transmission of Support for Gay Equality,” by Michael J. LaCour and Donald P.
Green (2014).
Broockman et al. show that there are implausibly high test-retest correlations between feeling
thermometers for “gay men and lesbians” on different waves of a panel study used in the LaCour
and Green article. The average of the six pairwise coefficients (across the four waves of the panel
study) is 0.952. We agree completely that such correlations are far too large to be believable.
But, we are more hesitant about the motivation that led to this conclusion. Broockman et al.
provide no evidence about the poor measurement properties of the feeling thermometers beyond
the conventional wisdom or common knowledge that seems to be implied by the term “notoriously
unreliable” (emphasis added). We believe that more systematic investigation is necessary before
tarring a seemingly useful battery of survey items with such a broad brush.
We start with the general assumption that no measurement is perfect— that is, random errors
always occur, regardless of the exact context of the measurement. Hence, all measures are less than
perfectly reliable. In the context of survey research, feeling thermometers are intended to reduce
measurement error by giving respondents a “longer” continuum on which to place themselves
than is the case with three-, five-, or seven-point bipolar response scales (often called “Likert-type
items”). If respondents really do represent their true feelings about the stimulus object with greater
precision, then the feeling thermometer format should reduce random measurement error. But,
many researchers have noted that, despite the potential for 101 different scores, the vast majority
of respondents place their ratings at one of the nine labelled points on the feeling thermometer card or
at one of the eleven values that end in zero (e.g., Rabinowitz 1976; Alwin 1992; 1997; Broockman et
al. 2015). The resultant “lumpy” distributions of scores may suggest that people are not capable of
making the fine distinctions implied by the relatively continuous thermometer format (e.g., Winter
and Berinsky 1999). If that is the case, then error is introduced, again reducing the reliability of
the resultant measure. Thus, there probably is reason for concern about the reliability of the feeling
thermometers.
Prior Research on Measurement Error and Reliability
Previous research on this topic is surprisingly limited. Green (1988) used a structural equation
model with feeling thermometers as manifest indicators of latent variables to test multidimensional
representations of party identification and ideology. He shows that conclusions about multiple partisan and ideological dimensions probably were incorrect, resulting from the combination of random
and nonrandom measurement error in the feeling thermometers. Nevertheless, Green’s results also
show that the feeling thermometers perform at least as well as other indicators of latent partisan and ideological affinities. Furthermore, they show robust correlations with relative candidate
preferences— which are measured by taking differences between feeling thermometer ratings of the
Democratic and Republican presidential candidates. So, while there certainly is measurement error
present, the variables work in a manner consistent with theoretical expectations. Green concludes
that “(t)he evidence . . . speaks well not only for the validity of the measures used in our model but
also for the substantive importance of group evaluations as a topic of study” (1988, page 776).
Krosnick (1991) examined the temporal stability of symbolic versus non-symbolic attitudes.
The former category included party identification and ideology. Krosnick included attitudes toward social groups in the latter category, and six of his seven measures were based on thermometer
ratings. He reported that “the party (and) ideology . . . reliabilities are higher than the . . . social
groups reliabilities . . . which in turn are significantly larger than the efficacy and trust reliability”
(page 561). But, any comparison for present purposes is inconclusive, because Krosnick also used
feeling thermometers to measure party identification and ideological orientations. Thus, the reliabilities of the thermometer-based measures range from high to moderate levels; it is difficult to
draw more precise conclusions from the results presented in Krosnick’s article.
Probably most relevant to the present study, Duane Alwin carried out several analyses comparing the reliabilities of feeling thermometer items to the reliabilities of survey items with fewer
response categories. One study used three-wave panels from the ANES (Alwin 1992) and the other
employed a multitrait-multimethod approach to examine items from the University of Michigan’s
1978 Quality of Life Survey (Alwin 1997). The results in both studies showed that feeling thermometers produce more reliable measures than do response formats with fewer categories. Nevertheless,
there are limitations to the utility of these findings for present purposes. Specifically, the NES
results are only presented as average reliabilities across a set of items; therefore, it is impossible
to discern from Alwin’s article the reliability of specific thermometer items. And, the Quality of
Life Survey had respondents use the thermometers to rate their satisfaction with various aspects
of their own lives. This is a very different substantive context than the evaluations of political
stimulus objects for which the feeling thermometers are used by most political scientists.
A recent major study of measurement error and reliability in survey questions focuses on the
methodological and substantive benefits of combining responses to individual items into multiple-item measures (Ansolabehere, Rodden, Snyder 2008). Several of the examples used in this analysis
combined feeling thermometers with other items to form relatively reliable measures of issue attitudes. We certainly agree that the benefits of multiple-item measures are enormous. But, the use
of such measures implies that the scores on the constituent items are not of intrinsic interest because they are merely error-laden replications of the attitude object being tapped by the multi-item
measure. Thus, Ansolabehere et al. really do not address the reliability of the feeling thermometer
items per se, other than asserting that each one is relatively unreliable when taken on its own.
To summarize, the overall number of relevant studies is very small. And several of those had
different objectives than evaluating the reliability of the feeling thermometers. Furthermore, the
results often are not presented in great detail, making it impossible to evaluate the reliability of
thermometers toward specific objects. In several of the studies, the feeling thermometers themselves are regarded as limited manifest indicators of more general concepts. We believe the latter assumption may be problematic because each feeling thermometer is intended to measure a person’s attitude toward a specific stimulus object. The stimuli represent important elements of the
political world, such as presidential candidates, prominent public figures, political parties, governmental
institutions, and sociodemographic groups. Therefore, we believe that more attention needs to be
given to the reliabilities of the feeling thermometer items.
METHODOLOGICAL STRATEGY AND DATA
Following standard principles of measurement theory, we assume that a feeling thermometer
score, shown as X, can be broken down into two components as follows:
X_i = T_i + E_i        (1)
In equation (1), X_i is respondent i’s thermometer rating of some stimulus, T_i is i’s “true” attitude toward the stimulus, and E_i is the error or discrepancy between the observed thermometer score
and the actual attitude. Taking the variance of X produces:
σ²_X = σ²_T + σ²_E + 2σ_{T,E}        (2)
The last term on the right-hand side of equation (2) is the covariance of the true attitudes and the
errors. But, let us assume (following fairly standard practice) that the two are uncorrelated. In
that case, the last term in equation (2) drops out, and the total observed variance (σ²_X) breaks down neatly into two components: the variance in the true attitudes (σ²_T) and the variance of the errors (σ²_E). Dividing both sides by σ²_X shows that the total variance in the observed thermometer scores can be divided into the proportion that is due to variance in the true attitudes and the proportion that is due to error. The former is defined as the reliability of the feeling thermometer, X, when it is used as a measure of the attitude toward the object:

Rel_X = σ²_T / σ²_X        (3)
Of course, equation (3) cannot be evaluated directly, because the true attitude, T, is unobserved.
Therefore, it must be estimated indirectly.
If X were a scale, defined as a linear combination of multiple indicators of T, then the reliability of X could be estimated by evaluating the internal consistency of the indicators, defined as the intercorrelations among the constituent indicators. This is precisely the basis for Cronbach’s coefficient
α, by far the most widely-used estimate of scale reliability (e.g., Nunnally and Bernstein 1994).
Unfortunately, we cannot use this approach because we do not have multiple indicators. Again, we
are regarding each thermometer as a single measure of attitude toward the stimulus object.
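For context, the internal-consistency logic behind coefficient α can be sketched directly from its textbook formula (the code below is illustrative, not drawn from any of the studies cited):

```python
def cronbach_alpha(items):
    """Cronbach's coefficient alpha for k items, each given as a list of
    responses with respondents aligned by position (textbook formula)."""
    k = len(items)

    def sample_var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)

    # Each respondent's total scale score across the k items.
    totals = [sum(resp) for resp in zip(*items)]
    return (k / (k - 1)) * (1.0 - sum(sample_var(it) for it in items) / sample_var(totals))
```

Three perfectly correlated indicators yield α = 1; as the inter-item correlations fall, α falls toward zero.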
With single measures, the usual approach is to use test-retest correlations to obtain estimates
of reliability. The logic is that, if a question is administered to respondents at two time points, and
the underlying attitude does not change during the interval between the two administrations, then
the only source of empirical change is error. Hence, one minus the correlation between the thermometer scores at the two time points should give the proportion of the observed variance that is due to error. And, since the variance breaks down into two additive components, the correlation itself is the proportion due to “true” variance, or the reliability.
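The test-retest logic is easy to verify by simulation. The sketch below assumes a perfectly stable true attitude and uses arbitrary, hypothetical variance components:

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    num = sum(x * y for x, y in zip(da, db))
    return num / (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5

# Two administrations of one item: a stable true attitude T plus independent
# errors drawn anew at each wave. The variance components are hypothetical.
rng = random.Random(7)
x1, x2 = [], []
for _ in range(100_000):
    t = rng.gauss(50.0, 20.0)              # true-attitude variance = 400
    x1.append(t + rng.gauss(0.0, 10.0))    # error variance = 100
    x2.append(t + rng.gauss(0.0, 10.0))
r = pearson(x1, x2)
# The test-retest correlation recovers the reliability, 400/500 = 0.8.
```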
Again, the preceding logic is based on the assumption that the true attitude is perfectly stable.
In fact, it might change over time. And, with only two administrations of the feeling thermometers,
there is no way to differentiate substantive change in the true attitude from error fluctuations in
the scores. On the other hand, if the thermometers are administered at three time points, then
there are several ways to differentiate temporal stability in the true attitude from reliability in the
thermometer scores. Let j be an index for the time point, so that j = 1, 2, 3. Now, if we are willing
to assume that the T_j's and X_j's are both standardized to unit variance at each time point, that T_j is related to T_{j-1} by a single-lag autoregressive process, and that the reliability is constant (i.e., it does not vary across the time points), then Heise (1969) showed that the reliability of X can be estimated from the correlation matrix for the X_j's. Wiley and Wiley (1970) relax some of Heise's
assumptions. They continue to assume the single-lag autoregressive process across the true scores and also that the error variance, σ²_E, is constant over time. But they argue that it is more realistic to assume that the variances of the true attitudes do change over time. This implies that there can be different reliabilities for X_j at each time point, rather than a single global estimate for X. Wiley and Wiley (1970) show that maximum likelihood estimates of the three reliabilities can be obtained from the covariance matrix for the X_j's. We will use the Wiley and Wiley approach to
estimate the reliabilities of feeling thermometer items and, for comparison, several other items from
some of the ANES panel studies.
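The Wiley-Wiley point estimates follow directly from the three pairwise covariances. The sketch below is our own illustration of the estimator under the stated assumptions, checked against simulated data with hypothetical parameter values; it is not the authors' code:

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def wiley_wiley(x1, x2, x3):
    """Wiley-Wiley (1970) reliability estimates for one item observed in three
    panel waves. Assumes a lag-one autoregressive true score and a constant
    error variance; returns a reliability for each wave."""
    c12, c23, c13 = cov(x1, x2), cov(x2, x3), cov(x1, x3)
    var_t2 = c12 * c23 / c13            # true-score variance at wave 2
    var_e = cov(x2, x2) - var_t2        # constant error variance
    true_vars = (cov(x1, x1) - var_e, var_t2, cov(x3, x3) - var_e)
    return [vt / cov(x, x) for vt, x in zip(true_vars, (x1, x2, x3))]

# Quick check on simulated data (all parameter values are hypothetical).
rng = random.Random(3)
x1, x2, x3 = [], [], []
for _ in range(100_000):
    t1 = rng.gauss(0.0, 10.0)                      # Var(T1) = 100
    t2 = 0.9 * t1 + rng.gauss(0.0, 30.0 ** 0.5)    # Var(T2) = 111
    t3 = 0.8 * t2 + rng.gauss(0.0, 30.0 ** 0.5)    # Var(T3) = 101.04
    x1.append(t1 + rng.gauss(0.0, 5.0))            # error variance = 25
    x2.append(t2 + rng.gauss(0.0, 5.0))
    x3.append(t3 + rng.gauss(0.0, 5.0))
rels = wiley_wiley(x1, x2, x3)
# True reliabilities are 100/125, 111/136, and 101.04/126.04.
```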
Over the years, the ANES has conducted five three-wave panel studies that include feeling thermometers for the same stimulus objects over all three waves.1 Here, we will analyze data from two
of these panel studies. First, the 1980 ANES Major Panel Study interviewed survey respondents in
January, June, and September 1980. Thus it measures individual attitudes across the course of a
presidential campaign. The 1980 Major Panel Study includes eleven feeling thermometers toward
individual political figures (most of whom were presidential candidates), six more for the Democratic Party, the Republican Party, Liberals, Conservatives, Independents, and Political Parties
(generally). The Study also includes the standard seven-point party identification index and the
seven-point ideological self-placement variable, along with four additional issue attitudes measured
on seven-point bipolar scales.
The second data source is the 2000-2002-2004 ANES Panel Study which includes 29 feeling
thermometers that were administered on all three waves. Seven of these are for individual people,
two are for Liberals and Conservatives, four tap attitudes toward governmental institutions, and 16
measure evaluations of social groups. This Panel Study also includes the party identification index
on all three waves. And there are questions asking respondents whether federal spending should
be increased, decreased, or kept at current levels in each of ten policy areas.
Currently, we examine only the 1980 and 2000-2002-2004 panel studies in order to keep the
paper relatively brief. But, future iterations will incorporate data from all five ANES panel studies that contain suitable data. For now, we want to emphasize that the results we present below
1 Feeling thermometers were included in all three waves of the 1972-1974-1976 ANES Panel Study, the 1980 Major Panel component of the 1980 ANES, the 1990-1991-1992 ANES Panel Study, the 1992-1994-1996 ANES Panel Study, and the 2000-2002-2004 ANES Panel Study.
definitely are generalizable. Reliabilities for the feeling thermometers from the other three panel
studies are very similar to those we analyze here.
EMPIRICAL RESULTS
The Wiley-Wiley estimates of the reliabilities for the feeling thermometer items in the 1980
Major Panel Study are given in Table 1, and they are depicted graphically in Figure 1. The
reliabilities range over a fairly wide interval, from a minimum of 0.395 (for Political Parties in
general in the June 1980 panel wave) to a maximum of 0.940 (for George H. W. Bush in the
January 1980 wave). But, the central tendency for the 1980 feeling thermometers leans toward
the upper range, with a mean reliability of 0.732. On average, nearly three-fourths of the variance
in the thermometer ratings is shared with the underlying attitudes toward the respective stimulus
objects. This certainly appears to be an acceptable level of reliability for survey data, where
previous research showed much lower average reliabilities for a variety of different kinds of survey
items (Krosnick 1991; Alwin 1992).
Admittedly, the feeling thermometers are not perfect measures. Here, the mean reliability still
leaves 26.8% of the observed variance in the feeling thermometers as error. Of course, we want the
error variance to be as small as possible. So, is this too much? Like judgments about the size of
the R² in a regression equation, it is difficult to evaluate particular reliability values on their own.
Therefore, it is useful to present some comparative evidence. Table 2 and Figure 2 present the
reliabilities for the other items from the 1980 Major Panel Study.
As expected, the reliabilities for the party identification index are consistently high, ranging
from 0.862 to 0.870. Reliability is lower for ideological self-placement, with coefficients ranging
from 0.635 to 0.657. The reliabilities for the issue questions range from 0.498 (for programs to
help Russia in September 1980) to 0.724 (for defense spending in June 1980), with a mean value of
0.616. So, on survey questions about policy issues, just over 60% of the observed variance is shared
with the true attitude, on average. Of course, this is quite a bit lower than the mean for the feeling
thermometers.
Let us next consider the results for the 2000-2002-2004 Panel Study. The reliabilities for the
feeling thermometer items are presented in Table 3 and Figure 3. Here, the coefficients vary over a
wider range, from a minimum of 0.338 (for evaluations of Blacks in 2004) to a maximum of 0.888
(for evaluations of Dick Cheney in 2004). The mean reliability for these thermometers is 0.597. In
the twenty-first century data, the average feeling thermometer variable only shares about 60% of
its variance with the corresponding underlying attitude.
What accounts for the apparent difference between the results obtained for the two datasets?
One obvious possibility is the differing sets of stimulus objects to which the thermometers were
applied. In the 1980 data, 11 of the 15 stimuli were individual figures, and the remaining four
were related to the political parties. In the 2000-2002-2004 data, only seven out of 29 stimuli
were individual political figures; the rest were either related to the federal government or were
sociodemographic groups. Even a quick glance at Figure 3 reveals that the objects in the latter two
subsets tend to have lower reliabilities than do those in the first subset. Specifically, the reliabilities
for individual figures range from 0.706 (for Jesse Jackson in 2002) to 0.888 (again, for Cheney in
2004), with a mean value of 0.787. The coefficients for the remaining stimulus objects range from
0.338 (again, for Blacks in 2004) to 0.745 (for fundamentalists in 2004), with a mean of 0.537.
Thus, the critics may have a point when it comes to evaluations of institutions and groups. For
those stimuli, the feeling thermometers share just over half of their variance (i.e., 53.7%) with the
underlying attitudes. But, the much higher reliabilities for the thermometer ratings of individual
figures strongly suggests that the problem lies with the nature of the stimulus objects rather than
with the feeling thermometer format, itself.
Once again, some comparative evidence should prove useful. Table 4 and Figure 4 provide the
2002 through 2004 reliability estimates for the party identification index and for the 10 questions
about federal spending on specific policies. Party identification again shows extremely high reliability, with coefficients from 0.898 in the first panel wave to 0.912 in the third wave. In contrast,
the spending questions have much lower and more variable reliabilities. The coefficients for the latter
items range from a very low 0.263 (for spending on programs to reduce crime in 2004) to a more
acceptable 0.699 (for spending on programs to help Blacks in 2000). The mean reliability for the
spending items is 0.509. We believe these results are quite telling. The reliabilities for another
battery of widely-used survey questions from the ANES (i.e., the spending items) tend to be quite
a bit lower than the reliabilities for the feeling thermometers— even taking into account the fact
that a number of the latter are also quite low. Overall, the feeling thermometers come off very well
in comparisons to other survey questions.
While the evidence from the two ANES datasets provides very few examples of unacceptably
low reliabilities among the feeling thermometers, the fact remains that the reliability does vary
markedly from one stimulus object to the next. And this raises the question of why the variability
exists. As we saw earlier, with the 2000-2002-2004 data, the reliabilities for institutions and groups
are almost uniformly lower than those for individual people. This suggests that the reliability for a
given thermometer is affected by the nature of the stimulus object being evaluated. But rather than
focusing on substantively-defined categories of the objects (i.e., groups versus individuals, etc.), we
will consider three characteristics that are common to all of the objects being evaluated with the
feeling thermometers.
First, there is the degree of polarization with respect to the object. That is, how divided are the ANES respondents in their opinions about the “target” of the feeling thermometer? This
characteristic is measured very easily by simply taking the variance of the thermometer ratings for
each object. Polarization is hypothesized to have a positive effect on item reliability, since more
extreme opinions typically are more strongly held and resistant to change.
Second, there is the degree of opinionation with respect to each object. Stated differently, this
is the extent to which people possess nonneutral feelings about each stimulus object. Opinionation
also is expected to have a positive effect on reliability, since a neutral response implies that a person
really has no affective reaction one way or the other with respect to the object. Opinionation is
measured by taking the proportion of ANES respondents who give feeling thermometer responses
that are either greater or less than 50 (i.e., the neutral value on the thermometer scale).
Third, there is the degree of consensus in feelings about the stimulus object. Here, consensus is
defined as a situation in which most respondents report the same general evaluation of an object.
For example, there appears to be a consensus of very positive feelings toward the elderly. This
stimulus almost uniformly receives very high thermometer scores, with mean values of 80.815,
76.316, and 78.514 in 2000, 2002, and 2004, respectively. Of course, there is variability around
each of these central tendencies; the corresponding standard deviations are 17.500, 17.697, and
17.282, respectively. So even the lower evaluations of the elderly will still tend to fall above the
neutral point of 50. We hypothesize that consensus will be negatively related to reliability. Our
reasoning is that, if everyone feels the same way about a given stimulus object, then variability
in ratings will tend to stem from random fluctuations rather than meaningful differences in the
thermometer scores. For each stimulus, consensus is measured by summing the thermometer scores
greater than fifty, summing the scores less than fifty, and then taking the absolute value of the
difference between the two sums.2 To the extent that thermometer scores are either consistently
positive or consistently negative, the absolute difference between the two sums should be greater
than zero.
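All three characteristics can be computed directly from a vector of thermometer ratings. The sketch below follows the definitions given above; the function name and example ratings are our own:

```python
def item_characteristics(scores):
    """Polarization, opinionation, and consensus for one set of thermometer
    ratings, following the definitions in the text (ratings run 0-100)."""
    n = len(scores)
    mean = sum(scores) / n
    polarization = sum((s - mean) ** 2 for s in scores) / n   # variance of the ratings
    opinionation = sum(1 for s in scores if s != 50) / n      # share of nonneutral ratings
    warm = sum(s for s in scores if s > 50)                   # sum of scores above 50
    cold = sum(s for s in scores if s < 50)                   # sum of scores below 50
    consensus = abs(warm - cold) / 10_000                     # rescaled as in footnote 2
    return polarization, opinionation, consensus

# Four hypothetical ratings: three nonneutral, mostly on the warm side.
pol, opin, cons = item_characteristics([70, 80, 50, 30])
```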
Polarization, opinionation, and consensus are used as independent variables in a regression
equation to predict the reliability of the thermometers. The OLS estimates of the model parameters
are obtained separately for the two ANES datasets. Note that the three “observations” for each
stimulus object— corresponding to the three waves of each panel study— are not independent of
each other. Therefore, robust standard errors are reported, clustered on the 15 stimuli in the 1980
data, and on the 29 stimuli in the 2000-2002-2004 data. Preliminary results for the 1980 data
revealed that the three observations for George H. W. Bush are outliers that are highly influential
on the estimates. Therefore, a dummy variable is included in the 1980 model coded one for these
three observations and zero otherwise. Finally, inspection of diagnostic plots revealed a strongly
nonlinear relationship between polarization and reliability. This pattern appeared in both datasets.
Accordingly, the regressor for that variable in each model is actually -1 times the inverse of the
polarization value for each stimulus object. This nonlinearity makes it a bit more difficult to
describe the relationship between polarization and reliability. But, it definitely provides a more
accurate depiction of the structure in the data than does a linear functional form.
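The estimation strategy described above can be sketched with synthetic data. Everything here is illustrative: the data-generating process, coefficient values, and variable names are our inventions, not the authors' data. The sketch fits OLS with the transformed polarization regressor and computes cluster-robust (sandwich) standard errors clustered on the stimulus, using only NumPy rather than a dedicated statistics package.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1980 setup: 15 stimuli x 3 panel waves.
n_stim, n_waves = 15, 3
stim = np.repeat(np.arange(n_stim), n_waves)      # cluster ids
polar = rng.uniform(400.0, 800.0, n_stim)[stim]   # polarization
opin = rng.uniform(0.5, 0.95, n_stim)[stim]       # opinionation
cons = rng.uniform(0.0, 5.0, n_stim)[stim]        # consensus
# Reliability generated with the paper's functional form: the
# polarization regressor enters as -1 times its inverse.
y = (0.5 + 80.0 * (-1.0 / polar) + 0.6 * opin - 0.03 * cons
     + rng.normal(0.0, 0.02, n_stim)[stim])       # cluster-level noise

X = np.column_stack([np.ones_like(y), -1.0 / polar, opin, cons])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Cluster-robust (sandwich) standard errors, clustered on stimulus.
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((4, 4))
for g in np.unique(stim):
    sg = X[stim == g].T @ resid[stim == g]
    meat += np.outer(sg, sg)
se = np.sqrt(np.diag(bread @ meat @ bread))
print(np.round(beta, 3), np.round(se, 3))
```

The recovered coefficients carry the expected signs (positive for the transformed polarization regressor and opinionation, negative for consensus), which is the pattern the regression results in Table 5 exhibit.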
The OLS estimates for the regression models are shown in Table 5. Note that the model fit is
very good in both datasets, with R2 = 0.739 in the 1980 data, and R2 = 0.834 in the 2000-2002-2004
data. The standard deviation of the residuals is 0.060 for the 1980 model and 0.059 for the 20002002-2004 estimates, showing that the average accuracy with which reliability can be predicted
from the regressors is basically identical for both equations. The signs on the coefficients for the
three main independent variables are all consistent with prior expectations. And, again, the results
are very similar for the two equations. Polarization and opinionation both show significant positive
effects on reliability while consensus has a significant negative impact. The outlying observations
for Bush show up clearly with the highly significant positive coefficient for that variable.
2. This variable’s values are divided by 10,000 in order to make the coefficients in the regression model more reasonable.
Given the transformations carried out on two of the independent variables, the coefficients are
somewhat difficult to interpret on their own. Therefore, effects plots for the two models are shown
in Figures 5 and 6. Each panel in the two figures shows the predicted reliability across the range of
one independent variable, with the other independent variables held at their mean values. The first
panel in each figure shows how polarization corresponds to increasing reliability, with the rate of
increase getting smaller across the range of the polarization variable. Although discernible in both
panels, the nonlinearity is especially pronounced in the 2000-2002-2004 data. Similarly, the overall
effect of polarization is greater in the latter dataset. In 1980, reliability increases approximately
0.15 units, from about 0.65 to approximately 0.8 across the range of polarization values. In the
later dataset, the total increase is about 0.4 units, from about 0.36 to about 0.76. The linear effects
of the other two independent variables are clearly evident in the remaining two panels within each
figure. Note also that the effects for each of these variables are quite similar across the datasets.
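The effects-plot construction described above amounts to evaluating the fitted equation over a grid of one regressor while the others are fixed at their means. A minimal sketch using the 1980 coefficients reported in Table 5; the means assumed for opinionation and consensus are illustrative values of ours, not figures taken from the paper.

```python
import numpy as np

# Coefficients from the 1980 model in Table 5. The two means below are
# assumed for illustration; the paper does not report them.
b0, b_pol, b_opin, b_cons = 0.511, 79.578, 0.581, -0.032
opin_mean, cons_mean = 0.75, 2.0

# Predicted reliability across a grid of polarization values, with the
# polarization regressor entered as -1 times its inverse.
polar_grid = np.linspace(400.0, 800.0, 50)
pred = (b0 + b_pol * (-1.0 / polar_grid)
        + b_opin * opin_mean + b_cons * cons_mean)

# The curve rises with polarization at a decreasing rate, matching the
# nonlinearity the text describes.
assert np.all(np.diff(pred) > 0)           # monotone increasing
assert np.all(np.diff(np.diff(pred)) < 0)  # concave
```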
The results from these regressions show that the reliabilities of the feeling thermometer variables
definitely are affected by the nature of the stimulus object being evaluated. Divisions of opinion (captured here by the polarization variable and inversely by the consensus measure) lead to greater
systematic structure in the thermometer scores and higher reliability. Similarly, the attitudinal
salience of the stimulus object, as manifested in the opinionation variable, promotes reliability.
Reasonably enough, sizable segments of the mass public must have non-neutral feelings about an
object in order to provide meaningful thermometer evaluations about the object. In summary, the
thermometer reliabilities vary in fairly obvious ways. There is nothing in the regression results to
suggest that they are particularly suspect representations of affective reactions toward the stimulus
objects.
CONCLUSION
The results from this study should provide strong reassurance to the many social scientists
who use feeling thermometers as empirical variables in their research. The charge that the feeling
thermometers are extremely unreliable is simply wrong. The reliabilities for most of the feeling
thermometer variables examined in this study are very high, both in absolute terms and in comparison to the reliabilities for other well-known and commonly-used variables from the ANES. And
in the relatively few cases where reliabilities are low, the problem clearly lies with the nature of
the stimulus object rather than the characteristics of the measurement instrument. When used
to evaluate a salient stimulus for which there are sizable differences of opinion within the public,
the feeling thermometers provide very reliable gauges of sentiment. But, when they are applied
to ambiguous objects toward which fewer crystallized opinions actually exist, it should be no surprise that the resultant thermometer scores contain a higher amount of “noise” variation. This is
precisely how any measurement procedure or tool should work.
While the results presented here are clearcut, they definitely are not the final word on the
measurement characteristics of the feeling thermometers. Other issues, such as the measurement
level of the scores and the potential for differential item functioning, still require attention. But
there would be little reason to worry about these latter measurement properties if the thermometer
scores proved to be extremely unreliable, essentially randomly fluctuating numbers rather than meaningful reflections of individual attitudes. The evidence presented here indicates that this is definitely not the case.
In fact, the findings about reliability should provide even stronger motivation for pursuing the
other measurement characteristics. The large amount of prior research that has been carried out
using the feeling thermometers demonstrates clearly that the latter are related to a variety of other
interesting phenomena. This makes it all the more important to understand more precisely how the
specific thermometer scores correspond to individual attitudes (i.e., the question of measurement
level) and whether one person’s scores can be meaningfully compared to the scores provided by
another person (i.e., the issue of differential item functioning). Scientific progress in any field
depends heavily on the quality with which its central concepts are measured. And, the current
study provides important evidence on exactly that point. The general conclusion is that, if the
feeling thermometers are “notoriously unreliable,” then that notoriety is badly misplaced. The
actual data show that the ANES feeling thermometers definitely comprise a reliable measurement
instrument.
REFERENCES
Adida, Claire L.; Lauren D. Davenport; Gwyneth McClendon. (2015) “Ethnic Cueing Across Minorities: A Survey Experiment on Candidate Evaluation in the U.S.” Public Opinion Quarterly. Forthcoming.
Alwin, Duane F. (1992) “Information Transmission in the Survey Interview: Number of Response
Categories and the Reliability of Attitude Measurement.” Sociological Methodology 22: 83-118.
Alwin, Duane F. (1997) “Feeling Thermometers Versus 7-Point Scales: Which Are Better?” Sociological Methods and Research 25: 318-340.
Ansolabehere, Stephen; Jonathan Rodden; James M. Snyder, Jr. (2008) “The Strength of Issues:
Using Multiple Measures to Gauge Preference Stability, Ideological Constraint, and Issue
Voting.” American Political Science Review 102: 215-232.
Brady, Henry. (1985) “The Perils of Survey Research: Inter-Personally Incomparable Responses.”
Political Methodology 11: 269-290.
Broockman, David; Joshua Kalla; Peter Aronow. (2015) “Irregularities in LaCour (2014).” Unpublished manuscript. Available online at https://people.stanford.edu/dbroock/sites/default/files/broockman_kalla_aronow_lg_irregularities.pdf
Conover, Pamela Johnston, and Stanley Feldman. (1981) “The Origins and Meaning of Liberal/Conservative Self-Identifications.” American Journal of Political Science 25: 617-45.
Delavande, Adeline and Charles F. Manski. (2010) “Probabilistic Polling and Voting in the 2008
Presidential Election: Evidence from the American Life Panel.” Public Opinion Quarterly 74:
433-459.
Feldman, Stanley. (1988) “Structure and Consistency in Public Opinion: The Role of Core Beliefs
and Values.” American Journal of Political Science 32: 416-40.
Green, Donald Philip. (1988) “On the Dimensionality of Public Sentiment toward Partisan and
Ideological Groups.” American Journal of Political Science 32: 758-780.
Heise, David R. (1969) “Separating Reliability and Stability in Test-Retest Correlation.” American Sociological Review 34: 93-101.
Jacoby, William G. (1986) “Levels of Conceptualization and Reliance on the Liberal-Conservative
Continuum.” Journal of Politics 48: 423-432.
Jacoby, William G. (1999) “Levels of Measurement and Political Research: An Optimistic View.”
American Journal of Political Science 43: 271-301.
Jacoby, William G. and David A. Armstrong II. (2014) “Bootstrap Confidence Regions for Multidimensional Scaling Solutions.” American Journal of Political Science 58: 264-278.
Kalkan, Kerem Ozan; Geoffrey C. Layman; Eric M. Uslaner. (2009) “‘Bands of Others’? Attitudes Toward Muslims in Contemporary American Society.” Journal of Politics 71: 847-862.
Krosnick, Jon A. (1991) “The Stability of Political Preferences: Comparisons of Symbolic and
Nonsymbolic Attitudes.” American Journal of Political Science 35: 547-576.
LaCour, Michael J. and Donald P. Green. (2014) “When Contact Changes Minds: An Experiment
On Transmission Of Support For Gay Equality.” Science 346: 1366-1369.
Markus, Gregory B. and Philip E. Converse. (1979) “A Dynamic Simultaneous Equation Model
of Electoral Choice.” American Political Science Review 73: 1055-1070.
Nunnally, Jum C. and Ira H. Bernstein. (1994) Psychometric Theory (Third Edition). New York,
NY: McGraw-Hill.
Page, Benjamin I. and Calvin Jones. (1979) “Reciprocal Effects of Policy Preferences, Party
Loyalties, and the Vote.” American Political Science Review 73: 1071-1089.
Rabinowitz, George B. (1976) “A Procedure for Ordering Object Pairs Consistent with the Multidimensional Unfolding Model.” Psychometrika 41: 349-373.
Rabinowitz, George B. (1978) “On the Nature of Political Issues: Insights from a Spatial Analysis.”
American Journal of Political Science 22: 793-817.
Rabinowitz, George and Stuart E. Macdonald. (1989) “A Directional Theory of Issue Voting.”
American Political Science Review 83: 93-121.
Reynolds, H. T. (1974) “Rationality and Attitudes Toward Political Parties and Candidates.”
Journal of Politics 36: 983-1005.
Weisberg, Herbert F. (1980) “A Multidimensional Conceptualization of Party Identification.”
Political Behavior 2: 33-60.
Weisberg, Herbert F. and Jerrold G. Rusk. (1970) “Dimensions of Candidate Evaluation.” American Political Science Review 64: 1167-1185.
Wilcox, Clyde; Lee Sigelman; Elizabeth Cook. (1989) “Some Like It Hot: Individual Differences
in Responses to Group Feeling Thermometers.” Public Opinion Quarterly 53: 246-257.
Wiley, David E. and James A. Wiley. (1970) “The Estimation of Measurement Error in Panel
Data.” American Sociological Review 35: 112-117.
Winter, Nicholas and Adam Berinsky. (1999) “What’s Your Temperature? Thermometer Ratings
and Political Analysis.” Paper presented at the 1999 Annual Meetings of the American
Political Science Association.
Zinni, Frank P., Jr.; Franco Mattei; Laurie A. Rhodebeck. (1997) “The Structure of Attitudes
toward Groups: A Comparison of Experts and Novices.” Political Research Quarterly 50:
595-626.
Table 1: Wiley-Wiley reliability estimates for feeling thermometers from the 1980 ANES Major Panel Study.

                                  Panel wave
                        January      June      September
Jimmy Carter              0.802     0.834        0.838
Edward Kennedy            0.856     0.829        0.822
George McGovern           0.781     0.702        0.702
Walter Mondale            0.709     0.632        0.690
Gerald Ford               0.717     0.697        0.674
George Bush               0.940     0.927        0.940
Jerry Brown               0.800     0.745        0.694
Howard Baker              0.693     0.643        0.645
John Connally             0.823     0.783        0.780
Ronald Reagan             0.806     0.790        0.812
George Wallace            0.779     0.740        0.736
Democratic party          0.769     0.776        0.729
Republican party          0.718     0.719        0.675
Independents              0.660     0.546        0.520
Parties in general        0.550     0.395        0.506
Table 2: Wiley-Wiley reliability estimates for party identification index, ideological self-placement variable, and seven-point issue scales from the 1980 ANES Major Panel Study.

                                      Panel wave
                            January      June      September
Party identification          0.862     0.869        0.870
Ideology self-placement       0.635     0.654        0.657
Defense spending              0.701     0.724        0.689
Govt. spending vs. services   0.634     0.637        0.602
Aid to Russia                 0.594     0.527        0.502
Inflation vs. Unemployment    0.619     0.612        0.498
Table 3: Wiley-Wiley reliability estimates for feeling thermometers from the 2000-2002-2004 ANES Panel Survey.

                              Panel wave
                        2000      2002      2004
Bill Clinton           0.820     0.812     0.808
Hillary Clinton        0.849     0.832     0.827
Al Gore                0.789     0.726     0.756
Ralph Nader            0.770     0.736     0.736
Jesse Jackson          0.773     0.707     0.708
George W. Bush         0.739     0.767     0.836
Dick Cheney            0.798     0.850     0.889
Liberals               0.635     0.571     0.617
Conservatives          0.646     0.651     0.694
Federal Government     0.573     0.557     0.657
Supreme Court          0.466     0.493     0.489
Congress               0.511     0.463     0.563
Military               0.608     0.529     0.550
Whites                 0.405     0.411     0.397
Blacks                 0.431     0.376     0.338
Hispanics              0.418     0.376     0.368
Asian Americans        0.426     0.419     0.413
Catholics              0.463     0.491     0.545
Jews                   0.536     0.511     0.512
Protestants            0.492     0.502     0.508
Labor Unions           0.692     0.627     0.678
Big Business           0.530     0.489     0.567
Poor People            0.486     0.473     0.497
People on Welfare      0.568     0.523     0.512
Fundamentalists        0.715     0.700     0.745
The Elderly            0.410     0.470     0.432
Environmentalists      0.629     0.632     0.630
Gays and Lesbians      0.742     0.705     0.740
Feminists              0.561     0.510     0.571
Table 4: Wiley-Wiley reliability estimates for party identification index and questions about increasing or decreasing federal spending from the 2000-2002-2004 ANES Panel Survey.

                                          Panel wave
                                    2000      2002      2004
Party identification               0.898     0.900     0.912
Fed spending, welfare              0.460     0.476     0.397
Fed spending, fighting AIDS        0.593     0.529     0.476
Fed spending, foreign aid          0.487     0.522     0.514
Fed spending, helping the poor     0.526     0.405     0.263
Fed spending, Social Security      0.647     0.597     0.587
Fed spending, protecting environment  0.583  0.558     0.446
Fed spending, public schools       0.542     0.469     0.500
Fed spending, reducing crime       0.342     0.400     0.305
Fed spending, helping children     0.624     0.546     0.503
Fed spending, assistance to Blacks 0.699     0.689     0.597
Table 5: Predicting the reliability of the feeling thermometer variables from the characteristics of the objects being evaluated with each thermometer. Table entries are OLS regression coefficients with standard errors in parentheses. Standard errors are clustered by stimulus objects. All coefficients are statistically different from zero at the 0.05 level, with one-sided tests.

                              Data source
                       1980 ANES Major    2000-2002-2004
                       Panel Study        ANES Panel Study
-(1/Polarization)          79.578             164.297
                          (28.326)            (14.224)
Opinionation                0.581               0.530
                           (0.204)             (0.117)
Consensus                  -0.032              -0.016
                           (0.016)             (0.004)
George H. W. Bush           0.246                 —
                           (0.024)
Intercept                   0.511               0.638
                           (0.178)             (0.099)
R2                          0.739               0.834
Se                          0.060               0.059
N of observations              45                  87
N of clusters                  15                  29
Figure 1: Wiley-Wiley reliability estimates for feeling thermometers from the 1980 ANES Major Panel Study.
[Dot plot of the Table 1 estimates, with separate symbols for the January, June, and September 1980 waves. Stimuli are ordered from highest to lowest reliability: George Bush, Edward Kennedy, Jimmy Carter, John Connally, Ronald Reagan, Jerry Brown, George McGovern, George Wallace, Democratic Party, Republican Party, Gerald Ford, Walter Mondale, Howard Baker, Independents, Parties in General. Horizontal axis: Wiley-Wiley Reliability Estimates, 0.0 to 1.0.]
Figure 2: Wiley-Wiley reliability estimates for party identification index, ideological self-placement variable, and seven-point issue scales from the 1980 ANES Major Panel Study.
[Dot plot of the Table 2 estimates, with separate symbols for the January, June, and September 1980 waves. Variables are ordered from highest to lowest reliability: party identification, defense spending, liberal-conservative identification, government spending, inflation vs. unemployment, aid to Russia. Horizontal axis: Wiley-Wiley Reliability Estimates, 0.0 to 1.0.]
Figure 3: Wiley-Wiley reliability estimates for feeling thermometers from the 2000-2002-2004 ANES Panel Survey.
[Dot plot of the Table 3 estimates, with separate symbols for the 2000, 2002, and 2004 waves. Stimuli are ordered from highest to lowest reliability: Dick Cheney, Hillary Clinton, George W. Bush, Bill Clinton, Al Gore, Jesse Jackson, Ralph Nader, Fundamentalists, Gays and Lesbians, Conservatives, Labor Unions, Federal Government, Liberals, Environmentalists, Military, Feminists, People on Welfare, Big Business, Congress, Catholics, Jews, Protestants, Poor People, Supreme Court, The Elderly, Blacks, Asian Americans, Hispanics, Whites. Horizontal axis: Wiley-Wiley Reliability Estimates, 0.0 to 1.0.]
Figure 4: Wiley-Wiley reliability estimates for party identification index and questions about increasing or decreasing federal spending from the 2000-2002-2004 ANES Panel Survey.
[Dot plot of the Table 4 estimates, with separate symbols for the 2000, 2002, and 2004 waves. Variables are ordered from highest to lowest reliability: party identification, then federal spending on assistance to Blacks, Social Security, helping children, fighting AIDS, protecting the environment, public schools, helping the poor, foreign aid, welfare, and reducing crime. Horizontal axis: Wiley-Wiley Reliability Estimates, 0.0 to 1.0.]
Figure 5: Effect plots for the independent variables predicting feeling thermometer reliabilities in the 1980 ANES Major Panel Study.
[Three panels plotting predicted Wiley-Wiley reliability for the feeling thermometers (vertical axis, approximately 0.4 to 0.9) against each independent variable with the others held at their means: polarization (horizontal axis, roughly 400 to 800; increasing at a decreasing rate), opinionation (0.6 to 0.9; linear, increasing), and consensus (0 to 5; linear, decreasing).]
Figure 6: Effect plots for the independent variables predicting feeling thermometer reliabilities in the 2000-2002-2004 ANES Panel Study.
[Three panels plotting predicted Wiley-Wiley reliability for the feeling thermometers (vertical axis, approximately 0.4 to 0.9) against each independent variable with the others held at their means: polarization (horizontal axis, roughly 400 to 1000; pronounced increase at a decreasing rate), opinionation (0.6 to 0.9; linear, increasing), and consensus (0 to 10; linear, decreasing).]