Abstract
Multiple-item indexes are ubiquitous within the sociology of religion. However, there are a growing number of articles in other disciplines that have advocated the use of single-item measures in specific circumstances. Using quantitative survey data taken from the United Kingdom, this article contributes to this literature by exploring the impact of single and multiple item measures of religious evil on a series of social and political attitudes. The findings suggest that belief in the devil is the most consistent predictor within a multiple-item measure of religious evil and the multiple-item measure does not significantly outperform single-item measures. Indeed, the item “most evil in the world is caused by the devil” could be a more efficient measure of religious evil, particularly where it is combined with religious attendance. While further cross-cultural research on the impact of belief in religious evil remains necessary, the article also finds some evidence to suggest that exploration of more secular beliefs in evil might be advantageous.
Introduction
It is somewhat of a truism within survey research to suggest that multiple-item measures (MIMs) are more reliable and more valid than single-item measures (SIMs). There are at least three arguments typically made in favor of MIMs. The first is that an assessment of internal reliability cannot be made with a single item. Second, multiple items are less vulnerable to random measurement error, with single items often viewed as being susceptible to biases in meaning and interpretation. Third, multiple items can capture a greater range of dimensions associated with a construct than a single item, which is presumed to have lower content validity.
However, there are several good reasons to use single items in specific circumstances. From a theoretical point of view, some measures can be considered “doubly concrete” in that they are unambiguous in meaning and attribution (Bergkvist and Rossiter 2007). That is to say that some concepts tend toward binary in terms of interpretation and feeling. Not everything is a multifaceted concept and, in some instances, increasing measurement may contribute to additional response error. Similarly, there are practical advantages to single-item measures, particularly in terms of cost and attrition. This is especially the case for survey research where the accuracy and efficiency of survey items can be at a premium. Not only are single-item measures less monotonous and less time-consuming for respondents, they are also more cost-effective for researchers and funders. Single-item measures may also reduce “method-effects” by reducing the chance of common method bias (Gardner et al. 1998). This typically occurs when relationships between observations occur due to the format of a survey battery rather than the content of items.
These advantages have been noted in many disciplines. Within the field of organizational sciences, for example, Matthews, Pineault, and Hong (2022) recently reported a series of studies that collectively demonstrated that 75 out of the 91 single-item constructs they examined had good or extensive validity in comparison to their multiple-item counterparts. They concluded that single-item measures could: evidence moderate to high content validity; have no usability concerns; demonstrate moderate to high test-retest reliability; and, have extensive criterion validity. Elsewhere, Bergkvist (2016) has made very strong and repeated arguments for the use of single-item measures in marketing research (Bergkvist 2015; Bergkvist and Rossiter 2007, 2009), and there are also an increasing number of papers in the psychological sciences that have demonstrated the utility of single-item measures (Allen, Iliescu, and Greiff 2022).
Given the existing research in other disciplines, in this article we highlight the differences between single-item measures and multiple-item measures and implement methods that can be used to compare their effectiveness. Although many concepts in the sociology of religion are measured by combining multiple items, for this article we chose to focus on the different approaches that people have taken to measure belief in religious evil. There is a wealth of evidence to demonstrate the association between belief in religious evil and a variety of political opinions and moral attitudes. In most of these studies, belief in religious evil is assessed with a multi-item measure combining three items: belief in the devil, belief in hell, and belief in demons. However, more recent evidence provided by Desmond, Clark, and Bader (2023) suggests that a single measure of belief in the Devil/Satan can be just as useful in predicting attitudes related to abortion, family matters, sexuality, and substance use—particularly when it is assessed in relation to church attendance. As noted by Baker, Molle, and Bader (2020), images of Satan are relatively uniform—Satan is a powerful, supernatural force of evil that is in opposition to what is considered good (i.e., God). Therefore, belief in the devil can be considered a “doubly concrete” measure in that there is relatively little ambiguity in terms of what that belief connotes.
More generally, the interaction between belief in Satan and church attendance also resonates with the literature on what might be termed “the sociology of evil,” which repeatedly argues for the absolute interdependence of good and evil (Alexander 2001; Douglas 2017; Lemert 1997; Wolff 1969). Against Platonic and Augustinian thinking that would imagine evil as absence, for example, Alexander (2001) argues that what is seen to be evil is always in opposition to the pursuit of what is perceived to be good—or how people think the world ought to be. The investigation of one necessarily entails the investigation of the other. In this respect, Desmond et al. (2023) also demonstrate a significant interaction effect between belief in the Devil/Satan (evil) and church attendance (good) for 10 out of 12 moral beliefs. That is to say that religious service attendance has little or no effect on moral beliefs, unless it is accompanied by a belief in the Devil/Satan.
This article further explores the utility of a single-item measure of religious evil—belief in the Devil/Satan. More specifically, if a multiple-item measure of religious evil can discriminate between the relative strength of belief because it has “more information,” it is possible to make three predictions. Firstly, the multiple-item measure should be able to account for a greater proportion of the variance in a criterion measure than each of its individual component parts. Second, each component should have the capacity to make a significant contribution to the variance explained. Thirdly, the multi-item measure should also demonstrate the capacity to explain more variance in a criterion measure than other single-item measures of a similar nature. If the multiple-item measure fails to consistently outperform the single-item measure in these respects, then it may be concluded that the single-item measure is just as effective as its multiple-item counterpart.
Using novel survey data collected in the UK, we provide evidence to suggest: a) belief in the devil appears to be the most consistent predictor within the multi-item measure of religious evil; b) the belief in hell and belief in demons components are limited in terms of their additional utility and may even provide a source of measurement variance; c) when combined with a measure of religious attendance, the multi-item measure does not significantly outperform a single-item measure of belief in the devil. This article presents evidence to suggest that the item “most evil in the world is caused by the devil” could be a more efficient measure of religious evil, particularly where it is combined with religious attendance. Finally, the article offers some evidence to suggest that belief in evil may not necessarily be located entirely within supernatural belief systems.
Single-Item Versus Multiple-Item Measures
There are three main arguments for the use of multiple item measures (MIMs). The first relates to “internal consistency” and suggests that without different components of measurement, it is not possible to estimate reliability. As highlighted by Cho and Kim (2015), Cronbach’s alpha is by far the most used reliability coefficient—with alpha ≥ .7 generally considered to be acceptable. When used in this way, alpha is often tacitly taken to be equivalent to a measure of homogeneity, interrelatedness, or general factor saturation—none of which can be assessed by a single item.
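The internal consistency argument can be made concrete with a short calculation. The following is a minimal sketch of Cronbach's alpha using only the Python standard library; the function name and the response data are hypothetical and are not drawn from the survey analyzed in this article.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    `items` is a list of k lists, each holding one item's scores
    for the same n respondents (complete cases only).
    """
    k = len(items)
    if k < 2:
        # Alpha compares items with one another, so it cannot be
        # computed for a single-item measure.
        raise ValueError("alpha is undefined for a single item")
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_variances = sum(variance(col) for col in items)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Hypothetical four-point responses from six respondents.
devil = [4, 3, 1, 2, 4, 1]
hell = [4, 3, 1, 2, 3, 1]
demons = [3, 3, 1, 1, 4, 2]

alpha = cronbach_alpha([devil, hell, demons])
print(f"alpha = {alpha:.2f}; acceptable (>= .7): {alpha >= 0.7}")
```

The function raises an error when given one item, which mirrors the point made above: the coefficient depends on comparing items with each other, so a single item offers nothing to compare.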
Secondly, given that there is no way to assess the reliability of a single-item measure, there is also no way of telling whether the measurement is subject to more or less random error. It is assumed that increasing the number of items will decrease the likelihood of measurement error. This assumption is commonly based on what is termed “the Spearman-Brown prophecy”—which predicts that measurement errors cancel themselves out as the number of items increases. Both classical test theory and item response theory imply that measurement error decreases as the number of items increases, although the relationship is not necessarily monotonic. While the hard version of such an argument is that the error of a single item is necessarily higher, and the softer version that the reliability of single-item measures remains unknown, the essence of the argument is the same: because single items cannot be compared to any corresponding items, single-item measures are more problematic than multi-item measures.
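The Spearman-Brown relationship can be illustrated with a few lines of code. This is a sketch of the standard prophecy formula, which assumes that the added items are parallel to the existing ones (equal reliabilities and uncorrelated errors); the function name and the example reliability of .50 are illustrative assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a measure is lengthened.

    `reliability` is the reliability of the current measure and
    `length_factor` is the ratio of new length to current length
    (3 means going from one item to three parallel items).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical single item with reliability .50, extended to 1, 2, 3 and 10 items.
for n_items in (1, 2, 3, 10):
    print(n_items, round(spearman_brown(0.50, n_items), 3))
```

Under these assumptions, predicted reliability rises with each added item but with diminishing returns. The prediction does not hold where additional items are not parallel, which is the situation the contamination argument for single items describes.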
The third argument concerns the multidimensional nature of concepts and their measurement. This is sometimes referred to as “content validity,” and it has two components: concept and measurement. In terms of concept, single-item measures cannot capture the complexity of social constructs in the same way that multi-item measures are able to because they are necessarily limited by number. In terms of measurement, given that single-item measures typically capture “less information”—often a three-, five-, or seven-point scale—they are also unable to make sensitive distinctions between respondents. There is some truth to both points. With respect to concept, there is little doubt that sophisticated social concepts are unlikely to be suitable for single-item measurement—and overly complex questions are usually considered bad practice in survey research. With respect to measurement, single-item measures are likely to have fewer response categories and there is little evidence that simply increasing the size of the scale improves measurement by itself (Dawes 2008). Therefore, multi-item measures appear to be a useful way of managing these issues.
While there were a few authors who questioned the orthodoxy of multi-item measures (Gorsuch and McFarland 1972; Gorsuch and McPherson 1989; Ray 1974; Wanous and Reichers 1996), it was not until the early 2000s that a growing number of articles began to advocate for the use of single-item measures more strongly (Fuchs and Diamantopoulos 2009; Loo 2002; Rossiter 2002). There are two practical, and two technical, arguments for the use of single-item measures made within this literature. With regard to practicality, single-item measures are administratively expedient in terms of time and effort. This is particularly important in survey research where the financial cost of the survey is proportional to the number of items employed—increase the number of items and the survey will often become more costly. In addition to keeping costs down, the extra space that is generated can also be used to include a larger number of constructs. That is to say that single-item measures offer better “value for money” than their multiple-item counterparts.
As highlighted by Allen et al. (2022), there are similar benefits from the perspective of the respondent too. Not only do single-item measures avoid repetitive content and thereby encourage participation, they might also be preferential in circumstances where cognitive or emotional considerations need to be taken into account, or where time is likely to be a factor in response. It is also true that there are ethical obligations that would dictate that researchers should not needlessly waste the time of respondents with lengthy and repetitious questions.
The technical arguments for the use of single-item measures are, perhaps, more convincing from an empirical perspective. The first argument for the use of single-item measures in specific cases concerns what Bergkvist and Rossiter (2007, 2009) term “doubly concrete” constructs. These cases occur where the construct of interest is unambiguous, narrow in scope, and where increasing the number of items increases the risk of construct contamination.
Contamination typically has two sources. The first type of contamination occurs where the dimensionality of a construct is extended beyond its focus to include items that are indirectly associated with the target construct, rather than a direct property of it. The second risk of contamination results from what is often referred to as common method bias. This occurs when variance assumed to result from the construct is attributable to the method of measurement. This risk is thought to be particularly pertinent when collecting large amounts of data in a single wave. The longer the survey, the greater the chance that respondents begin to respond to the structural demands of the battery or instrument rather than the specific requirement of the item.
“Doubly concrete” constructs are, on the other hand, relatively unambiguous in meaning and attribution. This means that adding items for measurement will not provide further dimensionality to the construct because there is no further dimensionality to add. Therefore, any further specification is likely to increase the likelihood of contamination rather than reduce measurement error.
Similarly, the issue of reliability and/or internal consistency may not be a concern in such cases. This is because there are no facets of a measure to assess beyond the initial object. The need for a reliability coefficient is negligible. Instead, Bergkvist and Rossiter (2007) suggest using the predictive validity of a criterion to evaluate the performance of single-item measures and multi-item measures. They propose three reasons for this. Firstly, a measure cannot be predictively valid without being reliable (Gorsuch and McFarland 1972); secondly, predictive validity is a more robust measure than content validity; and thirdly, predictive validity is more important than internal consistency (Cho and Kim 2015; Cronbach 1951). If a precise measure is at least comparable in performance with a multiple counterpart, then there is a strong case for the single item approach to be retained.
None of this is to say that single-item measures are automatically better than multi-item measures, but instead the discussion offered here highlights that there are situations where single items are likely to increase precision rather than necessarily decrease it. One such situation may occur in the measurement of religious evil.
The Measurement of Religious Evil
There is now an emerging body of empirical literature in the USA that has sought to examine the extent of belief in religious evil, and its potential impact on several moral and social issues. Both Grasmick and McGill (1994) and Leiber and Woodrick (1997) used belief in the devil as part of a multi-item measure designed to assess biblical literalism and its various impacts on punitive ideology. Wilcox, Linzey, and Jelen (1991) also developed a four item multi-item measure to examine belief in the threat of Satan to the US (alpha = .69) and used it to explore the political activism of premillennialists—with belief in an active Devil increasing positive attitudes toward activism but decreasing the perceived efficacy of that participation.
Using an eight item “Belief in an Active Satan Scale” (alpha = .81), Wilson and Huff (2001) first demonstrated that belief in religious evil was correlated with negative attitudes about ethnic and sexual minorities. They would later revise the scale to include 10 items (alpha = .91) and demonstrate that such intolerance extended to communists and women, although the relationship was moderated by religious fundamentalism and right-wing authoritarianism (Wilson, Accord, and Bernas 2006).
As highlighted by Baker (2008), while these early studies used aspects of religious evil to categorize particular types of (Christian) religious belief, the work of Wilson and Huff (2001) established that it was possible to measure belief in religious evil, and that such beliefs could influence social and political attitudes. However, it was the first wave of the Baylor Religion Survey (BRS) in 2005 that provided the platform for a more purposeful examination of religious evil (Bader, Mencken, and Froese 2007; Bader et al. 2023). The survey was a nationally representative examination of religious values, practices, and behaviors present in the US population. It contained nearly 400 items devoted to the sociology of religion (Bader et al. 2007). While the early studies of religious evil were relatively idiosyncratic in terms of item construction, the Baylor Religion Survey included a standard battery addressing general religious beliefs: “In your opinion does each of the following exist”—“The Devil/Satan,” “Heaven,” “Hell,” “Purgatory,” “Armageddon,” “Angels,” “Demons,” and “The Rapture” (“Absolutely,” “Probably,” “Probably not,” “Absolutely not”). This question set was repeated in subsequent editions of the survey taken in 2007, 2010, and 2014.
Taking three of these items (the Devil/Satan, hell, and demons), Baker (2008) demonstrated the association between each individual item and a range of demographic variables. As income and education levels increase, belief in each item decreases, with women and African Americans tending to hold firmer belief in the Devil, hell, and demons. However, using an additive index of the three items, Baker’s (2008) article also demonstrated that for women belief in religious evil reflected higher rates of religiosity more generally. This did not appear to be the case for African Americans. Similarly, while younger Americans also tend to hold stronger beliefs when controlling for religion, the effect of income and education is attenuated by church attendance—with high levels of attendance neutralizing the impact of social status.
Baker’s (2008) article made a crucial contribution because it not only demonstrated that belief in religious evil could be studied empirically, but perhaps more importantly, that it should be studied using robust national level data. However, in providing an empirical demonstration of the importance of studying the impact of religious evil, the article also tacitly provided a blueprint for how to study it—and that method utilized a three-item multiple measure.
Several studies have subsequently used the three-item multi-item measure that combines belief in the Devil/Satan, demons, and hell (Ellison et al. 2021; Martinez 2013; Martinez et al. 2018). Baker et al. (2020), for example, used the three-item MIM to demonstrate a relationship between belief in religious evil and a range of attitudes associated with sexuality, including abortion, same-sex relations, pre-marital sex, extra-marital sex, pornography, and cohabitation. Elsewhere, Martinez, Tom, and Baker (2022) used the three-item measure to show that belief in religious evil is a strong predictor of support for more restrictive immigration policies, while Ellison et al. (2021) demonstrated how the multi-item measure is a robust predictor of support for policies that expand gun rights. Jung (2020) used a two-item index (the Devil/Satan and demon items) to demonstrate that belief in religious evil is associated with higher levels of anxiety and paranoia, while Baker and Booth (2016) adapted the index slightly by combining beliefs about the existence of Satan and hell with an item that measures agreement with whether “Satan causes most evil in the world.”
However, this emergent orthodoxy in the measurement of religious evil has recently been brought into question. Baker et al. (2020), for example, highlight that beliefs about Satan are much less nuanced than beliefs about God—where there is an abundance of research that has examined how different perceptions of God variously influence moral and political attributions (Bader et al. 2017; Froese and Bader 2010). While the historical development of the Devil is complex (Cohn 1993; Messadié 1996; Pagels 1996; Russell 1981, 1992), beliefs about the Devil/Satan are likely to be much more uniform across religious denominations and believers and non-believers—with the Devil being the personification of all that is in opposition to what is perceived to be good, regardless of differences in whatever good that may actually represent.
To this end, Desmond et al. (2023) have demonstrated that a single-item measure of belief in Satan can be used as a robust predictor in a variety of moral domains, even when controlling for religious service attendance, biblical literalism, and images of God. This includes the “wrongness” of abortion, marital affairs, homosexuality, premarital sex, pregnancy out of wedlock, and stem cell research—with the three-item measure only significantly related in a further two arenas (cohabitation and pornography). Use of the three-item measure also did not particularly improve the adjusted R² score, with an increase of less than 1% in 11 out of 12 domains. Similarly, the interaction effect between religious service attendance and belief in Satan was significant for 10 of the 12 moral beliefs, with higher levels of religious attendance having little or no effect when people do not also believe in Satan.
The Sociology of Evil
In addition to demonstrating the utility of a single item measure of religious evil, the results presented by Desmond et al. (2023) also suggest that where ritual is not accompanied by belief, or where belief is unsupported by ritual, then the relative influence of judgments about (deviant) morality is suppressed. This finding resonates with the more theoretical literature on what might be termed “the sociology of evil,” where several scholars have highlighted the absolute interdependence between the perception of good and the perception of evil (Alexander 2001; Douglas 2017; Lemert 1997; Wolff 1969). Alexander (2001), for example, argues that evil cannot be understood as a residual category of what is seen to be “good.” Instead, our ideas of evil—what it is, where to find it, and how to deal with it—provide the symbolic building blocks of a moral society. For Alexander, narratives of evil are the conceptual gloss on “social efforts to symbolize, narrate, code, and ritualize the good” (Alexander 2001:156). The identification and management of evil provides the contrast that is necessary to imagine how we think the world ought to be and an explanation for why it might not be. Therefore, the sociological analysis of evil can help to “reveal the skeletal structures upon which social communities build the stories that guide their everyday taken-for-granted political life” (Alexander 2001:166).
Similarly, Douglas (2017) notes that judgments about good and evil are categorical distinctions between fundamental absolutes. That is to say that good and evil cannot be considered to exist on a moral continuum because that is not how they are understood within everyday life. It might be possible to make an assessment about whether one evil is somehow worse than another, but neither can be considered good: “good necessarily implies a categorical contrast; if there is a good type there must be an evil type” (Douglas 2017:5). Although Douglas (2017) is much less prescriptive about the exact construction of what constitutes evil than Alexander’s (2001) consensus-based model, in both cases the contrast between an interdependent binary—good and evil—provides the framework through which moral belief and behavior can be seen to exist.
In these terms, belief in the Devil/Satan (evil) provides the necessary contrast for religious attendance (good) to be morally meaningful, and vice versa—hence the relative strength of the interaction term. The two opposing forces provide the categorical distinction for both to be meaningful.
However, while the findings of Desmond et al. (2023) appear to confirm some of the theoretical conceits presented within the sociological literature on evil, it remains to be seen whether those results can be applied in contexts outside of the USA. Indeed, while most of the work on religious evil is based on North American data, there is some evidence to warrant further investigation elsewhere. Sigelman (1977), for example, notes that the USA had higher rates of belief in the Devil and hell than other countries, while Baker and Booth (2016) also provide some tentative evidence using the World Values Survey (2010-2014) to suggest that belief in the Devil in the USA may be twice as high as other developed countries—although they also note that these differences require further elaboration. To these ends, it is not clear whether and how the relationship between religious attendance, belief in the Devil/Satan, and morality resonates beyond the context of the USA.
Aims and Objectives
This article aims to address five key research questions. First, does the multiple-item measurement of religious evil account for a greater proportion of the variance in a criterion measure of moral belief than each of its component elements? Second, does each of the components make a significant addition to the measure? Third, to what extent is the measurement of religious evil invariant across religious affiliation? Fourth, when used in conjunction with religious attendance, does the multiple-item measurement of religious evil outperform single items that are designed to assess belief in the Devil/Satan? Finally, do the relationships between religious attendance, belief in the Devil/Satan, and morality resonate in contexts beyond the USA, specifically in the UK?
Methods
Data
This article draws on data taken from CASPAR: the Chapman and Sheffield Paranormal and Religion survey (Bader, Baker, and Clark 2024). The survey was conducted by Ipsos Mori using their web-based panel, and respondents consisted of a random, national sample of 2,100 UK citizens. The panel is composed from random probability unclustered address-based sampling and has a total of over 25,000 panelists. In August 2021, panelists were invited by email to complete the survey, with design and calibration weights then applied by the provider to correct for differences in selection probabilities and response rates between subgroups.
Survey Instruments
The survey was designed to assess religious and alternative beliefs and experiences, while also asking respondents to respond to batteries of items on moral issues, political opinion and social issues. Items were specifically designed to resonate with measures used previously in the literature, although some adjustments were made to suit house styles and local context.
In the first instance, the analysis combines a set of indicators that measure belief in religious evil. Responding on a four-point scale (“definitely,” “probably,” “probably not,” “definitely not”), participants were asked “in your opinion, does each of the following exist?” This included—“The Devil/Satan,” “Hell,” and “Demons”—with items subsequently recoded so that a high value represents a high level of belief.
A further battery of “evil” items was adapted from a range of surveys discovered using “the measurement wizard” tool available through the Association of Religious Data Archives. This included: “Most evil in the world is caused by the Devil” (hereafter “Devil”); “Most evil in the world is caused by humankind” (“Humankind”); “Satan is the root of all evil” (“Satan_Root”); and “Most evil in the world is caused by malignant supernatural forces” (“Supernatural forces”). These items were measured on a five-point scale: “strongly agree,” “tend to agree,” “neither agree nor disagree,” “tend to disagree,” “strongly disagree.” At the insistence of the provider, all items within the survey were accompanied by the option “prefer not to say.” All of these returns were coded as missing.
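The recoding described for these batteries can be sketched as follows. The numeric codes, the function names, and the listwise treatment of missing values in the additive index are illustrative assumptions; the source states only that high values represent high belief on the four-point items and that “prefer not to say” was coded as missing. The direction of the five-point coding shown here is likewise an assumption.

```python
# Assumed numeric codes: higher values mean stronger belief or agreement.
FOUR_POINT = {
    "definitely": 4,
    "probably": 3,
    "probably not": 2,
    "definitely not": 1,
}
FIVE_POINT = {
    "strongly agree": 5,
    "tend to agree": 4,
    "neither agree nor disagree": 3,
    "tend to disagree": 2,
    "strongly disagree": 1,
}


def recode(response, scale):
    """Map a response label to its code; anything else becomes missing (None).

    "prefer not to say" is not in either mapping, so it is returned as None.
    """
    return scale.get(response.strip().lower())


def additive_index(codes):
    """Sum the item codes; return None if any item is missing (assumed rule)."""
    if any(code is None for code in codes):
        return None
    return sum(codes)


# Hypothetical respondent answering the Devil/Satan, Hell, and Demons items.
answers = ["Definitely", "Probably", "Prefer not to say"]
codes = [recode(a, FOUR_POINT) for a in answers]
print(codes, additive_index(codes))
```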
We control for a range of demographic variables within the analyses. This includes: gender (men = 1); age (16–24, 25–34, 35–44, 45–54, 55–64, 65–75); ethnicity (white = 1); marital status (married = 1); and, whether or not the respondent has children under 18 living at home (yes = 1). Education was assessed using a seven-point category system, ranging from primary school through to post-graduate levels. Income is measured on a 10-point ordinal scale (Under £5,000, £5,000–£9,999, £10,000–£14,999, £15,000–£19,999, £20,000–£24,999, £25,000–£34,999, £35,000–£44,999, £45,000–£54,999, £55,000–£99,999, £100,000 or more). England is used as the reference category for the region of the UK, with dummy variables for Wales, Scotland, and Northern Ireland.
Religious affiliation was assessed using the provider’s standard item—“What is your religion, if any?”—followed by a 17-fold classification system. To simplify the analysis, this was subsequently recoded to include: Catholic (n = 211), Church of England (n = 527), Protestant (n = 157), Muslim (n = 88), Other (n = 92), Agnostic (n = 276), and Atheist (n = 441).
Religious attendance was also measured with the standard item—“How often do you attend religious services at a church, mosque, synagogue or other place of worship?”—with responses measured on an eight-point scale (never; less than once a year; once or twice a year; several times a year; once a month; two to three times a month; weekly; several times a week).
We also used eight bespoke items within the survey that were directed toward the assessment of social issues related to moral beliefs. Each of the items used the same five-point response format: “strongly agree,” “tend to agree,” “neither agree nor disagree,” “tend to disagree,” and “strongly disagree.” Within the battery, respondents were first asked “To what extent do you support or oppose each of the following things? Please select one answer per statement.” Responses were subsequently combined into four groups of two statements, with higher scores indicating support: “Congestion charges on busy roads across the UK” (Transportation issues) and “Tightening the laws in the UK to help protect the environment” (Environmental issues); “Greater funding of the National Health Service” and “Increases in the national minimum wage in the UK” (Welfare rights); “Equal rights for transgender people in the UK” and “Same-sex marriage” (Sexual freedoms); and, “The death penalty for persons convicted of murder” and “Harsher sentences than there currently are in the UK for convicted criminals” (Punitiveness).
Analytic Strategy
Bergkvist and Rossiter (2007) suggest two principal techniques to examine the effectiveness of a multiple-item measure—and both aim to establish the predictive validity of a measure through its relative relationship with a criterion variable. In respect to assessing the composition of multiple-item measurements, it is possible to break down the items of a multiple-item scale and enter them as independent predictors in a regression on the criterion variable. The central item is entered first, and then each additional item in turn. If those additions provide a significant increase to the adjusted R², then there is evidence that the construct is suitable for multidimensional measurement (research question one). Differences between the models can be further assessed with a partial F statistic. Given that the measurement of religious evil is usually composed of three distinct components (Devil, Hell, and Demons), then it would be expected that each should make an additive contribution to the criterion measure (research question two).
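The stepwise comparison can be sketched in code. The example below is a minimal, standard-library illustration of entering items one at a time and comparing adjusted R² and the partial F statistic between nested models. The function names and the eight hypothetical respondents are assumptions for illustration; the sketch omits the control variables and survey weights that a full analysis would require.

```python
def ols_fit(y, predictors):
    """Ordinary least squares with an intercept.

    Returns (r2, adjusted_r2, residual_sum_of_squares, number_of_predictors).
    """
    n, k = len(y), len(predictors)
    X = [[1.0] + [p[i] for p in predictors] for i in range(n)]
    m = k + 1
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan
    # elimination with partial pivoting.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(m)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    beta = [A[i][m] / A[i][i] for i in range(m)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - rss / tss
    adjusted = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adjusted, rss, k


def partial_f(rss_reduced, rss_full, k_reduced, k_full, n):
    """Partial F statistic comparing a reduced model nested in a full model."""
    numerator = (rss_reduced - rss_full) / (k_full - k_reduced)
    denominator = rss_full / (n - k_full - 1)
    return numerator / denominator


# Hypothetical data: a criterion attitude and three belief items.
criterion = [2, 3, 5, 6, 1, 4, 5, 7]
devil = [1, 2, 3, 4, 1, 2, 3, 4]
hell = [1, 1, 2, 3, 2, 2, 4, 4]
demons = [1, 2, 2, 4, 1, 1, 3, 4]

# Enter the central item first, then each additional item in turn.
steps = [[devil], [devil, hell], [devil, hell, demons]]
previous = None
for step in steps:
    r2, adjusted, rss, k = ols_fit(criterion, step)
    line = f"{k} item(s): adjusted R2 = {adjusted:.3f}"
    if previous is not None:
        f_stat = partial_f(previous[0], rss, previous[1], k, len(criterion))
        line += f", partial F vs previous step = {f_stat:.3f}"
    print(line)
    previous = (rss, k)
```

Judging significance requires comparing each partial F value against the F distribution with (k_full - k_reduced, n - k_full - 1) degrees of freedom, which a statistics library can supply.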
Given that the three-item scale purports to measure religious evil, we might also expect that there is no appreciable difference in the performance of the scale across religious affiliation (research question three). This is sometimes referred to as measurement invariance. While there is discussion of appropriate “cut-offs,” Item Response Theory can be used to assess both item fit and Differential Item Functioning (DIF) (see Engelhard and Wang 2021). Item fit is an assessment of the suitability of an item for use in the model. The most robust assessments of item fit using a Rasch model require infit and outfit statistics of between .7 and 1.3 (Aryadoust, Ng, and Sayama 2021), with violations of infit thought to be more problematic than outfit violations (Linacre 2002). On the other hand, Differential Item Functioning occurs when an item within a latent trait varies systematically between groups. Following Pietryka and MacIntosh (2022), it is possible to detect DIF by examining group differences in the standardized residuals using a one-way ANOVA test—with a significant result suggesting the presence of DIF on a particular item. Item contrasts can then be used to identify differences between individual groups.
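The two checks described above can be sketched as follows. This is an illustrative outline only: it assumes that standardized residuals for one item have already been exported from a fitted Rasch-type model and grouped by religious affiliation, and the helper names and data are hypothetical. The F statistic would then be compared against an F distribution with the returned degrees of freedom to obtain a p-value.

```python
from statistics import fmean


def within_fit_range(mean_square, low=0.7, high=1.3):
    """True if an infit or outfit mean square lies inside the stated range."""
    return low <= mean_square <= high


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of values."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - fmean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical standardized residuals for a single item, one list per group.
residuals_by_group = [
    [-1.0, 0.0, 1.0],
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
]
f_stat, df_b, df_w = one_way_anova_f(residuals_by_group)
print(f_stat, df_b, df_w)
```

A large F relative to its degrees of freedom indicates that the residuals differ systematically between groups, which is the signal treated here as evidence of DIF on that item.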
There are several methods that can be used to establish the utility of alternative single-item measures against their multiple item counterpart. Perhaps the most common is to assess the convergent validity that exists between single and multiple item versions. Unfortunately, there is little agreement on what value represents acceptable convergence. Greiff and Allen (2018) suggest that one approach is to adopt alpha coefficients associated with test-retest reliability, where r ≥ .90 is excellent, ≥.80 is good, and ≥.70 is moderate. However, these cut-offs are only general rules of thumb.
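As a rough illustration of how such cut-offs might be applied, the sketch below computes an alpha coefficient across a set of measures and labels it against the thresholds quoted above. It assumes that the alpha in question is Cronbach's alpha, which the text does not state explicitly, and the function names and scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(measures):
    """Cronbach's alpha; measures is a list of equally long score lists."""
    k = len(measures)
    totals = [sum(scores) for scores in zip(*measures)]
    item_variance_sum = sum(variance(scores) for scores in measures)
    return k / (k - 1) * (1.0 - item_variance_sum / variance(totals))


def convergence_label(coefficient):
    """Label a coefficient using the rule-of-thumb cut-offs quoted in the text."""
    if coefficient >= 0.90:
        return "excellent"
    if coefficient >= 0.80:
        return "good"
    if coefficient >= 0.70:
        return "moderate"
    return "below moderate"


# Hypothetical responses from five respondents on three measures.
measures = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(measures)
print(round(alpha, 3), convergence_label(alpha))
```

Since the thresholds are only rules of thumb, the label is best read as a coarse description rather than a pass or fail decision.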
Following Gorsuch and McFarland (1972), it should also be possible to inspect the communalities and eigenvalues associated with the single-item measures and multi-item measures after performing a principal components analysis. This would give an indication of the extent to which each individual scale had variance in common with the other scales in the analysis. Again, while cut-offs are general rules of thumb, it could be expected that each of the scales would load on to a single factor that explained at least 70 percent of the variance, with communalities of over .60 for each item (MacCallum et al. 2001), and component loadings of over .71 (Comrey and Lee 2013).
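The arithmetic behind these rules of thumb can be illustrated with a short sketch. It assumes a single retained component extracted from standardized variables, so that the eigenvalue equals the sum of squared loadings and total variance equals the number of variables. The function names and loadings are hypothetical and are not the article's results.

```python
def share_of_variance(eigenvalue, n_variables):
    """Share of total variance captured by one component of standardized variables."""
    return eigenvalue / n_variables


def communality(loadings):
    """Communality of one variable: sum of squared loadings on retained components."""
    return sum(loading ** 2 for loading in loadings)


def check_single_component(loadings_by_measure):
    """Apply the three rules of thumb to a single-component solution.

    loadings_by_measure maps a measure's name to its loading on the component.
    """
    values = list(loadings_by_measure.values())
    eigenvalue = sum(value ** 2 for value in values)
    share_ok = share_of_variance(eigenvalue, len(values)) >= 0.70
    communalities_ok = all(communality([value]) > 0.60 for value in values)
    loadings_ok = all(abs(value) > 0.71 for value in values)
    return share_ok and communalities_ok and loadings_ok


# Hypothetical loadings for four measures on one component.
example = {"measure_a": 0.85, "measure_b": 0.80, "measure_c": 0.90, "measure_d": 0.82}
print(check_single_component(example))
```

In a single-component solution a loading of .71 corresponds to a communality of about .50 (0.71² ≈ 0.50), so the .60 communality threshold is the stricter of the two item-level criteria.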
However, both alpha and Principal Component Analysis (PCA) will only confirm the relative similarity of the items. For Bergkvist and Rossiter (2007), the relative influence on a criterion measure is the most effective method of establishing predictive validity. That is to say that if a single-item measure is comparable in performance with a multi-item measure, then there is a strong case for the single-item approach to be retained. To this end, it would be possible to accept the single-item measure by making three further assessments. First, there is no appreciable decline in the adjusted R² score when using the single-item measure. Second, there is little difference in the composition of the model when controlling for demographic characteristics. Third, given that both the theoretical discussion and empirical evidence suggest an interdependence between good and evil, we would also expect an interaction between religious evil and religious attendance (research question four). These final outputs should be sufficient to examine the usefulness of a single-item measure, and the extent to which the relationships between religious attendance, belief in the Devil/Satan, and moral belief exist in the context of the UK (research question five).
Results
The results of the OLS hierarchical regression examining the relative impact of each item of the multiple item measure of religious evil are depicted in Tables 1 and 2. Four moral domains are assessed: welfare support and sexual freedoms feature in Table 1, and environmental issues and punitiveness in Table 2.
Table 1. Regression of Multiple Item Measures of Religious Evil on Moral Beliefs (1).
*p < .05. **p < .01. ***p < .001.
Table 2. Regression of Multiple Item Measures of Religious Evil on Moral Beliefs (2).
*p < .05. **p < .01. ***p < .001.
The results presented in Table 1 (columns 1–4) suggest that only the item assessing belief in the Devil/Satan provides a positive and significant increase beyond the controls in support for welfare issues. Columns 5 to 8, however, suggest that while belief in the Devil/Satan is negatively associated with sexual freedoms (support of trans/gay rights), the effect is negated by the inclusion of other variables. Belief in Demons is the only evil measure that remains significant in the final model. The results presented in the first four columns of Table 2 suggest that none of the items on religious evil provide a significant increase to R², although it is, perhaps, notable that the Hell variable makes a (non-significant) negative contribution to the final model.
While belief in the Devil/Satan is an improvement on the null model, only belief in Demons makes a significant contribution to a final model predicting greater support for harsher punitive policy (columns 5 to 8). Across all the final models, no significant increase in adjusted R² is reported for the item measuring belief in Hell, while the performance of the Devil/Satan and Demons items is variable. Therefore, within the context of the UK, there is some evidence to suggest that these items might not be suitable for multidimensional measurement (research question one), because they can act in different ways when regressed on criterion measures (research question two).
To assess measurement invariance using Item Response Theory, we used the TAM package in R (Robitzsch, Kiefer, and Wu 2024) and followed the method for polytomous data outlined by Pietryka and MacIntosh (2022). Item fit statistics for the variables within the religious evil multi-item measure are presented in Table 3. As previously suggested, the most robust assessments require infit and outfit statistics of between .7 and 1.3. Further, there is a general acceptance within the literature that large sample sizes (n > 500) will tend to bias t-statistics for polytomous data, whereas mean square statistics have been shown to be relatively stable (Smith et al. 2008). Given that the sample size in this instance is large, we discount the significance levels associated with outfit and infit measures. The outfit score for the Satan variable does not meet the robust cut-off, but this is marginal and unlikely to be detrimental in “low stakes” measurements. All the items score below 1 and therefore fit the model better than might otherwise be expected.
Table 3. Religious Evil: Item Fit Statistics.
**p < .01. ***p < .001.
However, inspection of Table 4 reveals evidence of Differential Item Functioning (DIF) for all the items in relation to Religious Affiliation. To be clear, there is some recognition in the literature that DIF is context dependent, and there is a difference between benign and adverse DIF. Benign DIF is generally not considered to be detrimental to the scale, while adverse DIF is typically attributed to measurement error. As suggested by Gierl et al. (2001:167): “When conducting DIF analyses, it is a matter of judgment as to whether the secondary dimension is interpreted as benign or adverse in a particular testing situation … the purpose of the test, the nature of the secondary dimension, and the examinees of interest must be considered.”
Table 4. Religious Evil: Differential Item Functioning by Religious Affiliation.
***p < .001.
In this case, it is possible to make at least two clear points in relation to the argument that the Differential Item Functioning is adverse. Examination of Table 5 suggests that there are no significant differences in item functioning on the Demon item between those claiming Catholic, Protestant, and CoE affiliation and those claiming to be agnostic. This is not the case for either beliefs in Satan or Hell where significant differences are reported, and perhaps would be expected. This means that where latent trait scores are the same for individuals, there are no systematic differences across the groups on the Demon item. This would seem to compromise the argument that the Demons item is a distinguishing feature of a specifically religious evil.
Table 5. Item Contrasts by Religious Affiliation.
*p < .05. **p < .01. ***p < .001.
Perhaps more problematically, Table 5 also suggests evidence of a consistent difference between Muslims and other forms of affiliation in respect to the Hell item. Again, this is to suggest that where individuals in these groups correspond on the scale overall, they will likely differ in respect to their belief in Hell—with Muslims more likely to score higher on the item.
As a result, there is evidence to suggest that the items within the religious evil multi-item measure are not invariant across religious affiliations (research question three). That is to say that belief in the Devil/Satan, belief in Hell and belief in Demons operate in different ways across religious groups.
To assess the convergent validity of the single item measures, Table 6 provides a correlation matrix between all individual measures and the multiple item counterpart. All correlations are statistically significant (p < .01). The alpha coefficient between these items is .681 (n = 1768), which falls just below the .70 threshold for moderate convergence. However, it is notable that the correlations between the “Most evil in the world is caused by humankind” (Humankind) variable and every other measure of religious evil are negative. If the Humankind variable is removed, the alpha coefficient rises to .794. These results suggest that the religious evil index, “Satan is the root of all evil” (Satan_Root), “Most evil in the world is caused by the Devil” (Most evil … Devil), and “Most evil in the world is caused by malignant supernatural forces” (Supernatural Forces) items are similar in nature, but different to the “Humankind” measure.
Table 6. Correlation Matrix of Evil Items.
***p < .001.
Principal components analysis (PCA) also reveals communalities over .60 for each item, except the “Humankind” item, which achieved .11 and was therefore removed from the analysis. With an eigenvalue of 2.90, the results of the PCA suggest a single factor solution that explains 72.4% of the variance, with each component loading above .80 (see Table 7). While the multi-item measure, “Most evil … Devil,” “Satan-Root,” and “Supernatural forces” items appear suitable for comparison, the item assessing “Humankind” may operate in a different way from the other items, as it does not load on to the single factor identified in the solution. However, we include it in the remaining analyses to continue to assess its performance in relation to moral beliefs.
Table 7. PCA Component Loadings.
Tables 8 to 11 show the results of the Ordinary Least Squares (OLS) hierarchical regressions examining the relative performance of each measure of evil in combination with the religious attendance measure. Again, regression coefficients are used to assess the relative impact of the independent measures on the criterion measures: environmental issues, welfare issues, sexual freedoms, and punitiveness.
Table 8. Comparative Performance of Items by Support for Environmental Issues.
*p < .05. **p < .01. ***p < .001.
Table 9. Comparative Performance of Items by Support for Welfare.
*p < .05. **p < .01. ***p < .001.
Table 10. Comparative Performance of Items by Support for Sexual Freedoms.
*p < .05. **p < .01. ***p < .001.
Table 11. Comparative Performance of Items by Support for Harsher Punitive Policy.
*p < .05. **p < .01. ***p < .001.
With respect to environmental issues, Table 8 demonstrates that while none of the items measuring religious evil achieve significance individually, when used in combination with religious attendance, the interaction term for the “Most evil in the world is caused by the Devil” item is the only one to make a significant contribution to the model, with the final model explaining 6 percent more of the variance in adjusted R² than the multi-item measure. Therefore, the “Most evil … Devil” variable may be more sensitive than other measures when combined with religious attendance.
Relative performance with respect to support for welfare is assessed in Table 9. Beyond the controls, the “Satan is the root of all evil,” “Most evil … Devil” and “Most evil … Humankind” items achieve significance in themselves. However, whereas the “Most evil … Devil” and “Satan … root” items are negatively associated with welfare support, the “Humankind” item is positive. This provides further evidence that these items are indicative of substantively different moral structures. Only the interaction terms for the multi-item measure, the single item of belief in the Devil/Satan, and the “Most evil … Devil” variable achieve significance. However, the multi-item measure explains 4 percent more of the variance in adjusted R² than the “Most evil … Devil” variable, and the single-item measure of belief 10 percent more. While both the single-item measure for belief in the Devil/Satan and the “Most evil … Devil” variable are about as sensitive as the multiple-item measure (each determining that people with lower religious attendance who do not believe in the Devil/Satan are more likely to support increased welfare), the single item explains slightly more of the variance.
Table 10 shows the comparative results with respect to attitudes toward sexual freedoms. Again, the “Humankind” item is the only measure to demonstrate a positive association with the criterion measure, with all other measures of religious evil being negative, although only the “Satan … Root,” “Most evil … Devil,” and “Supernatural forces” variables make significant additions to their respective models. However, only the “Most evil … Devil” and “Supernatural forces” items have significant interactions with religious attendance, explaining a total of 23 percent and 22 percent of the variance in adjusted R², respectively. Therefore, the “Most evil … Devil” item is more sensitive than both the multiple-item measure and the single belief item measure in determining that people with less religious attendance who do not believe in the Devil are more likely to support gay/trans rights.
Finally, Table 11 shows the comparative results in performance on the punitive criterion measure. The “Humankind” item is the only measure that does not demonstrate a significant, positive association with the criterion measure, whereas the model containing “Supernatural forces” is the only model where religious attendance achieves significance. None of the interaction measures are significant. The total variance explained is highest for the “Supernatural forces” and “Satan_Root” items (11.0% and 10.8% respectively), while the multiple item measure (10.1%) marginally outperforms both the single item belief in the Devil/Satan (9.5%) measure and the “Most evil … Devil” item (9.7%). Religious attendance, therefore, does not appear to predict support for harsher punitive policy, although belief in the Devil is a significant predictor in its own right.
Discussion
There is an emerging body of literature across several disciplines to suggest that single-item measures can be as effective as their multiple-item counterparts. At the same time, there is also some emerging evidence to suggest that multiple-item measures of religious evil can effectively be reduced to a single item, belief in the Devil, without significant detrimental impact on explanatory capacity. This article makes an important contribution to this literature by providing further evidence to demonstrate that single-item measurements of religious evil can be as effective as their multiple-item counterparts. While there is some variation in performance across items and between criterion variables, the multiple-item measure for religious evil is neither consistently better in terms of the amount of variance explained nor in terms of achieving significance than other measures. Differences in performance between the item “Most evil in the world is caused by the Devil” and both the single-item measure of belief in the Devil and the multiple-item measure of religious evil are marginal, particularly when used in conjunction with religious attendance. Moreover, there is evidence that the religious evil multi-item measure should not be assumed to be invariant across religious affiliation and may conflate nuanced differences in response. Therefore, these findings offer further support to the idea that belief in the Devil should be considered a “doubly concrete” concept that may be more usefully assessed with a single item. In addition, there is also some evidence offered here that the “Most evil is caused by humankind” item may interrelate with moral beliefs in ways different to beliefs in the Devil. Further exploration of other measures assessing belief in evil may be advantageous.
There are, of course, limitations to the findings. The first concerns the lack of wider cross-cultural evidence pertaining to items assessing religious evil. Alongside Desmond et al. (2023), this article demonstrates that there is evidence of effectiveness of single items in the measurement of religious evil in the UK and the USA. However, while there are certainly differences between the two countries, they do have somewhat intertwined histories and it remains to be seen how items assessing belief in the devil, religious evil, religious attendance, and morality interact within different, wider contexts. Further cross-cultural evidence would be beneficial to help specify the utility of single and multiple-item measures, their relative interaction with religious affiliation and attendance, and how they work in conjunction with social and moral beliefs. It is also not possible to eliminate the possibility of method effects within the measurement of these items. Method effects occur where variation in an item can be explained by the method of measurement, rather than being a property of the attribute (Schweizer 2020). The measures in this study were delivered in two distinct batteries: the initial belief items (Devil, Hell, Demons), and a battery focused on the measurement of belief in evil. Evidently, this article does report fine-grained distinctions between the performance of these items and this is strongly indicative that there are differences between their relative performance. However, while a general assumption is made within survey research that each item is independent of another, further research using these items in different contexts would help to discount the impact of potential method effects. Moreover, when used in conjunction with “evil” items, it also remains to be seen whether other items that symbolize the “good” (whether religious belief, practice, or experience) are more sensitive than religious attendance.
Indeed, there are well-established discussions concerning the limitations of religious attendance as an indicator of religiosity and it may be that there are more sensitive measures that can be used in conjunction with measures of religious evil (Smith 1998).
Regardless of these limitations, the article makes four important contributions to the literature. In the first instance, there are some trends apparent in the United Kingdom that seem to be congruent with those in North America. If we were to take the “Devil” item as a barometer of the relationship between religious evil and moral belief, there is evidence here to suggest that increased support in respect to environmental issues is related to decreases in religious attendance, particularly when accompanied by little or no belief in the Devil. This appears to also be the case for support for increased welfare support and attitudes toward sexual freedoms; religious attendance and belief that “the Devil is responsible for most evil in the world” suppress support for each, particularly when they occur together. In contrast, support for harsher punitive policy is only associated with higher levels of belief in the Devil. While this finding might appear somewhat different to other criterion measures, the nuanced relationship between religious evil and punitiveness has been reported elsewhere (Baker and Booth 2016). These findings provide support for further investigation of this relationship.
In demonstrating the interaction between belief in the Devil and religious attendance, the article also provides further support for the absolute interdependence between ideas of good and evil. It remains possible to attend religious services and not believe in the Devil. Similarly, it is possible to believe in the Devil and be largely irreligious. However, it is evident that the presence of one supports the meaning and interpretation of the other. Attendance demonstrates a ritual “good,” but that alone does not necessarily have the contrast to propel understandings of those actions into a meaningful moral sphere. Resonating with some long-standing, but largely untested, theoretical concepts in the sociology of evil (Alexander 2001; Douglas 2017), belief in the Devil appears to be the catalyst that, when combined with a ritual “good” in the form of religious attendance, allows moral attributions to be made with greater certainty. That is to say that if a person believes that they are good, and that there is an evil agonist in the world, then there is a greater platform to interpret morality in a binary format. Evidently, belief in greater punitiveness is something of an outlier in this respect where belief in the Devil is sufficient for support. However, and as previously stated, this has not gone unnoticed elsewhere and focused research is likely to be necessary to disentangle these effects. What this article does, however, is further demonstrate the importance of evil in understanding punitive ideology beyond the context of the USA.
Third, while the results in this article suggest that the single-item use of the Devil variable may be beneficial, this does not mean that the items assessing belief in Hell or Demons are unimportant. These variables may also be “doubly concrete” in as much as they assess highly specific beliefs that may also vary by religious affiliation. Such nuances are also likely to be underestimated within approaches that package belief as a general index. Further examination of how these items might vary across affiliation is likely to be particularly valuable.
Finally, this article offers evidence to suggest that further empirical assessment concerning more secular beliefs in evil is necessary. This includes the relative impact on moral beliefs, but also the different forms these beliefs might take. The differential performance of the “most evil in the world is caused by humankind” item suggests the possibility that belief in a distinctly human evil is separate from belief in a religious evil, however it is measured. In the context of a secular society where evil is often associated with serial killers, child murderers, terrorists, and pedophiles, this might not be too surprising. It has, however, gone largely unexplored within the literature. Of course, Augustinian metaphysics attempted to resolve the Epicurean paradox by placing the cause of evil within humankind and it would not necessarily be contradictory to believe that “the Devil is the cause of most evil in the world” and “most evil in the world is caused by humankind” (Evans 1982). That said, and unlike the “Most evil … Devil” item, the evidence in this article suggests that those who believe in human evil are more likely to support welfare issues and greater sexual freedoms, and although there is a non-significant relationship between belief in human evil and environmental issues, it is notable that the relationship is a positive one. To put this more simply, the belief that most evil in the world is caused by the Devil appears to lead to different outcomes than the belief that “most evil is caused by humankind.” Further investigation is, therefore, necessary to explore the nature, form, and influence of more secular beliefs in evil. Indeed, the results presented within this article again highlight the importance of further developing robust survey items relating to beliefs of both “good” and “evil” across cross-cultural contexts and their relative influence on moral beliefs.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Ethical Approval
The project received ethical approval from the Chapman University Institutional Review Board (IRB-21-275) and written informed consent was given to the survey provider by participants in all cases.
Data Availability Statement
The data in this study are available upon reasonable request.
