Abstract
In-service and preservice teachers are increasingly required to integrate research results into their classroom practice. However, due to their limited methodological background knowledge, they often cannot evaluate scientific evidence firsthand and instead must trust the sources on which they rely. In two experimental studies, we investigated the amount of this so-called epistemic trustworthiness (dimensions expertise, integrity, and benevolence) that student-teachers ascribe to the authors of texts who present classical research findings (e.g., learning with worked-out examples) that allegedly were written by a practitioner, an expert, or a scientist. Results from the first exploratory study suggest that student-teachers view scientists as “smart but evil,” since they rate them as having substantially more expertise than practitioners, while also being less benevolent and lacking in integrity. Moreover, results from the exploratory study suggest that evaluativistic epistemic beliefs (beliefs about the nature of knowledge) predict epistemic trustworthiness. A preregistered conceptual replication study (Study 2) provided more evidence for the “smart but evil” stereotype. Further directions of research as well as implications for practice are discussed.
Introduction
Teachers all over the world are encouraged to integrate insights from educational research into their everyday practice (Bauer & Prenzel, 2012; Slavin, 2002; Williams & Coles, 2007). However, given their limited knowledge about research methodology, they often cannot evaluate these insights firsthand and must consult secondhand evaluations by asking “Whom do I believe?” instead of “What is true?” (Bromme, Thomm, & Wolf, 2015).
Hence, features such as study design or sample representativeness might be less important when teachers evaluate knowledge claims, whereas criteria such as perceived author expertise or integrity (so-called epistemic trustworthiness; Hendriks, Kienhues, & Bromme, 2015) become pivotal. Research into the predictors of epistemic trustworthiness, however, is still in its infancy (for the first experimental attempts, see, e.g., Hendriks, Kienhues, & Bromme, 2016a; Thon & Jucks, 2017). Do teachers and student-teachers see certain sources as more credible than others? Does the trustworthiness they ascribe to certain sources depend on their beliefs regarding the nature of scientific knowledge (Hofer & Bendixen, 2012; so-called epistemic beliefs; Hofer & Pintrich, 1997)? In view of this research gap and considering that such predictors may provide valuable insights into how we should approach teacher education, we investigated student-teachers’ perceived epistemic trustworthiness of educational researchers and how it relates to their epistemic beliefs in an exploratory pilot study and a preregistered main study. Before we describe the two studies in detail, we will provide some background information about the constructs they entail.
Epistemic Trustworthiness and Epistemic Beliefs
Epistemic Trustworthiness
In-service teachers and teacher education students are confronted with vast amounts of information about teaching that stem from a multitude of information sources. For example, they may read an expert blog that introduces a new digital classroom tool, consult their colleagues’ opinions on a certain teaching method, or skim a newspaper article on school reforms. Moreover, in line with current calls for more evidence-based practice in education (e.g., Munthe & Rogne, 2015), in-service and student-teachers increasingly are required to inform themselves using science-based information sources (e.g., empirical studies or scientific textbooks).
Before using any such information for their everyday practice, it is vital that teachers evaluate its veracity through its logical coherence and cohesiveness (Bromme, Kienhues, Porsch, Bendixen, & Feucht, 2010). Regarding scientific knowledge, however, two aspects make this endeavor particularly challenging (Hendriks, Kienhues, & Bromme, 2016b). First, since it relies on axioms and only offers “degrees of confirmation” (Popper, 1954), scientific knowledge always encompasses some degree of (epistemic) uncertainty (Retzbach, Otto, & Maier, 2016; Sinatra, Kienhues, & Hofer, 2014). Hence, finding out what is “true” is not as easy as it might seem. Second, modern science is highly specialized and has developed a “social infrastructure of knowledge in which there are divisions of cognitive labor and sophisticated mechanisms for recognizing appropriate experts and knowing when and how to defer to them” (Keil, 2010, p. 828). Due to this division of cognitive labor and since teachers (and student-teachers) usually do not have much research expertise, firsthand (i.e., direct) evaluations of the veracity of scientific knowledge are often unfeasible. Therefore, secondhand evaluations (Bromme et al., 2010) come into play, as individuals no longer directly assess the veracity of knowledge claims, but they assess the credibility and trustworthiness of the sources from which this knowledge originates.
When individuals assess the trustworthiness of different sources, they refer to specific source information features (Stadtler, Scharrer, Macedo-Rouet, Rouet, & Bromme, 2016). In this regard, the professional background of the source is of particular importance, as it helps individuals “decide whether the source possesses expertise that is pertinent to his or her current problem” (Stadtler et al., 2016, p. 709). A caveat when individuals use such source information, however, is that biased prior beliefs about specific sources may affect their information behavior considerably, to a point at which they refrain from using specific source types (e.g., science-based sources) altogether. Regarding teacher education, this is an important issue for everyone aiming to promote evidence-based practice: If teachers mistrust science-based information, they likely will not use such knowledge in their teaching and instead rely on experiential and anecdotal evidence. This may, in turn, affect their teaching considerably, especially considering that in the education domain, a multitude of other sources are readily available, for example, colleagues’ knowledge and expertise or personal experiences (Buehl & Fives, 2009). Therefore, we see it as crucial to investigate the “epistemic trust” that (student) teachers attribute to scientific sources (e.g., findings from educational studies), especially in contrast with their trust in nonscientific sources (e.g., teachers).
Several studies from the realm of science communication (Cummings, 2014; Hendriks et al., 2015; Peters, Covello, & McCallum, 1997) and other fields (e.g., Landrum, Mills, & Johnston, 2013; Mayer, Davis, & Schoorman, 1995) suggest operationalizing epistemic trust in three dimensions: expertise, benevolence, and integrity, which can be applied to scientific as well as nonscientific sources (e.g., trust in the expertise of teachers vs. educational researchers). A source exhibits high expertise if it is highly informed, intelligent, and qualified. Benevolent sources are interested in the greater good of others, and sources with integrity respect norms and values in great measure. This conceptualization is particularly fruitful, as it allows for a more fine-grained investigation into the different aspects of epistemic trust in educational science than in studies that analyze the trustworthiness of scientific practices in general (Collins, 2009; Nadelson & Hardy, 2015; Wynne, 2006).
In sum, these deliberations lead us to our first (exploratory) research question:
Research Question 1: What amount of epistemic trust (expertise, benevolence, and integrity) do student-teachers attribute to different sources of educational knowledge?
Extant research from related fields has shown that teacher education students have a rather negative attitude toward scientific knowledge in general, at least when considering educational disciplines; for example, they view general pedagogical knowledge as too abstract and theoretical (Gitlin, Barlow, Burbank, Kauchak, & Stevens, 1999; Sjølie, 2014; van der Linden, Bakx, Ros, Beijaard, & Vermeulen, 2012). While we tended to assume that student-teachers also mistrust scientific knowledge, in line with such research, we did not formulate specific confirmatory hypotheses prior to investigating this research question by means of existing data. Therefore, the present article is divided into a pilot study (Study 1) with an exploratory nature and a main study (Study 2), which was preregistered to ensure confirmatory research (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012).
Epistemic Beliefs
Epistemic trustworthiness focuses on the expertise, benevolence, and integrity that individuals ascribe to specific sources of knowledge (e.g., scientists or practitioners). Epistemic beliefs, in contrast, encompass explicit and implicit beliefs about knowing and knowledge, and can be defined as “an identifiable set of dimensions of beliefs, organized as theories, progressing in reasonably predictable directions, activated in context, operating as epistemic cognition” (Hofer, 2001, p. 377). This definition already highlights that researchers study epistemic beliefs under various notions and perspectives. In the earlier years of epistemic belief research, researchers primarily adopted a developmental perspective that was pioneered by Perry (1970), who interviewed college students and deduced a scheme to describe the development of their epistemic beliefs. According to Perry (1970), epistemic beliefs develop in nine steps over four developmental stages, beginning with a dualist stage, in which individuals view knowledge as absolute and certain. In the next stage, called multiplism, individuals perceive knowledge as created by the human mind, whereby assertions become more like opinions than facts. Individuals with multiplistic beliefs tend to ascribe very similar validity to different types of arguments (e.g., scientific vs. episodic evidence) by emphasizing that “knowledge is subjective, uncertain, and justified by personal preferences and judgments” (Barzilai & Eshet-Alkalai, 2015). In the two highest stages, relativism and commitment within relativism, individuals accept the tentativeness of scientific knowledge and its origin in the human mind but believe that knowledge is susceptible to evaluation.
Individuals with evaluativistic beliefs hence focus on evaluating the validity claims of knowledge assertions instead of neglecting the possibility of this evaluation (multiplistic beliefs) or assuming that knowledge assertions are either true or false (absolutistic beliefs). Perry’s (1970) model was adopted and modified by many researchers (e.g., Greene, Azevedo, & Torney-Purta, 2008; Krettenauer, 2005; Kuhn, Cheney, & Weinstock, 2000), whose models vary in terms of the conceptualization, number, and labeling of respective stages. However, even though the developmental perspective is still used today (Muis, Bendixen, & Haerle, 2006), the assumption that several aspects of epistemic beliefs develop simultaneously when individuals move from one stage to another has been questioned several times (e.g., Hofer & Pintrich, 1997). Over the years, this has led to the so-called dimensional perspective, under which different subdimensions of epistemic beliefs can be distinguished. The most prominent dimensional framework was suggested by Hofer and Pintrich (1997) and includes two dimensions of beliefs about knowledge (simplicity and certainty) and two dimensions of beliefs about knowing (source and justification). However, dimensional frameworks also have been criticized, mainly for their conceptual muddiness in defining the dimensions’ extreme poles. This has led to the development of several integrated frameworks (Barzilai & Eshet-Alkalai, 2015; Greene et al., 2008; Peter, Rosman, Mayer, Leichner, & Krampen, 2016; Rule & Bendixen, 2010), which posit that epistemic beliefs develop within multiple dimensions over diverse stages (e.g., absolutism, multiplism, and evaluativism; Muis et al., 2006).
As for relationships between epistemic beliefs and epistemic trust, reflections of epistemological entities form a condicio sine qua non, at least for some parts of epistemic trust: If one is epistemologically pessimistic (i.e., believing that reality is not accessible to scientists), it does not make sense to attribute high expertise to these scientists. Or vice versa: Assuming that a researcher is competent, benevolent, and has integrity is, by definition, consistent with evaluativistic epistemic beliefs. If scientific knowledge does not consist of arbitrary opinions, but instead is susceptible to evaluations of its assertions through the scientific community, then at least most scientists are likely trustworthy. Hence, we formulated the following second research question:
Research Question 2: Can domain-specific beliefs about educational research predict epistemic trust in sources of assertions from educational research?
For the same reason as outlined above, this research question again was investigated in one exploratory study (Study 1) and one confirmatory, hypothesis-testing study (Study 2).
Present Studies
In this section, we present two studies that investigate the research questions mentioned above and that are repeated here for readers’ convenience: (1) What amount of epistemic trust (expertise, benevolence, and integrity) do student-teachers attribute to different sources of educational knowledge? (2) Can domain-specific beliefs about educational research predict epistemic trust in sources of assertions from educational research? Study 2 was designed to conceptually replicate the results from Study 1, and its hypotheses were preregistered to ensure confirmatory, hypothesis-testing research (Nosek et al., 2015).
Study 1: Exploratory Study
Procedure and Materials
A challenge when investigating students’ trust in different sources is that trust ratings are confounded by the types of information that are usually associated with specific sources. For example, when asking study participants about their trust in scientific sources, they might, in reality, state their opinions about educational theories—or their responses might at least be biased by such opinions. When asking about their trust in practitioners, they might think about one specific colleague who cherishes controversial teaching methods. This is even more problematic considering that a large proportion of variance in students’ epistemic beliefs is located at the topic level, meaning their beliefs strongly vary with regard to different topics and contexts (Merk, Kelava, Schneider, Syring, & Bohl, 2017; Merk, Rosman, Muis, Kelava, & Bohl, 2018; Trautwein & Lüdtke, 2007).
To circumvent this issue and explore whether student-teachers’ trust in scientific knowledge indeed depends on the source of such knowledge, we developed text materials that were invariant in content (i.e., contained the same body of knowledge), but varied in sources. To achieve this, five researchers independently collected curricularly valid educational research topics (e.g., specific theories, effects, or findings), then discussed and evaluated the representativeness of these topics for the domain of educational research and their appropriateness for experimental manipulation. Four topics were chosen (“learning from worked-out examples,” “cognitive theory of multimedia learning,” “bullying/mobbing,” and “classroom size effects on achievement”). Subsequently, invariant text components were created that contained the core information in terms of descriptions of the theories, effects, or findings in question. Finally, the context information pertaining to the (alleged) source of the knowledge was added by means of additional sentences. To enhance the study’s internal validity, three of the five researchers were randomly assigned to the two writing steps and had to fulfill criteria concerning text length (130 words < text length < 200 words) and text complexity (50 < Läsbarhetsindex [Readability Index] = LIX < 65). The full body of material can be found in online Supplemental Appendix 1. Table 1 provides epitomes of the texts.
Epitomes From the Intervention Texts
Design
Study 1 used a between-person design. After responding to some demographic questions and filling out an epistemic beliefs inventory (see “Measurements” section below), every participant read four texts (with four different topics; see “Procedure and Materials” section above), all of which were attributed, for a given participant, to the same alleged source (“practitioner,” “expert,” or “scientific study”). After reading each text, the participants additionally responded to some text-specific questions for purposes of another study (Merk, Rosman, Ruess, Syring, & Schneider, 2017). After having read all four texts, participants responded to an item battery containing, among others, the treatment check and the epistemic trustworthiness inventory.
Sample
Participants (N = 365, 243 females, 51% in the first two semesters) were recruited through slides during lectures and informed that participation was voluntary and could be stopped at any time, that each participant was allowed to participate in a lottery of five vouchers worth €50, and that the study would take about 40 minutes. Data collection was conducted in paper–pencil format. The questionnaires were transcribed to raw data through automated scanning software.
Measurements
Treatment Check
To ensure that the readers perceived the texts’ sources as intended, we asked them to rate how frequently the authors of their respective texts engage in characteristic occupational activities (question prompt: “What do you think: How frequently do the authors of your texts engage in the following activities?” sample item practitioner: “teaching at school”; sample item expert statement: “give advice to schools”; sample item scientific study: “investigating data”; response format: 6-point Likert-type scale; all items can be found in online Supplemental Appendix 2). A multiple indicators, multiple causes (MIMIC; Jöreskog & Goldberger, 1975) model with the source-specific activities as indicators and two dummy variables encoding the three sources as causes (see Figure 1) was fitted to the data. According to widespread benchmarks (Hu & Bentler, 1999; Marsh, Hau, & Wen, 2004), the model fit was not perfect, but suitable for the purposes of a treatment check (χ2 = 121.8, df = 35, CFI [comparative fit index] = .954, TLI [Tucker–Lewis index] = .929, RMSEA [root mean square error of approximation] = .084, SRMR [standardized root mean square residual] = .058). Parameter estimates indicated that the participants strongly differentiated between “practitioner” sources and the other two varieties, but only weakly distinguished between “expert” and “scientific study” sources. Since we specified τ-congeneric measurement models, we assessed the reliability (internal consistency) of the treatment check scales with McDonald’s ω (Dunn, Baguley, & Brunsden, 2013). Reliability was good for the practitioner (ω = .846, 95% confidence interval [CI] [.815, .878]) and scientist scales (ω = .872, 95% CI [.845, .900]), but questionable for the expert scale (ω = .578, 95% CI [.494, .662]).
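As a plausibility check on fit statistics of this kind, the RMSEA point estimate can be recomputed from the χ2 value, the degrees of freedom, and the sample size. The following is a minimal Python sketch, assuming the common formula sqrt(max(χ2 − df, 0) / (df · (N − 1))). The function name is hypothetical, and software that uses robust estimators or pools over imputed data sets may report slightly different values.

```python
from math import sqrt


def rmsea(chi_square, df, n):
    """Point estimate of the RMSEA (assumed formula with N - 1 in the denominator)."""
    # A chi-square below its degrees of freedom is truncated to an RMSEA of zero.
    return sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Treatment-check model: chi-square = 121.8, df = 35, N = 365 (values taken from the text).
print(round(rmsea(121.8, 35, 365), 3))
```

For the treatment-check model this yields a value of about .08, which lies between the conventional cutoffs for good and acceptable fit.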

MIMIC model of the treatment check.
Epistemic Trustworthiness
The Muenster Epistemic Trustworthiness Inventory (METI; Hendriks et al., 2015) was used to assess the extent of epistemic trustworthiness that student-teachers attribute to the different sources. METI is constructed as a semantic differential and consists of the three dimensions: “expertise,” “benevolence,” and “integrity” (see “Measurements” section). To investigate the scales’ construct validity, we first performed a confirmatory factor analysis (CFA). We specified a model with three factors, τ-congeneric measurement models, and three freely estimated residual covariances (based on modification indices), which resulted in good model fit (χ2 = 194.0, df = 71, CFI = .956, TLI = .943, RMSEA = .072, SRMR = .061). Reliability (assessed by McDonald’s ω) was good for all three scales (expertise: ω = .884, 95% CI [.855, .912]; benevolence: ω = .868, 95% CI [.843, .893]; integrity: ω = .838, 95% CI [.784, .892]).
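To illustrate how McDonald’s ω is obtained from a τ-congeneric measurement model, the following Python sketch computes ω from factor loadings and residual variances. It is a simplified illustration under stated assumptions: The loadings are hypothetical (not estimates from this study), the indicators are assumed to be standardized, and residual covariances, which the models reported here do include, are ignored.

```python
def mcdonalds_omega(loadings, residual_variances):
    """Omega for a single-factor congeneric model without residual covariances."""
    common = sum(loadings) ** 2          # variance of the sum score due to the factor
    unique = sum(residual_variances)     # variance of the sum score due to residuals
    return common / (common + unique)


# Hypothetical standardized loadings for a four-item scale (illustration only).
loadings = [0.70, 0.80, 0.75, 0.65]
residuals = [1 - loading ** 2 for loading in loadings]
print(round(mcdonalds_omega(loadings, residuals), 3))
```

Unlike Cronbach’s α, this coefficient does not require equal loadings across items, which is why it suits τ-congeneric models.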
Epistemic Beliefs
We used a domain-specific adaptation of the German FREE questionnaire (FREE; Krettenauer, 2005; Merk, Rosman, et al., 2017), an instrument based on Kuhn and Weinstock’s (2002) framework, to assess the level of development of domain-specific epistemic beliefs. Using a scenario-based approach (e.g., Händel, Artelt, & Weinert, 2013), the instrument describes 13 well-known educational research issues (e.g., “It is repeatedly discussed whether grade retention is actually useful or should be abolished”) and prompts participants to indicate their (dis)agreement with three statements representing the three stages of epistemic development for each presented issue (6-point Likert-type scale; sample statement for the absolutism stage: Either grade retention is useful or not! Educational researchers should unequivocally clarify this in the future; multiplism: The expressions for “grade retention” are mere conjecture; no one can really know which factors contribute to school achievement; evaluativism: Even though the experts disagree, both may present more or less good reasons for their conceptions). Krettenauer (2005) suggests computing a so-called d-index (d = eval − 0.5 * mult − 0.5 * abs) for each issue/scenario. We followed this suggestion and ran a CFA on the scales’ 13 d-indices with a τ-congeneric measurement model and two freely estimated residual covariances (selected by modification indices), which showed a good fit of the model to the data (χ2 = 98.8, df = 63, CFI = .930, TLI = .913, RMSEA = .043, SRMR = .047). Internal consistency of the d-index was satisfactory (McDonald’s ω = .75, 95% CI [.71, .80]).
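The d-index formula can be sketched in a few lines of Python. The function and variable names are hypothetical, and the example ratings are invented for illustration. The only element taken from the text is the formula d = eval − 0.5 * mult − 0.5 * abs, applied to ratings on the 6-point scale.

```python
def d_index(evaluativism, multiplism, absolutism):
    """d-index for one scenario: evaluativism minus the mean of the two other stages."""
    return evaluativism - 0.5 * multiplism - 0.5 * absolutism


# Invented ratings (1-6) of one participant for three scenarios:
# (evaluativism, multiplism, absolutism)
ratings = [(5, 2, 3), (6, 1, 1), (3, 4, 4)]
print([d_index(*scenario) for scenario in ratings])
```

With ratings from 1 to 6, the index ranges from −5 (full agreement with the absolutist and multiplist statements only) to +5 (full agreement with the evaluativist statement only).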
Statistical Analyses
We decided to use multiple regression analysis (Fox & Weisberg, 2011) to answer both research questions. Multiple regression is a so-called complete data method when estimated with least squares. Since simple approaches such as listwise deletion potentially result in biased parameter estimates or lower statistical power (Rubin, 1976; Schafer & Graham, 2002), we had to explicitly deal with missing data. Hence, we used multiple imputation on our raw data (0% to 16.6% missing data) by means of chained equations (Azur, Stuart, Frangakis, & Leaf, 2011) using functions provided by the R-package “mice” (van Buuren & Groothuis-Oudshoorn, 2011). We subsequently estimated the regression models separately on all resulting (30) complete data sets and combined the results using the formulae provided by Rubin (1976).
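The pooling step can be illustrated with a short Python sketch. It assumes the standard combination formulae for multiply imputed data (often called Rubin’s rules): The pooled estimate is the mean of the per-data-set estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the example numbers are hypothetical, and the sketch omits the degrees-of-freedom adjustment needed for significance tests.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one regression coefficient over m imputed data sets."""
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean([se ** 2 for se in standard_errors])  # average sampling variance
    between = variance(estimates)                       # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Invented coefficients and standard errors from three imputed data sets.
estimate, se = pool_rubin([0.4, 0.5, 0.6], [0.1, 0.1, 0.1])
print(round(estimate, 3), round(se, 3))
```

The pooled standard error exceeds the average per-data-set standard error whenever the estimates differ across imputations, which reflects the additional uncertainty due to missing data.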
Results
Initially, we inspected the data descriptively (see Table 2) and graphically (see Figure 2). To answer Research Question 1, we recoded the independent variable “source” into two dummy variables, IExpert and IScientific Study (reference category: practitioner), and conducted a multiple regression analysis with the z-standardized dependent variables “expertise,” “benevolence,” and “integrity.” Hence, the slope parameters of these dummy variables can be interpreted as estimates of differences between the group specified in the respective dummy variable and the reference group. As Table 3 shows, there were moderate to large differences between the practitioner group and the scientific study group in all three dimensions of epistemic trustworthiness: The authors of scientific studies are perceived not only as being less benevolent, with less integrity, but also as having more expertise (see Table 3). All these effects were statistically significant. This was, likewise, the case for differences between the practitioner and expert sources, but only for the dimensions “benevolence” and “integrity” and not for the dimension “expertise” (see Table 3).
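The interpretation of the dummy-variable slopes as group differences can be made concrete with a small Python sketch. The data are invented, and the function names are hypothetical. For a model that contains only the two dummy variables, the least-squares intercept equals the mean of the reference group (practitioner), and each slope equals the difference between the respective group mean and the reference mean.

```python
from statistics import mean


def dummy_code(source):
    """Return (I_expert, I_sci_study); 'practitioner' is the reference category."""
    return int(source == "expert"), int(source == "scientific study")


def dummy_slopes(sources, outcome):
    """Intercept and slopes of a regression of the outcome on the two dummies."""
    def group(name):
        return [y for s, y in zip(sources, outcome) if s == name]

    intercept = mean(group("practitioner"))
    b_expert = mean(group("expert")) - intercept
    b_sci = mean(group("scientific study")) - intercept
    return intercept, b_expert, b_sci


# Invented z-standardized benevolence ratings for six participants.
sources = ["practitioner", "practitioner", "expert", "expert",
           "scientific study", "scientific study"]
benevolence = [0.5, 0.7, 0.1, 0.3, -0.4, -0.2]
print(dummy_slopes(sources, benevolence))
```

That these values are the least-squares solution can be verified by checking that the residuals sum to zero within each group. Once covariates are added to the model, the slopes become adjusted rather than raw group differences.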
Means and Standard Deviations of the METI Dimensions in Study 1
Note. METI = Muenster Epistemic Trustworthiness Inventory.

Violin plots and boxplots of the results (Study 1).
Pooled Results From Multiple Regression
Note. IExpert = dummy coded indicator variable for source “expert”; ISci. Study = dummy coded indicator variable for source “scientific study.”
*p < .05. **p < .01. ***p < .001.
To investigate Research Question 2, we added epistemic beliefs (d-index) to the former models. As can be seen in Table 3, parameter estimates indicated small effects on all dimensions of epistemic trustworthiness. We interpret this as preliminary evidence for an association of epistemic development and epistemic trustworthiness: Student-teachers who believe that (scientific) educational knowledge consists not so much of “absolute facts” (absolutism) and “arbitrary opinions” (multiplism), but more of assertions whose validity can be evaluated (evaluativism), tend to ascribe higher epistemic trustworthiness to the sources on all three dimensions (“expertise,” “benevolence,” and “integrity”).
Interim Discussion of Study 1
Study 1 investigated (1) whether student-teachers tend to trust sources of assertions from educational research (practitioner, expert, and scientific study) differentially and (2) whether their amount of epistemic trust in these sources can be predicted by their epistemic beliefs. Regarding the first question, we found what one may call a “smart but evil” stereotype, as the authors of scientific studies (i.e., scientists) are perceived not only as less benevolent, with less integrity, but also as having more expertise in contrast to practitioners. This is an intriguing finding, as it suggests that student-teachers hold a kind of distrust in scientists (“Scientists have the expertise to find answers, but they do not really want to!”). Regarding Research Question 2, we found small effects from an evaluativistic view of (scientific) educational knowledge on the epistemic trustworthiness of this knowledge’s source.
However, despite several methodological strengths (e.g., the experimental variation of sources or the high construct validity of the measurements), there are three particular limitations that motivated us to undertake a conceptual replication of these findings (Simons, 2014) in the form of a preregistered (Nosek et al., 2015; van ‘t Veer & Giner-Sorolla, 2016) and, therefore, clearly confirmatory (Wagenmakers et al., 2012) study. First, in the field of epistemic beliefs, there is an emerging call for disentangling epistemic beliefs (and related constructs) of varying specificity and different contexts (Buehl & Alexander, 2006; Merk et al., 2018; Muis et al., 2006). However, Study 1 neglects this differentiation. In fact, our participants read topic-specific assertions stemming from different sources, rated the epistemic trustworthiness aggregated for all four texts, and responded to a domain-specific measurement of epistemic beliefs.
Second (and this seems somewhat close but is in fact substantially different from the first point), we want to highlight that Study 1 only investigated source-specific differences in epistemic trustworthiness and its relation to epistemic beliefs in a between-person design. While we judge this as appropriate for a first exploratory study, a large amount of research empirically and conceptually has shown that there is substantial variation in epistemic beliefs within persons (i.e., one individual may have very different beliefs regarding different topics or domains) and between persons (i.e., individuals stemming from different domains may have different beliefs regarding the same topic or domain; Buehl & Alexander, 2001, 2006; Hofer, 2006; Limón, 2006; Merk, Kelava, et al., 2017; Muis, 2004; Trautwein, Lüdtke, & Beyer, 2004; Trautwein & Lüdtke, 2007). Thus, to ensure more detailed conclusions, we see it as crucial to assess source-specific differences in epistemic trustworthiness within and between persons simultaneously as (despite between-person differences) there might be substantial within-person variations of the “smart but evil” stereotype. For example, it is very conceivable that student-teachers view scientists as having much more expertise regarding the “cognitive theory of multimedia learning” topic, but view practitioners as having nearly equal expertise in the topic of “bullying/mobbing” while viewing scientists as having moderately more expertise overall.
Third, as mentioned above, we did specify the research questions before analyzing the data from Study 1, but we had neither specific hypotheses nor a detailed a priori analysis plan. Hence, due to its exploratory nature, the evidence gathered in Study 1 is less robust than it might seem (Chambers, 2017). Therefore, we used the theoretical background and empirical results from Study 1, considered its methodological strengths and weaknesses, and derived a set of specific hypotheses that were preregistered (Merk & Rosman, 2019) and tested along a (likewise preregistered) detailed data analysis plan in Study 2. Study 2 used the same materials as Study 1 and investigated the same research questions but drew on an enhanced design. Therefore, it should be viewed as an attempt at a conceptual replication of Study 1 (Simons, 2014).
Study 2: Confirmatory Study
Research Questions and Hypotheses
Research Question 1
The first research question focuses (for both studies) on the amount of epistemic trust that student-teachers attribute to different sources of educational knowledge. In Study 1, we found what one may call a “smart but evil” stereotype: The authors of scientific studies (i.e., scientists) were perceived as less benevolent, with less integrity, but having more expertise in contrast to practitioners.
To replicate these findings conceptually, we suggest the following confirmatory hypothesis for Study 2:
Hypothesis 1: Student-teachers ascribe less integrity and benevolence, but more expertise, to scientific information sources in contrast to practitioner sources.
Research Question 2
Our second research question aims (for both studies) at investigating whether epistemic beliefs about educational research can predict epistemic trust in sources of assertions from educational research. As already outlined above, we see evaluativistic, domain-specific, epistemic beliefs as a necessary condition for epistemic trust, thereby suggesting the following hypothesis:
Hypothesis 2a: Evaluativistic, domain-specific, epistemic beliefs are positively related to ascriptions of integrity, benevolence, and expertise.
To overcome the conceptual weakness of Study 1 concerning the blurred levels of specificity (see above), we added an analogous, but more specific, hypothesis:
Hypothesis 2b: Multiplistic topic-specific, epistemic beliefs are negatively related to ascriptions of integrity, benevolence, and expertise.
Design
To enhance Study 1 while simultaneously replicating it conceptually, we planned to test both hypotheses as within- and between-persons effects simultaneously (see the “Discussion” section for Study 1). Therefore, it must be ensured that (1) every participant reads assertions stemming from different sources and that the number of texts per source is balanced for each participant (within-person component), (2) no participant reads the same assertion more than once, and (3) combinations of topics and sources, as well as the sequence of topics and sources, cannot confound our results. Since the main focus of the present study is the distinction between scientific and practitioner sources, and since the manipulation check regarding the “expert source” level indicated some problems, we decided to drop this “intermediate” level. This also reduces cognitive load on the participants, allowing us to include all four topics from Study 1 seamlessly without running into randomization problems.
To randomize the different sources and topics, we first created a Latin square of the four topics to achieve (incomplete) counterbalance (DePuy & Berger, 2014) in the topic factor. Subsequently, we repeated this procedure six times and addressed all possible sequences of the two texts regarding the sources “practitioner” and “scientific study” (see Table 4) to counterbalance this factor as well.
Table 4. Design of Study 2: Counterbalanced Sequences of Topics and Sources
Note. we = learning from worked-out examples; cm = cognitive theory of multimedia learning; bm = bullying/mobbing; cs = classroom size effects on achievement; prac = practitioner; sci = scientific study.
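The counterbalancing logic described above can be illustrated with a short sketch. It is not the procedure actually used to build the questionnaires: the cyclic form of the Latin square, the function names, and the order in which topic rows and source sequences are crossed are assumptions made for illustration. The sketch only shows why four topic orders crossed with the six possible arrangements of two “prac” and two “sci” texts yield 24 questionnaire versions.

```python
from itertools import combinations

TOPICS = ["we", "cm", "bm", "cs"]  # topic abbreviations from the table note


def cyclic_latin_square(items):
    """Return a cyclic Latin square: row i is the item list rotated by i.

    Assumption: the study's square may have been constructed differently.
    """
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


def balanced_source_sequences(n_positions=4, n_prac=2):
    """All orderings with exactly n_prac 'prac' texts; the rest are 'sci'."""
    sequences = []
    for prac_positions in combinations(range(n_positions), n_prac):
        sequences.append(
            ["prac" if pos in prac_positions else "sci" for pos in range(n_positions)]
        )
    return sequences


def build_questionnaires():
    """Cross every topic order with every source order (hypothetical helper)."""
    return [
        list(zip(topic_row, source_row))
        for source_row in balanced_source_sequences()
        for topic_row in cyclic_latin_square(TOPICS)
    ]


questionnaires = build_questionnaires()
print(len(questionnaires))   # number of questionnaire versions
print(questionnaires[0])     # first version as (topic, source) pairs
```

In this sketch every topic appears once per version and is paired with each source equally often across versions, which is the property the design relies on.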
Procedure and Materials
All materials in Study 2 were identical to those used in Study 1, but, in line with the different design of Study 2, the experimental procedure differed: Participants were randomly assigned to 1 of 24 different questionnaires (see Table 4 and online Supplemental Appendix 4 for the questionnaires of Study 2). To achieve this, we used an urn model (e.g., Wei, 1978) and true random numbers obtained at www.random.org to ensure that every questionnaire was filled out with the same frequency. Just like in Study 1, the participants first filled out the FREE questionnaire (domain-specific measurement of epistemic beliefs) and then went through the four topics (with the sequence and sources of the assertions depending on the questionnaire). After each topic, they filled out the METI, a topic-specific multiplism scale (see “Measurements” section below), and the items of the treatment check. Finally, the participants were asked for some demographic data.
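The idea of an urn that guarantees equal questionnaire frequencies can be sketched as follows. This is a simplified illustration, not the assignment procedure of the study: it draws without replacement from an urn holding a fixed number of copies of each questionnaire, and it uses Python's pseudo-random generator as a stand-in for the true random numbers from www.random.org. The function name and the default of 11 copies per questionnaire (taken from the planned sample size of 264) are assumptions.

```python
import random


def urn_assignment(n_questionnaires=24, copies_per_questionnaire=11, seed=None):
    """Return a random order of questionnaire numbers with equal frequencies.

    Assumption: a simple urn drawn without replacement; pseudo-random numbers
    replace the true random numbers mentioned in the text.
    """
    rng = random.Random(seed)
    urn = [
        questionnaire
        for questionnaire in range(1, n_questionnaires + 1)
        for _ in range(copies_per_questionnaire)
    ]
    rng.shuffle(urn)  # the shuffled urn is the hand-out order
    return urn


handout_order = urn_assignment(seed=2019)
print(len(handout_order))    # total number of questionnaires in the urn
print(handout_order[:10])    # questionnaire numbers for the first ten participants
```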
Measurements
As mentioned above, we used the FREE and METI to assess domain-specific epistemic beliefs and epistemic trustworthiness, respectively. Both instruments are described in the “Methods” section for Study 1 and are provided at full length in online Supplemental Appendix 2, where all other instruments can be found as well. Additionally, we measured topic-specific multiplism by means of the “topic-specific multiplism” scale (4-point Likert-type scale), which was developed by decontextualizing the FREE’s multiplism items (Merk, Schneider, Syring, & Bohl, 2016) and has demonstrated good psychometric properties in several studies (Merk, Kelava, et al., 2017; Merk, Rosman, et al., 2017).
Statistical Analyses
Psychometric Properties
The psychometric properties of the only between-person measurement (FREE) were evaluated just like in Study 1. We first ran a CFA with τ-congeneric measurement models and allowed for residual covariances identified by modification indices (Standardized Expected Parameter Change; Whittaker, 2012). Reliability (internal consistency) was assessed using McDonald’s ω. Just like in Study 1, indicators of acceptable/good fit were CFI and TLI values that exceed .90/.95, RMSEA values lower than .10/.06, and SRMR values inferior to .08/.05 (Browne & Cudeck, 1992; Hu & Bentler, 1999).
The factorial validity of the within-person measurements (METI, topic-specific multiplism, and treatment check) was assessed using multilevel confirmatory factor analysis (MCFA; Mehta & Neale, 2005; B. O. Muthén, 1994). To do so, we specified MCFA models with τ-congeneric measurement models at each level. We applied the same cutoff values for model-fit evaluation as in the single-level case (FREE; see above) but calculated SRMR separately for each level, with SRMRBetween values smaller than .10/.05 indicating acceptable/good fit.
Confirmatory Analysis Plan
For all statistical tests, the cutoff for statistical significance was a p value of .05. Our design produces clustered data, as each individual is subjected to the within-person measurements four times (once for each text). Hence, multilevel regression is an appropriate method for modeling within-person variations and between-person differences simultaneously (Gelman & Hill, 2007; Raudenbush & Bryk, 2002). Research Question 1 deals with source effects on epistemic trustworthiness, which we examined on both within-person and between-person levels.
On the within-person level, we were interested in whether METI scores vary intra-individually depending on source differences (practitioner vs. scientific study) between the four texts that each participant read. Therefore, we specified random-intercept models with a dummy-coded indicator variable (practitioner: value 1; scientific study: value 0) indicating the source as a predictor of each of the three dimensions of the METI (in three models named M1a, M1b, and M1c). These effects were tested with t tests on the fixed effects and likelihood ratio tests (LRTs, as opposed to an intercept-only model; Hox, 2010).
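For readers who prefer a formal statement, a random-intercept model of this kind can be written as follows. The notation is generic and chosen for illustration; it is not taken from the preregistration, and the normality assumptions are the conventional ones for such models.

```latex
\begin{aligned}
y_{ti} &= \beta_{0} + \beta_{1}\, I^{\text{source = pr.}}_{ti} + u_{i} + e_{ti},\\
u_{i} &\sim \mathcal{N}(0, \sigma^{2}_{u}), \qquad e_{ti} \sim \mathcal{N}(0, \sigma^{2}_{e})
\end{aligned}
```

Here, y_{ti} denotes the score of participant i on one METI dimension for text t, I^{source = pr.}_{ti} is the dummy variable (1 = practitioner, 0 = scientific study), u_{i} is the person-specific random intercept, and e_{ti} is the within-person residual. The fixed effect β_1 is then the average within-person difference between practitioner and scientific-study texts.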
On the between-person level, we were interested in whether differences exist between individuals in METI scores on the same topics, depending on our source manipulation of the respective texts. Since each participant responded to exactly two “practitioner” and two “scientific study” texts (see Table 4), this must be tested separately for each topic. Hence, the dimensions of epistemic trustworthiness were regressed on the source (here coded as a dummy variable) in four single-level path models, one for each topic (Models M2a–M2d). These between-person effects were statistically evaluated using t tests for the (standardized) path coefficients. For M2a to M2d, missing values were handled with a model-immanent approach, namely full information maximum likelihood estimation (Finkbeiner, 1979).
To answer Research Question 2 regarding both within- and between-person effects, we extended models M1a, M1b, and M1c with topic-specific multiplism as a within-person level predictor and the FREE’s d-index as a between-person level predictor. The resulting models were labeled M3a, M3b, and M3c, respectively. Topic-specific multiplism was centered on the cluster means, so that the predictive effects of the d-index can be interpreted as effects on the person-specific means of the respective epistemic trustworthiness dimension (Trautwein & Lüdtke, 2009): Hence, one can view the effects of the source and topic-specific multiplism as within-person effects simultaneously modeled with the between-person effect of the d-index. Fixed effects were tested using t tests, along with LRTs of corresponding nested models (e.g., M1a vs. M3a).
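Centering on the cluster means can be illustrated with a minimal sketch. The record layout, the key names, and the function name are hypothetical; the sketch only shows the arithmetic of subtracting each person's own mean from that person's topic-specific scores, so that the centered predictor carries purely within-person variation.

```python
from collections import defaultdict


def center_on_cluster_means(records, person_key="person", score_key="tm"):
    """Return copies of the records with person-mean-centered scores added.

    Assumption: each record is a dict holding a person identifier and one
    topic-specific multiplism score ("tm"); both key names are illustrative.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        totals[rec[person_key]] += rec[score_key]
        counts[rec[person_key]] += 1

    centered = []
    for rec in records:
        person_mean = totals[rec[person_key]] / counts[rec[person_key]]
        centered.append(
            {**rec, "person_mean": person_mean, "centered": rec[score_key] - person_mean}
        )
    return centered


ratings = [
    {"person": 1, "tm": 1.0},
    {"person": 1, "tm": 3.0},
    {"person": 2, "tm": 2.0},
    {"person": 2, "tm": 4.0},
]
centered_ratings = center_on_cluster_means(ratings)
for row in centered_ratings:
    print(row["person"], row["person_mean"], row["centered"])
```

Because the centered scores sum to zero within each person, they are uncorrelated with any person-level predictor such as the d-index, which is what allows the within- and between-person effects to be interpreted separately.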
Handling of Missing Data
As we used paper-and-pencil questionnaires, it was very likely that we would have to deal with missing data. To avoid problems associated with “naïve” handling of missing data (e.g., listwise deletion), we imputed the data by means of multiple imputation under a joint modeling perspective (Schafer & Yucel, 2002) using the R-package “pan” (Zhao & Schafer, 2016), as this package is specialized for the multiple imputation of multilevel data. Just like in Study 1, we tested the models described above on each complete data set and combined the results using the rules proposed by Rubin (1987).
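The pooling step can be sketched in a few lines. The sketch implements the standard combination rules for a single scalar parameter (pooled estimate, total variance, and relative increase in variance); the function name and the example numbers are invented for illustration and are not results from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, squared_standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules.

    estimates: point estimates from the m completed data sets.
    squared_standard_errors: the corresponding sampling variances.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within_variance = mean(squared_standard_errors)
    between_variance = variance(estimates)  # sample variance, divisor m - 1
    total_variance = within_variance + (1 + 1 / m) * between_variance
    relative_increase = (1 + 1 / m) * between_variance / within_variance
    return {
        "estimate": pooled_estimate,
        "se": total_variance ** 0.5,
        "riv": relative_increase,
    }


# Invented example values for three imputed data sets.
pooled = pool_rubin([0.2, 0.3, 0.4], [0.01, 0.01, 0.01])
print(pooled)
```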
Sampling Plan
Recruitment
Study participants were recruited from several teacher education courses at the University of Tübingen, Germany. Inclusion criteria were that chosen participants (1) were teacher education students at the University of Tübingen and (2) had not participated in Study 1. Adherence to these inclusion criteria was ensured by respective promotion and assessment of the coherent covariates. Participation was voluntary and took place during class time. As an incentive, all participants could participate in a lottery for vouchers worth €50.
Power Analysis
Evaluating the statistical power of the multilevel regression and single-level path models was a challenge because it depends on several factors that can be determined only empirically (e.g., variable distribution or amount of missing data). To anticipate the statistical power of the models in Study 2, we used the results from Study 1, using the Monte Carlo approach (L. K. Muthén & Muthén, 2002), in which a large set of sample data is drawn from a hypothesized population model, and parameters and standard errors are estimated for each of the sample data sets, which are then averaged.
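The general logic of a Monte Carlo power analysis can be shown with a deliberately simplified sketch. It does not reproduce the population models of this study: it simulates a plain two-group comparison of normally distributed scores and uses a normal approximation for the test statistic. The effect size, the group sizes, the number of replications, and the function name are illustrative assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def _sample_variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate_power(n_per_group, effect_size, n_sims=1000, seed=1):
    """Estimate power as the share of simulated samples with |z| > 1.96.

    Assumption: two independent groups with unit variance whose means differ
    by `effect_size`; the critical value 1.96 is a large-sample approximation.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        standard_error = (
            _sample_variance(group_a) / n_per_group
            + _sample_variance(group_b) / n_per_group
        ) ** 0.5
        z = (_mean(group_b) - _mean(group_a)) / standard_error
        if abs(z) > 1.96:
            rejections += 1
    return rejections / n_sims


power_large = simulate_power(n_per_group=132, effect_size=0.5, n_sims=300)
power_small = simulate_power(n_per_group=10, effect_size=0.5, n_sims=300)
print(power_large, power_small)
```

The same loop structure (draw a sample from a hypothesized population, fit the model, record whether the effect is detected) carries over to more complex models; only the data-generating step and the fitted model change.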
To evaluate the power of the planned multilevel regression analyses, we set up an artificial data set corresponding to our design and subsequently sampled values for the d-index (which are independent of the experimental condition; see “Design” section above) from Study 1. In the next step, we simulated a sample of topic-specific multiplism, considering the effects of each condition’s source and topic, as well as the association with the d-index, using the R package “simr” (Green & MacLeod, 2016). Finally, we carried out the simulation study with a conservative setting: For the predictive effect of topic-specific multiplism, we assumed a “small” effect following Cohen’s benchmarks (Cohen, 1988). For the FREE’s d-index, we used the smallest effect size from Study 1. As can be seen in Figure 3, the traditional benchmark of power >.80 is achieved for both effects at a sample size of approximately 264 (11 individuals per questionnaire).

Figure 3. Power obtained by Monte Carlo studies for predictive effects of topic-specific multiplism and the d-index.
To anticipate the statistical power of the planned models M2a to M2d, we again used a Monte Carlo approach based on the data from Study 1. As only two sources will be used in Study 2 (practitioner and scientific study), we subsetted the data from Study 1 accordingly, ran a path model that predicts METI dimensions by a dummy variable of the source (1 = scientific study, 0 = practitioner), and used the resulting parameters as population parameters for the Monte Carlo study (see online Supplemental Appendix 3). The results of this Monte Carlo study indicate that, at a sample size of 264, coverage (proportion of results on simulated data for which the 95% confidence intervals include the true parameter value; L. K. Muthén & Muthén, 2002) and power of path coefficients (from the dummy variable to the METI dimensions) all exceed .92. Hence, we conclude that sample sizes above 264 are appropriate for Study 2. However, as greater sample sizes result in greater statistical power, we chose to recruit at least 264 participants from a specific list of courses, but not stop the data collection at a sample size of 264. To avoid problems through so-called optional stopping (John, Loewenstein, & Prelec, 2012), we first carried out all surveying (throughout the listed courses) before starting data analysis. The raw data sets from both studies will be published and archived via PsychData (Leibniz Institute for Psychology Information, Trier, 2018) and are also available at the corresponding Open Science Framework repository (Merk & Rosman, 2019).
Results
Sample
Following our sampling plan, we reached the intended sample size after the first course, which led to a final sample size of N = 278 (MSemester = 7.41, SDSemester = 0.30; 187 females). The proportion of participants studying at least one STEM subject was 36.0%.
Measurements
We investigated the psychometric properties of the measurement instruments following our preregistered analysis plan. The main results are shown in Table 5, with additional details presented in the Reproducible Analysis Report of Study 2 (see online Supplemental Appendix 6). Overall, the factor structures of the instruments were confirmed with the exception of the treatment check (see below); reliabilities were fairly high, with all McDonald’s ω values exceeding .73.
Table 5. Psychometric Properties of the Measurements Used in Study 2
Note. TLI = Tucker–Lewis index; CFI = comparative fit index; SRMR = standardized root mean square residual; FREE = German FREE questionnaire; METI = Muenster Epistemic Trustworthiness Inventory. If measurements were multiply applied within persons, McDonald’s ω was computed separately for each topic. The corresponding minimum and maximum values are given in the table.
Treatment Check
In a departure from our preregistered analysis code (see online Supplemental Appendix 5), we specified only one factor at the between-level within the MCFA (see Figure 4) of the treatment check scales, due to the poor fit of the preregistered model. This model is also theoretically plausible, as between-person scores of the treatment check can be interpreted as averages per person. This modified model yielded a very good fit, and the y-standardized path coefficients of the extended MIMIC model provided strong evidence for a successful treatment. For example, students who read texts containing information allegedly stemming from practitioners scored, on average, more than one and a half standard deviations higher on the practitioner rating scale.

Figure 4. Results of the treatment check in Study 2.
Research Question 1
To investigate the hypothesized “smart but evil” stereotype, we preregistered a series of models testing it at the within-person level (M1a–M1c) and at the between-person level (M2a–M2d; see the “Confirmatory Analysis Plan” section for details). The results of Models M1a to M1c are shown in Table 6: The regression coefficients of the dummy-coded indicator variable of the source Isource = pr. (1 if source = practitioner, 0 otherwise) were significant in all models, and the regression weights indicated effects of moderate size in the expected direction. We thus infer that our participants exhibit a “smart but evil” stereotype at the within-person level. This stereotype manifested itself partially at the between-person level when we predicted expertise, benevolence, and integrity by Isource = pr. separately for each topic (M2a–M2d, see Figure 5): 9 out of 12 regression coefficients were significant, with most indicating moderate effect sizes.
Table 6. Standardized Results of the Random Intercept Models for Research Question 2 (Study 2)
Note. Isource = pr. = dummy-coded indicator (1 if source = “practitioner”, 0 otherwise); tm = topic-specific multiplism; di = d-index; LRT = likelihood ratio test; RIV = relative increase in variance due to nonresponse; FMI = fraction of missing information. Boldfaced coefficients indicate p < .05.

Figure 5. Results of Models M2a/M2b/M2c/M2d testing the “smart but evil” hypothesis on the between-person level separately for each topic.
Research Question 2
To investigate Research Question 2, we expanded Models M1a to M1c with topic-specific multiplism as the within-person predictor and the d-index as the between-person predictor of epistemic trust (M3a–M3c), just as we envisaged in the preregistered analysis plan. In line with our hypotheses, topic-specific multiplism significantly predicted expertise, benevolence, and integrity, with small to moderate effects. Contrary to our hypotheses, however, the point estimate of the regression weight of the d-index was very small and not significant.
General Discussion
In this registered report, we experimentally investigated student-teachers’ epistemic trust in educational scientists compared with experts and practitioners. In one exploratory study and one strictly confirmatory and preregistered study, we found strong evidence for a “smart but evil” stereotype, mainly in accordance with our hypotheses: Student-teachers judged educational scientists as having more expertise but less benevolence and less integrity than practitioners from the educational domain (Hypothesis 1). The between-person effects from Study 1 were larger than those from Study 2, which were also nonsignificant in 3 (out of 12) cases. Furthermore, we found strong evidence for a negative association of topic-specific multiplism and epistemic trust (Hypothesis 2b) but more inconclusive evidence regarding a positive association of domain-specific evaluativistic beliefs and epistemic trust (Hypothesis 2a).
Apart from the benefits that arise from preregistration, the fact that we controlled for the topics and knowledge claims included in our texts underlines the robustness of these findings. However, predicting this stereotype with epistemic beliefs was only partially successful: While topic-specific multiplism was significantly related to trustworthiness in both studies, domain-specific epistemic beliefs only predicted trustworthiness in Study 1. In the paragraphs below, we first discuss the methodological strengths and weaknesses of both studies; subsequently, we suggest future directions for research and potential practical consequences for teacher education.
A major strength of the present research is that we were able to replicate an interesting exploratory finding (Study 1) using a comparatively strong confirmatory approach (preregistered Study 2). In fact, preregistration has been shown to lower the likelihood of false-positive findings (Nelson, Simmons, & Simonsohn, 2018), with calls for replications becoming more pronounced in recent years (Makel & Plucker, 2014). Furthermore, investigating the “smart but evil” stereotype using an experimental design is an additional strength of our studies because carefully counterbalancing different combinations of sources and topics should increase the internal validity of our results. Finally, internal validity was also strengthened by the fact that we investigated our hypotheses using both between-person and within-person designs. Concurrently, however, external validity is limited by the fact that all participants studied at the same university.
Furthermore, it should be pointed out that the 3 (of 12) nonsignificant p values regarding the between-person effects in Study 2 are inconclusive (Amrhein, Greenland, & McShane, 2019; Dienes, 2014); it remains unclear whether they result from insufficient statistical power or from the absence of effects. This points to a central weakness in our studies: Whereas presenting our participants with specific topics and assessing trustworthiness and epistemic beliefs regarding those topics allowed us to construct an internally valid study, external validity might have suffered from this approach. For example, one cannot directly conclude what would have happened if we had chosen another set of topics. Therefore, even though we chose a set of fairly typical educational topics, generalizing our findings to other topics or to the domain of educational research generally should only be done with caution. Moreover, we concede that our effect sizes were somewhat smaller than expected, which might be caused by our rather minimal manipulation (only changing certain textual cues). The effects might thus be stronger in a study with higher external validity, for example, when confronting students with actual teachers or scientists. Hence, according to Prentice and Miller (1992), even the small effects in our study might have considerable behavioral implications, but this assumption should of course be tested in future studies.
Another inference that should be handled carefully concerns the results for our second research question. In fact, the effects of epistemic beliefs on epistemic trustworthiness were inconsistent both between the two studies (significant effects of the d-index on epistemic trustworthiness in Study 1 only) and within Study 2 (significant effects of topic-specific multiplism, but no effects of the [domain-specific] d-index). This may be explained by a theoretical assumption pointed out earlier by Schraw (2001) and Bråten and Strømsø (2010), who emphasize that epistemic beliefs at different specificity levels may have the strongest impact on dependent variables that are at the same levels of specificity. This is consistent with our findings from Study 2, in which topic-specific multiplism significantly predicted topic-specific trustworthiness, whereas the domain-specific d-index did not.
In addition to the limitations mentioned above, we emphasize that the studies presented here focused on the existence of the “smart but evil” stereotype, not on its genesis or consequences. Both topics may be fruitfully studied in the future. The theoretical outline presented above suggests that student-teachers, on one hand, are obliged to trust the utterances of educational researchers due to the cognitive division of labor. On the other hand, their epistemic vigilance should lower the risk of being manipulated through misinformation. But why do student-teachers show higher vigilance (as shown by lower ratings of benevolence and integrity) toward educational scientists than toward practitioners? This is an open question that could be investigated by referring to theories from social psychology such as intergroup relations (Brewer, 1999; Tajfel & Turner, 1986). For example, Brewer (1999) suggests that individuals usually ascribe higher trustworthiness to members of their in-group than to members of the out-group, and student-teachers might regard actual teachers as more of an in-group than educational researchers. Another direction of future research might address the generalizability of our findings to other universities, to different academic and professional domains (beyond educational science and teaching), and to other cultural contexts. All study materials and instruments are freely available at the Open Science Framework (Merk & Rosman, 2019), and we welcome direct or conceptual replications of our studies as well as related research. In particular, it might be interesting to investigate which consequences effects of this magnitude may have for (pre)service teachers’ behavior: Will they choose other sources (academic textbook vs. blog entry by a teacher) while preparing their lessons? Will they integrate information from various sources in a different way?
With regard to the practical implications of our findings—conceding that further knowledge about the genesis of the “smart but evil” stereotype is necessary to draw strong evidence-based conclusions—several assertions can be made. First, making oneself aware of the existence of the stereotype and talking about it with students may be a first step in overcoming its problematic nature. Second, one might strive to design interventions to directly increase student-teachers’ trust in educational research. In line with our deliberations on intergroup relations (see above), this might be done, for example, by referring to the method of imagined intergroup contact (e.g., Vezzali, Capozza, Stathi, & Giovannini, 2012). Third, considering the moderate impact of epistemic beliefs on epistemic trustworthiness, interventions that foster students’ epistemic beliefs might also be worthwhile in this context (Kerwer & Rosman, 2018; Rosman, Mayer, Merk, & Kerwer, 2019). Finally, we would like to issue a general call for transparency in research and stronger efforts in science communication: If researchers publicly preregister their hypotheses, share their materials and data and put more effort into communicating their results modestly and in plain language, teachers (and student-teachers) might trust them more.
Supplemental Material
Supplemental material, DS_10.1177_2332858419868158 for Smart but Evil? Student-Teachers’ Perception of Educational Researchers’ Epistemic Trustworthiness by Samuel Merk and Tom Rosman in AERA Open
Footnotes
Acknowledgements
This research was supported in part by the Institutional Strategy of the University of Tübingen (Deutsche Forschungsgemeinschaft, ZUK 63).
Authors
SAMUEL MERK is Junior Professor for Education at the University of Tübingen, Tübingen, Germany. He is interested in teacher education, epistemic beliefs, and open science.
TOM ROSMAN is a research associate at the Leibniz Institute for Psychology Information, Trier, Germany. He is interested in epistemic beliefs, information literacy, and frame of reference models.
References
