Abstract
Today’s economy demands higher order thinking skills (HOTS), and the public education system has a critical role in supporting students’ acquisition of HOTS. Yet, numerous studies have documented inequity in access to higher quality instruction that promotes HOTS, which could result in wide test score gaps in HOTS. In this study, I examined test score gaps in HOTS and explored instructional practices associated with HOTS, particularly among low-performing students, using large-scale international assessment data from the 2015 Trends in International Mathematics and Science Study. I found wide test score gaps in HOTS in mathematics between the lowest and highest socioeconomic status students and between White students and students of color. Instructional practices such as same-ability group work, asking students to work on problems with teacher guidance, and working on problems with no immediately obvious method of solution were positively associated with the test scores.
Today’s economy demands higher order thinking skills (HOTS) for innovation and development, especially in the science, technology, engineering, and mathematics (STEM) areas. A recent report by the World Economic Forum (2015) showed that jobs requiring nonroutine interpersonal and analytical skills have increased gradually, while jobs requiring routine manual and cognitive skills declined steadily between 1960 and 2010. The Pew Research Center (2016) also reported that jobs that require higher levels of analytical skills have increased by 77% since 1980. Many companies, organizations, and media now call for 21st-century skills, which generally include critical thinking skills, problem-solving skills, information and communication technology literacy, and leadership skills (Partnership for 21st Century Learning, n.d.; World Economic Forum, 2015), 1 and economic and social returns to these skills are likely to increase in the near future. HOTS can be viewed as a cognitive dimension of 21st-century skills and relates to critical thinking skills and problem-solving skills. Not possessing HOTS may, therefore, put one in a disadvantaged position in the future labor market.
The public education system has a critical role in supporting students’ acquisition of HOTS because these skills can be enhanced through student-centered instruction and are transferrable after they exit PreK–12 education systems (Guskey, 2007; Halpern, 1998; Hmelo & Ferrari, 1997). In fact, 60% of Americans think that public schools should have a vast responsibility for ensuring that future workers have the right skills to succeed in today’s and future economy (Pew Research Center, 2016).
Unfortunately, not all students receive adequate education to acquire HOTS. It is well documented that teacher quality is inequitably distributed across schools (e.g., Clotfelter et al., 2005; Clotfelter et al., 2007; Goldhaber et al., 2015; Hanushek et al., 2004; Lankford et al., 2002). This educator sorting contributes to inequity in instructional quality because low socioeconomic status (SES) schools and schools serving large proportions of students of color tend to have inexperienced teachers who are new to the profession, often struggle with foundational teaching skills, and lack the skills to implement student-centered learning (Goldhaber et al., 2015; Maas et al., 2018; Mehta, 2014; Noguera et al., 2015).
Academic tracking and instructional time may be other factors that contribute to unequal access to HOTS. Research shows that low-SES students and students of color are more likely to be placed in basic classes, which focus on mastery of low-level skills, and receive less instructional time (Burris & Welner, 2005; Gamoran, 1987; Mickelson, 2001; Oakes, 2005; Rogers et al., 2014).
Furthermore, students’ home environments could play a significant role in their readiness to learn from higher order thinking (HOT) activities. Research shows that students from low-SES families have fewer opportunities for HOT activities (e.g., Kalil, 2015; Kalil et al., 2016; Lareau, 2003; Phillips, 2011). It also documents that conversations between students of color and their parents tend to be directive and authoritarian, which gives the students less time to engage in HOT activities (Lareau, 2003).
Students studying at these schools and from such home environments could be exposed differentially to instructional practices that put the students in the center of learning and demand high cognitive processes. While prior research generally suggests that they benefit from such instructional practices (e.g., Boaler, 2002; L. M. Martin & Halpern, 2011; Zohar & Dori, 2003), it is theoretically possible they learn differently by student subgroup and type of instructional practices. As a result of these differences in school and home environments, test score gaps in HOTS may exist among student subgroups. Likely, students respond differently to instructional practices that engage students in HOT activities.
Surprisingly, most of the prior work has focused on subject-level test score gaps and has not explored the gaps in HOTS explicitly or the factors that relate to improvements in HOTS among low-performing student subgroups (e.g., Curran & Kellogg, 2016; Fryer & Levitt, 2004; Hanushek & Rivkin, 2006; Reardon, 2011; Reardon & Galindo, 2009). To advance our understanding of the test score gaps and instructional practices that are associated with HOTS, this study examines the gaps among student subgroups in eighth-grade mathematics and explores instructional practices that may be linked to HOTS using large-scale international assessment data from the 2015 Trends in International Mathematics and Science Study (TIMSS). It focuses on mathematics because the number of STEM occupations has been growing much faster than non-STEM occupations (Fayer et al., 2017), and mathematics skills are essential to succeed in STEM fields (National Council of Teachers of Mathematics, 2018). If the test score gaps in HOTS in mathematics persist, they are likely to increase inequity and inequality in later adult outcomes.
More formally, this study explores three research questions. First, it examines test score gaps in HOTS by SES and race/ethnicity. I hypothesize that White students and high-SES students outperform their counterparts. Second, it investigates instructional practices that may be related to higher test scores in HOTS. I hypothesize that student-centered instructional practices are associated with higher test scores in HOTS. Third, it examines whether the relationships between these instructional practices and the test scores vary by students’ SES and racial/ethnic backgrounds. I hypothesize that such variation exists. The third question aims to identify instructional practices for each student subgroup, particularly the lowest SES students and students of color, that relate to higher test scores in HOTS.
For the third research question, since HOT activities demand time and require students’ prior and current knowledge, instructional time and content coverage could moderate the relationships. This is a plausible hypothesis based on Carroll’s (1963) school learning model and prior research (de Jong & Lazonder, 2014; Etkina et al., 2008; Luyten, 2017; Masek & Yamin, 2011; Polikoff & Porter, 2014; Zimmerman, 2007). I examine whether a moderation effect exists by instructional time and/or content coverage. Because of space limitations, I report and discuss these results in the online supplement.
It is important to note that I use the term test score gaps throughout the article instead of the commonly used term achievement gaps, because the latter has negative connotations that treat White, affluent student achievement as a norm (e.g., Love, 2004). The use of this language may affect the priorities of educators and lead us to develop short-term solutions that do not address the root causes of the problem (Ladson-Billings, 2006; Quinn et al., 2019). It is also important to note that many other factors contribute to the test score gaps. Disentangling their effects is methodologically challenging and beyond the scope of this study. Instead, the current study analyzes the gaps and the associations between instructional practices and test scores descriptively and does not intend to make a causal inference. The results point to directions for further investigation in which researchers may use a more rigorous research design to estimate a causal effect.
This article proceeds as follows. The first section describes HOTS and discusses instructional practices that could be theoretically related to HOTS and some sources of test score gaps in HOTS. The next section reviews prior studies on test score gaps. The following section describes the TIMSS assessment data and explains the methods used. After the method section, it reports findings and concludes with discussions and implications.
Higher Order Thinking Skills
Many people use the term, HOTS, to describe some form of complex thinking that demands high cognitive processes. It was a term developed to merge two different perspectives about critical thinking from the field of philosophy, which viewed it as evaluation or judgment, and from the field of psychology, which viewed it as problem solving (Lewis & Smith, 1993). HOT is defined as thinking that occurs “when a person takes new information and information stored in memory and interrelates and/or rearranges and extends this information to achieve a purpose or find possible answers in perplexing situations” (Lewis & Smith, 1993, p. 136). As the historical development of the term indicates, HOT includes evaluation or judgment, problem solving, as well as creative thinking, and decision making (Lewis & Smith, 1993). Since HOTS results from a merger of the two different perspectives on critical thinking, it also includes broadly defined critical thinking as its component. Reasoning or productive thinking is part of problem solving, as it is used to integrate past experiences that have not been associated to find a solution to a novel challenge (Lewis & Smith, 1993).
Other researchers, more recently, define HOTS differently but their definitions generally capture similar components. For example, Schraw et al. (2011) view HOTS as thinking skills composed of reasoning, argumentation, problem solving and critical thinking, and metacognition. Brookhart (2010) defines HOTS in terms of transfer (i.e., making sense of and being able to use what one has learned), critical thinking, and problem solving. Richland and Simms (2015) argue that the underlying mechanism across these components is relational or analogical reasoning, which is “the process of representing information and objects in the world as systems of relationships, such that these systems of relationships can be compared, contrasted, and combined in novel ways depending on contextual goals” (Richland & Simms, 2015, p. 177).
Bloom’s original and revised taxonomy provides additional inputs about what HOTS means. It classifies learning objectives into major categories in the cognitive process dimension (Krathwohl, 2002). The revised framework, as shown in Figure 1, includes the following six categories: remember, understand, apply, analyze, evaluate, and create (see Bloom et al., 1956, for the original taxonomy, and Anderson et al., 2001, for the revised one). The framework also includes the knowledge dimension, which is foundational to performing the six cognitive processes. These process categories are ordered in terms of their complexity, with remember being the least complex and create being the most complex; yet the framework is not strictly hierarchical (Ennis, 1993). The categories are allowed to overlap and are theoretically interdependent (Ennis, 1993; Krathwohl, 2002). The upper three levels are closely aligned with several definitions offered earlier and are generally considered HOTS (Richland & Simms, 2015; Tankersley, 2005, as cited in Hitchcock, 2020; Zohar & Dori, 2003). The definition and example of each of these cognitive processes are provided in Appendix Table 1, which is part of Table 5.1 in Anderson et al. (2001).

Figure 1. Bloom’s taxonomy.
HOTS Specific to Mathematics
In mathematics, HOTS is implicitly embedded in the Common Core State Standards for Mathematics (CCSSM). 2 The CCSSM includes content and practice standards for mathematics, and the latter standards describe eight cognitive processes (Common Core State Standards Initiative, n.d.). The practice standards include standards in the following eight domains: (1) make sense of problems and persevere in solving them; (2) reason abstractly and quantitatively; (3) construct viable arguments and critique the reasoning of others; (4) model with mathematics; (5) use appropriate tools strategically; (6) attend to precision; (7) look for and make use of structure; and (8) look for and express regularity in repeated reasoning. All of them are closely related to HOTS. Richland and Begolli (2016) offer a detailed analysis of these processes concerning analogical reasoning, which they argue is the underlying mechanism of HOTS.
Instructional Practices Related to Students’ HOTS
The literature on cognitive psychology and teaching and learning provides scientific evidence on instructional practices that are positively associated with students’ HOTS. One approach is classroom and small-group discussion and dialogue. Studies have found that dialogue and discussions can promote student learning and help students develop HOTS (e.g., Abrami et al., 2015; King, 1992; Miri et al., 2007; Murphy et al., 2016; Slavin, 2011; Snyder & Snyder, 2008). In this approach, teachers encourage students to discuss their thoughts with their classmates and explain their answers and reasons. This provides the students with opportunities to think deeply about concepts; integrate them with their prior knowledge; evaluate and assess their thoughts, assumptions, and understandings; and make a decision about what to believe or do (Bailin et al., 1999; Bonney & Sternberg, 2011; King, 2002). This higher order metacognitive process helps them construct new knowledge. Teachers may facilitate the discussion through sets of questioning strategies such as the Ask to Think Tel-Why model, the Cognitive Tools and Intellectual Roles approach, and the Guided Reciprocal Peer Questioning approach (Gillies, 2011; King, 2002).
Teachers may use mixed or same ability grouping for small group discussions. The literature provides mixed evidence about the effectiveness of ability grouping. A recent meta-analysis found that students learn more when they are grouped by ability within the classroom (Steenbergen-Hu et al., 2016). The mixed-ability grouping may enhance students’ HOTS as well if it provides students opportunities to experience HOT activities. Yet, if group work is designed such that a single group product is required and/or no group reward/incentive is provided, less able students may not learn very much because more able students do most of the work (Slavin, 2011).
Another approach teachers may use is inquiry-based learning. In this approach, teachers provide their students enough time and adequate scaffolding/guidance to explore a topic or work on a problem within parameters set by the teachers (Marshall & Horton, 2011). During the self-directed investigation, the students may draw on their prior knowledge and understanding (Marshall & Horton, 2011). Inquiry-based learning is found correlated with high-level cognitive thinking (Lazonder & Harmsen, 2016; Marshall & Horton, 2011). The effectiveness of inquiry-based learning depends on whether the teachers provide appropriate scaffolding/guidance (Lazonder & Harmsen, 2016).
Problem-based learning (PBL) is also positively associated with HOTS if it provides well-designed problems and adequate scaffolding (e.g., Hung et al., 2008; Weiss, 2003). In PBL, students work on a complex, real-world problem where no immediately obvious method of solution is available. Such a problem requires knowledge slightly beyond the students’ current knowledge and engages them in collaborative inquiry in small groups (Lu et al., 2014; Weiss, 2003). The students activate their prior knowledge and apply it to the problem, discuss the nature of the problem with their peers and identify knowledge gaps, evaluate and synthesize proposed ideas, resolve controversies and make decisions based on the discussion, and refine their current knowledge and construct new understanding (Lu et al., 2014; Weiss, 2003).
These specific approaches use multiple instructional practices that do not have specific labels but are key to engaging students in active learning. Some of them include relating the lesson to students’ daily lives, asking them to explain their answers, encouraging them to express their ideas, and linking new content to their prior knowledge. Research shows that these practices are positively related to HOTS. For example, Miri et al. (2007) found that students improved their HOTS when they dealt with real-world problems, engaged in open-ended discussions, and experienced inquiry-oriented experiments. Slavin (2011) argues that one of the most effective ways that facilitate the elaboration of the content is to have students explain the content to someone else.
These and other instructional practices not described above necessitate prior knowledge, and students with such knowledge would benefit more from HOT activities (Richland & Simms, 2015). This does not mean that students with less prior knowledge do not benefit from such activities. Research shows evidence that students develop HOTS, whether their initial academic skill levels are low, average, or high (e.g., Crosnoe et al., 2010; Zohar & Dori, 2003), suggesting that a mastery of basic skills may not be always necessary to acquire HOTS.
Instructional Time and Content Coverage
Carroll’s (1963) school learning model posits that the degree of student learning is a function of time needed for learning and time spent on learning, the former of which is a function of aptitude, ability to understand instruction, and quality of instruction. The latter is a function of time allowed for learning and perseverance. The model suggests that, while students’ abilities, attitudes, and dispositions play a role in their learning, students may generally learn more if instructional time increases. Given that HOT activities require a good amount of time, this interactive effect is plausible. Prior research and reviews suggest this possibility (e.g., de Jong & Lazonder, 2014; Etkina et al., 2008; Masek & Yamin, 2011; Zimmerman, 2007).
Another important factor that arises from Carroll’s model is an opportunity to learn. Although time allowed for learning is labeled as opportunity to learn in his model, the concept now involves more than a simple time dimension (Floden, 2002). A commonly used definition focuses on content coverage or the extent of student exposure to assessed topics (e.g., Scheerens, 2017; Schmidt et al., 2015; Schmidt et al., 2019). 3 Limited exposure to content results in less prior and current knowledge, which may affect the effectiveness of the instructional practices (e.g., Richland & Simms, 2015). Similar to instructional time, this suggests a possible interactive relationship between the instructional practices and content coverage (e.g., Luyten, 2017; Polikoff & Porter, 2014).
(Some) Sources of Test Score Gaps in HOTS
Test score gaps in HOTS could result from systematic differences in students’ school and home environments, as underscored in the ecological framework (Bronfenbrenner, 1979, 1986). An in-depth discussion of each possible source is beyond the scope of the current study. Instead, I focus on some of the important factors that are likely to be associated with the gaps in HOTS.
The first factor is teacher experience and sorting. Some studies found that less experienced teachers tend to ask students more factual and lower order thinking questions; other studies reported that new teachers often struggle with classroom management (e.g., Castro et al., 2010; Dias-Lacy & Guirguis, 2017; He & Cooper, 2011). Since students of color and low-SES students are more likely to be assigned to less experienced teachers (e.g., Clotfelter et al., 2005; Clotfelter et al., 2007; Goldhaber et al., 2015; Hanushek et al., 2004; Lankford et al., 2002), their opportunities for HOT activities may be limited.
Another possible source of the gap is teacher expectation. HOT demands high-level cognitive processes. Some teachers think that it is less appropriate for low-performing students because they perceive that learning occurs hierarchically and HOT occurs after mastering prerequisite skills (Zohar et al., 2001). They tend to believe that HOT activities are more effective with high-SES, high-achieving students (Warburton & Torff, 2005). If teachers collectively hold low expectations for HOTS among students of color and low-SES students, it forms a poor academic culture at the school, and the students may suffer from instruction that rarely requires HOTS.
Academic tracking deprives students of color and low-SES students of opportunities for HOT activities. Research shows that they are more likely to be placed into basic classes and programs, and their schools provide them limited access to advanced classes (Burris & Welner, 2005; Gamoran, 1987; Ladson-Billings, 1997; Mickelson, 2001; Oakes, 2005; Patrick et al., 2020). Teachers in these classes and programs tend to use traditional instructional practices and provide them fewer opportunities for HOT activities (Darling-Hammond, 2001; Desimone & Long, 2010; Ladson-Billings, 1997; Noguera et al., 2015). This suggests that teachers’ use of the instructional practices discussed earlier reflects academic tracking.
Schools serving large proportions of students of color and low-SES/income students generally struggle with implementing instructional practices for HOT (Maas et al., 2018; Noguera et al., 2015). These schools tend to be low-performing schools and have to shift resources to prepare students for standardized testing, which does not necessarily assess students’ HOTS (Au, 2007; He & Cooper, 2011; Noguera et al., 2015). Generally, teachers at these schools tend to be less effective at providing rigorous and engaging instruction, particularly for students with weaker academic and social–emotional skills (Maas et al., 2018).
Students’ home environments also play a role. Prior research shows that children from high-SES/income families are exposed to a variety of cognitive activities daily at home that may influence the development of HOTS. For example, their parents read books to their children when they are young, have language-rich conversations and ask HOT questions, and oversee homework completion (e.g., Altintas, 2016; Hart & Risley, 1995; Kalil, 2015; Wilder, 2014). On the other hand, low-SES/income parents tend to provide their children fewer cognitively stimulating activities at home (e.g., Kalil, 2015; Kalil et al., 2016; Lareau, 2003; Phillips, 2011). These differences in home environments could lead to test score gaps in HOTS. A recent study shows that family income affects the onset and the development trajectories of HOT talk among children between 14 months and 58 months (Frausel et al., 2020). It found that children from higher income families start using HOT talk earlier than those from lower income families. It also found that family income, parent education, and parent IQ are positively correlated with children’s HOT outcomes in grade school.
These systematic differences in students’ school and home environments could be linked to possible differences in students’ readiness for HOT activities. One student subgroup may learn more from a given instructional practice than other subgroups. Similarly, within subgroups, some instructional practices may be more strongly related to students’ HOTS than other practices.
Studies on Test Score Gaps
Myriad researchers and organizations have reported subject-level test score gaps among student subgroups since the 1966 Coleman Report. Although the magnitude of the gaps varies from study to study, they generally found low-SES students, students of color, and English language learner (ELL) students underperform their counterparts in standardized tests in mathematics, reading, and science at all grade levels (e.g., Clotfelter et al., 2009; Duncan & Magnuson, 2011; Fryer & Levitt, 2004; Hemphill & Vanneman, 2011; Jencks & Phillips, 1998; Reardon, 2011; Reardon & Galindo, 2009; Rumberger & Tran, 2010). For example, Clotfelter et al. (2009) found that the raw test score gap in Grade 3 between Black and White students was 0.78 SD in mathematics and 0.71 SD in reading. The gaps remained sizable even after regression-based adjustments. Reardon (2011) and Duncan and Magnuson (2011) reported that the test score gaps by income level and SES are more than 1 SD, which is roughly equivalent to 3 to 6 years of learning. The gap was more than double the size of the Black–White test score gap. The gaps between ELL students and non-ELL students are smaller and less than 1 SD (Hemphill & Vanneman, 2011; Reardon & Galindo, 2009; Rumberger & Tran, 2010).
These gaps already exist when children enter kindergarten and even among toddlers and preschoolers in terms of vocabulary and language development. For instance, toddlers raised by lower SES families are 6 months behind toddlers from higher SES families in language proficiency during the first 24 months (Fernald et al., 2013). Low-income children are exposed to fewer vocabulary words than children from high-income families, contributing to the language gap (Hart & Risley, 1995). When these children start kindergarten, they score 1.3 SD lower than those from low-need/higher SES families in entry mathematics skill assessments (Duncan & Magnuson, 2011). In science, a somewhat smaller but still sizable gap is observed between kindergarteners from lower and higher income families (Curran, 2017). By race and ethnicity, the test score gap is 0.82 SD between Black and White students and 0.94 SD between Hispanic and White students (Curran & Kellogg, 2016).
Compared with the volume of research on the subject-level test score gaps, test score gaps in HOTS have received scant attention in the literature. Such information is essential for practitioners to design better academic programs. The current study fills this gap in the literature.
Data
In this study, I used international large-scale assessment data from the TIMSS 2015, from which I extracted U.S. assessment data in eighth-grade mathematics. 4 TIMSS is a repeated cross-sectional international large-scale assessment study conducted by the International Association for the Evaluation of Educational Achievement every 4 years since 1995. The United States has participated in the study for all cycles since 1995. 5
TIMSS uses a stratified two-stage cluster sampling design with schools being the first sampling unit and intact classrooms being the second sampling unit (M. O. Martin & Mullis, 2012). In each administration, it samples students at Grade 4 and Grade 8 separately and assesses their cognitive skills in mathematics and science at the subject level and two domain levels (i.e., content and cognitive domains). In both grades and subjects, the cognitive domain consists of three domains: knowing, applying, and reasoning (Mullis et al., 2009). 6
In addition to the assessment, TIMSS collects contextual factors through surveys of students, their teachers, and their schools. All of these data can be merged into a single data file. For this study, after combining all TIMSS data files, I merged them with the 2014–15 school-level Common Core of Data (CCD) from the U.S. Department of Education to incorporate school information on the percentage of students eligible for the federal free/reduced lunch program, which was used to construct a socioeconomic status variable. 7
The sample consisted of 9,630 eighth-grade students in 500 classrooms taught by 390 math teachers at 230 public schools. 8 Students in private schools were excluded from this study because the focus was on whether the U.S. public education system is equitable in terms of teaching students HOTS and preparing them to be successful in the future economy.
TIMSS Assessments
The TIMSS assessment uses a matrix-sampling approach, in which the entire pool of assessment items at each grade level is divided into 14 booklets for each subject, and each student completes one assessment booklet per subject (Mullis et al., 2009). 9 Assessment items include multiple-choice and constructed-response questions, and at least one half of the total number of points comes from multiple-choice questions. The assessment takes 90 minutes, and an additional 30 minutes are spent on the questionnaire at the eighth grade (Mullis et al., 2009).
Because students’ abilities are measured based on a small set of assessment items, there is a substantial amount of measurement error. To address this problem and obtain unbiased parameter estimates, the TIMSS uses plausible values, which represent the likely distribution of a student’s academic ability (e.g., von Davier et al., 2009). The plausible values are randomly drawn from the posterior distributions, five times for each student. I used all of the five plausible values to estimate parameters through multiple imputation techniques with complex survey weights (see Allison, 2002; Little & Rubin, 2002). Standard errors were estimated through the Jackknife repeated replication variance estimation method.
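As a minimal sketch of how estimates from the five plausible values can be pooled, the snippet below applies the general multiple-imputation combining rules (often called Rubin's rules). It assumes that the same model has already been fitted once per plausible value and that a sampling variance for each estimate (for example, from jackknife repeated replication) is available; the function name and the numbers are hypothetical and are not taken from the TIMSS data or from this study's analysis code.

```python
from math import sqrt
from statistics import mean, variance


def combine_plausible_values(estimates, sampling_variances):
    """Pool one estimate per plausible value into a point estimate and SE.

    estimates: one coefficient (or mean) per plausible value.
    sampling_variances: the sampling variance of each estimate, assumed
        to come from a replication method such as the jackknife.
    """
    m = len(estimates)
    if m < 2 or m != len(sampling_variances):
        raise ValueError("need matching lists with at least two entries")
    point = mean(estimates)              # pooled point estimate
    within = mean(sampling_variances)    # average sampling variance
    between = variance(estimates)        # variance across plausible values (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, sqrt(total)


# Hypothetical estimates from five plausible values.
estimate, std_error = combine_plausible_values(
    [500.0, 502.0, 498.0, 501.0, 499.0],
    [4.0, 4.0, 4.0, 4.0, 4.0],
)
print(round(estimate, 2), round(std_error, 3))
```

The between-imputation term is what carries the measurement uncertainty: analyzing a single plausible value, or the average of the five, would omit it and understate the standard error.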
Cognitive Domain in TIMSS
The cognitive domain consists of three domains: knowing, applying, and reasoning (Mullis & Martin, 2013). The knowing domain includes six categories: recall, recognize, classify/order, compute, retrieve, and measure. These processes require relatively simple cognitive processes and do not demand HOT. The applying domain is composed of three categories: determine, represent/model, and implement. In this domain, students apply mathematical facts, concepts, and procedures they already understand to real-life situations or purely mathematical questions they are familiar with and solve problems (Mullis & Martin, 2013). Problem solving is central in this domain and part of HOTS; yet, the scope of this domain is limited to the application of knowledge to familiar situations and problems, rather than complex, novel situations. In this sense, this cognitive process does not fully capture HOTS. I used test scores in the knowing and applying domains as references.
The last cognitive domain, reasoning, includes the following six categories: analyze, integrate/synthesize, evaluate, draw conclusions, generalize, and justify. These cognitive processes match with HOTS that I described earlier and the CCSSM very well. I used test scores in the reasoning domain as HOTS in this study. 10 Figure 2 shows a sample of eighth grade mathematics questions in the applying and reasoning domains. Appendix Table 2 provides more detailed information about each cognitive process taken from the TIMSS 2015 Mathematics Framework (see Grønmo et al., 2013).

Figure 2. Sample assessment item in the (a) applying domain and (b) reasoning domain.
Variables
Race/Ethnicity
I used a race/ethnicity variable to create indicator variables for students’ race/ethnicity. It includes the following seven race/ethnicity categories: White (not Hispanic), Black (not Hispanic), Hispanic, Asian, Native American, Pacific Islander, and two or more races. I created indicator variables for the first four race/ethnicity categories and collapsed the rest into the other race/ethnicity category.
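The recoding described above can be sketched as follows. The category labels and the function name are hypothetical placeholders (the actual TIMSS variable uses numeric codes that are not reproduced here), and treating a missing response as missing on all indicators is an assumption rather than something stated in the text.

```python
# Categories that receive their own indicator variable (labels are illustrative).
FOCAL_CATEGORIES = ("White", "Black", "Hispanic", "Asian")


def race_ethnicity_indicators(category):
    """Return 0/1 indicators for the four focal categories plus 'Other'.

    Any remaining category (e.g., Native American, Pacific Islander,
    two or more races) is collapsed into 'Other'. A missing response
    (None) returns None instead of being coded as 'Other'.
    """
    if category is None:
        return None
    indicators = {name: int(category == name) for name in FOCAL_CATEGORIES}
    indicators["Other"] = int(category not in FOCAL_CATEGORIES)
    return indicators


print(race_ethnicity_indicators("Hispanic"))
print(race_ethnicity_indicators("Pacific Islander"))
```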
Socioeconomic Status
The TIMSS 2015 data files do not include a variable for students’ SES, so I created it using a set of variables available in the data files. Researchers suggest that SES components should include family income, parental educational attainment, and parental occupational status as well as neighborhood and school SES (Brunner, 2014; Cowan et al., 2012). Unfortunately, the TIMSS data files are limited regarding direct measures of some of these SES components. I constructed an SES variable using variables that can be used as proxies for some aspects of these SES components.
First, for family income, I used survey items related to possessions of such items as books, computers, internet connection, and own room. I also used survey items related to activities outside of school such as playing on a sports team, playing a musical instrument, and belonging to a club, because students often cannot engage in these activities without adequate financial capital. I did not use parental educational attainment, since more than 20% of the data were missing. 11 For school and neighborhood SES, I used a school-level variable on the percentage of students eligible for the federal free/reduced lunch program from the CCD data file.
To construct a single SES variable, I used principal component analysis based on polychoric correlations. I created a single, standardized SES variable and divided it into quintiles. 12 Appendix Table 3 provides a complete list of survey items and data used to create this variable.
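The construction of a single index from several survey items can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: Pearson correlations stand in for polychoric correlations, the items and their values are hypothetical, and the school-level lunch eligibility measure is omitted.

```python
import math


def standardize(col):
    """Rescale a list of numbers to mean 0 and (population) SD 1."""
    n = len(col)
    mean = sum(col) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
    return [(v - mean) / sd for v in col]


def first_component_scores(columns, iterations=200):
    """Score each row on the first principal component of the item correlations.

    Assumption: Pearson correlations replace the polychoric correlations
    described in the text.
    """
    z = [standardize(c) for c in columns]
    n, k = len(z[0]), len(z)
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
            for i in range(k)]
    w = [1.0] * k  # power iteration converges to the leading eigenvector
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(w[j] * z[j][r] for j in range(k)) for r in range(n)]
    return standardize(scores)


def quintiles(scores):
    """Assign 1 (lowest) to 5 (highest) by rank, splitting ranks evenly."""
    order = sorted(range(len(scores)), key=lambda r: scores[r])
    groups = [0] * len(scores)
    for rank, r in enumerate(order):
        groups[r] = rank * 5 // len(scores) + 1
    return groups


# Hypothetical items: books at home (1-5), own computer (0/1), club (0/1).
items = [
    [1, 2, 2, 3, 3, 4, 4, 4, 5, 5],
    [0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
]
ses = first_component_scores(items)
ses_quintile = quintiles(ses)
```

A production analysis would typically rely on a statistics package that estimates polychoric correlations for ordinal items and would apply the survey weights.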
Instructional Practices Related to HOTS
TIMSS data include sets of questions implicitly and explicitly related to the instructional practices described earlier. One set of questions asked teachers how often they relate the lesson to students’ daily lives, ask students to explain their answers, ask students to complete challenging exercises that require them to go beyond the instruction, encourage classroom discussions among students, link new content to students’ prior knowledge, ask students to decide their problem-solving procedures, and encourage students to express their ideas in class. These instructional practices underlie the specific instructional approaches described earlier and the CCSSM. I reverse-coded teachers’ responses with 1 being never and 4 being every or almost every lesson, took the mean across the seven questions, and labeled it as engaging instructional practices. 13 Appendix Table 4 reports the means and standard deviations of these seven survey items.
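The composite described above can be illustrated with a short Python sketch. It assumes that the raw survey coding runs from 1 (every or almost every lesson) to 4 (never), which is why the responses are reversed; the function names and the example responses are hypothetical.

```python
def reverse_code(response: int, low: int = 1, high: int = 4) -> int:
    """Flip a response so that the highest raw value becomes the lowest."""
    return low + high - response


def engaging_practices_score(raw_responses):
    """Mean of the reverse-coded responses across the seven items."""
    recoded = [reverse_code(r) for r in raw_responses]
    return sum(recoded) / len(recoded)


# Hypothetical raw responses from one teacher to the seven questions.
teacher_raw = [1, 2, 1, 3, 2, 4, 1]
score = engaging_practices_score(teacher_raw)
```

After reverse coding, higher values of the composite indicate more frequent use of the practices.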
Another set of questions asked teachers how often they ask students to work on problems (individually or with peers) with their guidance, work on problems for which there is no immediately obvious method of solution, work in mixed ability groups, and work in same ability groups. The first two practices relate to inquiry-based learning with scaffolding/guidance and PBL, 14 and the last two practices concern small group learning. As with the first set of questions, I reverse-coded teachers’ responses for each practice, with 1 being never and 4 being every or almost every lesson. Note that these variables measure only the frequency of these practices, not their quality.
For instructional time, I used a variable that measures the number of minutes spent on teaching mathematics to the students in the TIMSS class per week. I transformed minutes into hours. For content coverage, I used teachers’ responses to sets of questions about content coverage in all four content areas in the content domain (i.e., number, algebra, geometry, and data and chance). Each content area has multiple topics, and teachers were asked to indicate whether a topic in each content area was mostly taught before this year (i.e., 2014–2015), was mostly taught this year, or was not yet taught or just introduced. There are 20 topics across the four content areas. I created an indicator variable for each topic that takes a value of one if the topic was taught before this year or mostly taught this year and a value of zero otherwise. Then, I took the average of the 20 indicator variables and expressed it as a percentage of content in the TIMSS assessment taught by this year. 15
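The two transformations described above can be sketched in Python as follows. The response labels and the example values are hypothetical; only the logic (20 topic indicators averaged into a percentage, and minutes converted to hours) follows the text.

```python
# Hypothetical labels for the responses that count as "taught".
TAUGHT = {"before_this_year", "this_year"}


def content_coverage_percent(topic_responses):
    """Share of topics taught before or during this year, as a percentage."""
    indicators = [1 if r in TAUGHT else 0 for r in topic_responses]
    return 100 * sum(indicators) / len(indicators)


def minutes_to_hours(minutes_per_week):
    """Convert weekly instructional minutes into hours."""
    return minutes_per_week / 60


# Hypothetical teacher: 16 of the 20 topics have been taught.
responses = ["this_year"] * 12 + ["before_this_year"] * 4 + ["not_yet"] * 4
coverage = content_coverage_percent(responses)
hours = minutes_to_hours(270)
```

In this example, the teacher has covered 80% of the assessed content and teaches mathematics for 4.5 hours per week.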
Student and Teacher Characteristics
In the analysis that follows, I also used sets of basic student and teacher characteristics as controls. Student characteristics include age, sex, the number of days absent from school, how often English is spoken at home, ever repeated a grade in elementary school, ever repeated a grade in middle or junior high school, special accommodation provided during the mathematics assessment, and students’ confidence in mathematics. Student confidence relates to aptitude, ability to understand instruction, and perseverance in Carroll’s (1963) model. The confidence variable was constructed by the TIMSS project staff based on the Rasch partial credit model. 16 To remove the influence of academic programs and support not provided by the school, I used an indicator variable for extra mathematics lessons and tutoring.
Teacher characteristics include age, sex, highest educational attainment, years of teaching experience, college major, the number of hours spent on professional development in the past 2 years, the number of content areas that professional development covered, and job satisfaction. The job satisfaction variable was created by the TIMSS project staff based on the Rasch partial credit model.
As explained in the method section, I did not use school-level variables. Instead, I used school fixed effects to control for both observable and unobservable school-level factors. Table 1 reports descriptive statistics on students and teachers in the analytic sample.
Table 1. Descriptive Statistics
Note. SES = socioeconomic status; TIMSS = Trends in International Mathematics and Science Study. An asterisk indicates the scale with 1 being never and 4 being every or almost every lesson. These statistics were estimated based on the analytic sample used for the subsequent analyses. An unconditional approach was used (West et al., 2008). Student and math teacher weights (linear transformation of total student weights) were used to estimate means and standard deviations. Although not reported, standard errors were estimated by the Jackknife repeated replicate sampling variance estimation method. By survey design, statistics on teachers do not represent a teacher population.
Source. U.S. Department of Education, National Center for Education Statistics, TIMSS 2015.
Method
I first estimated raw test score gaps in the reasoning domain by race/ethnicity and SES separately using ordinary least squares regression and then reestimated the gaps including both characteristics in a single linear regression model. 17, 18 After that, I reestimated the gaps, controlling for student characteristics. The model takes the following form:
where
Next, to estimate the relationships between the instructional practices and the test scores in the reasoning domain, I estimated a series of regression models that include instructional practices, instructional time, content coverage, student controls, teacher controls, and school fixed effects. I included school fixed effects to remove the effect of observable and unobservable between-school factors that affect both the instructional practices and the test scores. In this sense, I utilized variation in the instructional practices within schools to estimate the associations. 19 The main model takes the following form:
where
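The idea of using only within-school variation can be illustrated with a minimal Python sketch. It assumes a single predictor, no controls, no sampling weights, and no plausible values, so it is far simpler than the models estimated in the study; all names and data are hypothetical.

```python
from collections import defaultdict


def demean_within(values, groups):
    """Subtract each group's mean from its members' values."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def within_school_slope(x, y, school):
    """Slope of y on x after removing school means (fixed effects)."""
    xd = demean_within(x, school)
    yd = demean_within(y, school)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)


# Hypothetical data: school B scores higher overall, but the
# within-school relationship is the same in both schools.
school = ["A", "A", "A", "B", "B", "B"]
practice = [1, 2, 3, 2, 3, 4]
score = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0]
slope = within_school_slope(practice, score, school)
```

Demeaning within schools gives the same slope as including one indicator variable per school, and it removes any school-level factor that shifts both the practice and the score for everyone in that school.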
Results
Test Score Gaps in Reasoning in Eighth-Grade Mathematics
Figure 3 plots the coefficient estimates and their 95% confidence intervals on the SES and race/ethnicity variables from the four sequential linear regression models described in the method section. Figure 3a plots estimates on the SES quintile variables; Figure 3b plots estimates on the race/ethnicity variables; and the last two figures (c) and (d) plot estimates on the SES quintile variables and the race/ethnicity variables. The estimates in Figure 3d come from Equation 1.

Figure 3. Test score gaps in the reasoning domain.
These figures show sizable test score gaps in all cognitive domains. For example, the raw gaps between the lowest and highest SES students were all close to 1 SD. The gaps remained sizable, at close to one half of 1 SD, even after regression adjustments. The raw gaps between Black and White students ranged from 0.80 SD to 0.91 SD and remained between 0.61 SD and 0.70 SD after regression adjustments. The raw gaps between Hispanic and White students were the smallest among the three types of comparisons. These gaps were slightly over one half of 1 SD and decreased to around 0.30 SD after regression adjustments.
Among the three cognitive domains, gaps in the applying and reasoning domains were generally larger than those in the knowing domain, except for the Hispanic–White gaps, and this pattern was more pronounced in the Black–White gaps. In Figure 3d, the Black–White gap was 0.61 SD in the knowing domain, whereas it was 0.70 SD in the applying domain and 0.66 SD in the reasoning domain.
Associations Between Instructional Practices and Test Scores in the Reasoning Domain
Table 2 presents estimation results on the instructional practices as well as instructional time and content coverage from Equation (2). It also reports results for the knowing and applying domains as references.
Table 2. Relationships Between Instructional Practices and Test Scores in the Reasoning Domain
Note. Mathematics teacher weights (linear transformation of total student weights) were used for all models. All plausible values were standardized. Standard errors were estimated by the Jackknife repeated replicate sampling variance estimation method combined with an unconditional approach (West et al., 2008). The unconditional approach uses the entire sample to estimate the standard errors. Sample sizes were rounded to the nearest ten due to National Center for Education Statistics nondisclosure policies.
*p < .10. **p < .05. ***p < .01.
Source. U.S. Department of Education, National Center for Education Statistics, Trends in International Mathematics and Science Study 2015.
Across the three cognitive domains, engaging instructional practices, same ability group work, and working on problems for which there is no immediately obvious method of solution were positively associated with the test scores. Particularly, the same ability group work had a strong, positive relationship with the test scores. Students who worked in the same ability groups in every or almost every lesson scored 0.85 SD higher than students who never worked in the same ability groups. In sharp contrast, mixed ability group work was negatively correlated with the test scores. Students whose teachers used mixed ability group work in every or almost every lesson scored 0.64 SD lower in the reasoning domain than students who never worked in mixed ability groups.
The coefficient sizes on engaging instructional practices were much smaller than those on the other instructional practices. Since the variable was composed of key instructional practices that underlie many instructional approaches theoretically linked to HOTS, some of the practices might have been positively correlated with the test scores, whereas others could have been negatively correlated. This pattern might have attenuated the overall positive relationship. The result could also reflect poor quality of the instructional practices, as the data captured the frequency of the practices, not their quality. Finally, it may simply mean that some students learn more from direct instruction, or from a combination of direct instruction and student-centered instruction, than from student-centered instruction alone. In direct instruction, teachers carefully sequence instructions in the appropriate logical order with well-designed examples to help students make the correct inference for a new concept (Stockard et al., 2018). I explored the first possibility and reported the results in the online supplement; the results were consistent with it. I could not examine the other two possibilities due to a lack of data.
Another notable pattern is that the coefficient sizes on each of the instructional practices tended to be similar across the three domains. This pattern appears to suggest that HOTS may be content-specific, rather than generic. That is, engaging students in HOT requires content knowledge as well as application skills, resulting in similar coefficient sizes (for the debate about subject specificity of HOTS or critical thinking, see the work by Bailin et al. [1999], Ennis [1987, 1989], Facione [1990], and McPeck [1981]).
Outside of the instructional practices, instructional time was negatively associated with the test scores, whereas content coverage was positively related to the test scores. The negative association is counterintuitive; yet, the subgroup analysis in the next subsection suggests that it might have resulted from academic tracking.
Subgroup Analysis of the Associations
In this subsection, I analyzed the relationships between the instructional practices and the test scores in the reasoning domain for each student subgroup to identify instructional practices that are associated with the test scores; yet, the discussion focuses on the lowest SES students, and Black and Hispanic students. Table 3 presents the estimation results on each student subgroup from Equation 2. I reported estimation results for the knowing and applying domains in Appendix Tables 5 and 6.
Table 3. Subgroup Analysis of the Relationships
Note. Q1= first quintile; Q5 = fifth quintile; TIMSS = Trends in International Mathematics and Science Study; SES = socioeconomic status. For instructional practices with an asterisk, the reference category was never, unless noted otherwise in the table. Mathematics teacher weights (linear transformation of total student weights) were used for all models. All plausible values were standardized. Standard errors were estimated by the Jackknife repeated replicate sampling variance estimation method combined with an unconditional approach (West et al., 2008). The unconditional approach uses the entire sample to estimate the standard errors. Sample sizes were rounded to the nearest ten due to National Center for Education Statistics nondisclosure policies.
*p < .10. **p < .05. ***p < .01.
Source. U.S. Department of Education, National Center for Education Statistics, TIMSS 2015.
The table shows that the same ability group work was strongly, positively associated with the test scores across all subgroups except Asian students, particularly among the lowest SES students and Hispanic students. For example, among the lowest SES students, those who worked in the same ability groups in every or almost every lesson performed 1.76 SD higher than those who never worked in the same ability groups. The performance difference was 1.57 SD for Hispanic students.
The coefficients on the other instructional practices varied from subgroup to subgroup. For example, Black and Hispanic students who worked on problems with teacher guidance in every or almost every lesson had higher test scores than those who never did; yet, I did not observe such a relationship among the lowest SES students. On the other hand, the lowest SES students who were asked to work on problems for which there is no immediately obvious method of solution in every or almost every lesson scored higher than those who never worked on them; such a relationship was not observed among Hispanic students. For Black students, the relationship was negative. Similarly, mixed ability group work exhibited a negative relationship among Black students but no relationship among the lowest SES students or Hispanic students. I observed similar patterns for engaging instructional practices.
Instructional time was found uncorrelated with the test scores among the lowest SES students and Black students. On the other hand, more instructional time was negatively related to the test scores among Hispanic students. A 1-hour increase in the instructional time was associated with a decline of 0.20 SD in the test score. Content coverage was positively associated with the test scores among the lowest SES students and Hispanic students but not among Black students.
Discussions and Conclusion
The current literature provides a great amount of evidence on test score gaps among student subgroups. However, most of them have focused on subject-level test score gaps, and little attention has been paid to test score gaps in HOTS. This study contributes to the current literature by examining test score gaps in the reasoning domain in mathematics and exploring instructional practices that may be positively associated with the test scores.
I found wide test score gaps in the reasoning domain by SES and race/ethnicity, even after regression adjustments. The gaps were particularly large between Black and White students, ranging from 0.61 SD to 0.70 SD. Although a direct comparison is not possible due to differences in methods and grades, these regression-adjusted test score gaps appear to be larger than what previous studies found at the subject level (e.g., Clotfelter et al., 2009; Fryer & Levitt, 2004, 2006). Given the rising demand for HOTS in labor markets, this pattern is concerning and suggests that policy makers and practitioners should further develop curricula and instructional practices to narrow the gaps.
I explored some instructional practices that prior work suggests may be associated with HOTS. I found that the same ability group work was strongly, positively associated with the test scores for all student subgroups except Asian students. In contrast, mixed ability group work was negatively associated with the test scores for Black students. No relationship was found among the lowest SES students and Hispanic students. This finding appears to disagree with what researchers suggest regarding ability grouping. Yet, controversy exists regarding the efficacy of same and mixed ability grouping. A recent meta-analysis found that within-class ability grouping has at least small, positive, and significant impacts on student performance (Steenbergen-Hu et al., 2016). Although I cannot explore the mechanism due to a lack of data, Slavin (2011) argues that the type of group goals is important in group work. For example, if the teacher requires a single group product instead of individual products, more able students may do most of the group work. In this group work structure, the same ability group work may have a positive association with the test scores, but mixed ability group work could have a negative relationship. Further exploration with data on the type of group work is necessary to draw conclusions about these relationships.
Working on problems (individually or with peers) with teacher guidance was also found to be positively associated with the test scores among Black and Hispanic students. The TIMSS data do not provide information about what kind of teacher guidance was provided to the students or how it was provided; yet, the relationship may still underscore the importance of teacher guidance or scaffolding, which many studies have found to be positively correlated with student collaboration (e.g., van Leeuwen & Janssen, 2019).
In contrast, working on problems where there is no immediately obvious method of solution was found positively correlated with the test scores among all subgroups but Black and Hispanic students. In particular, the relationship was negative among Black students. Given that working on problems with teacher guidance was positively correlated with HOTS among Black and Hispanic students, this relationship may turn positive when enough scaffolding and teacher guidance are provided.
Although the focus of the second part of the analysis was on instructional practices, instructional time and content coverage are also important factors in student learning. Instructional time was found negatively correlated with the test scores for the full student sample. By student subgroup, it exhibited a positive relationship among the highest SES students and White students, whereas it had a negative relationship among Hispanic students. Although less precisely estimated, the relationship among Black students was also negative. These results may be explained by academic tracking, given that Black and Hispanic students are more likely to be placed in basic classes. That is, the relationship was positive for the highest SES students and White students perhaps because many of them were in advanced mathematics classes; the relationship was negative for Black and Hispanic students perhaps because many of them were placed in basic classes. Content coverage was consistently positively associated with the test scores among all subgroups but Black students. The magnitudes of these relationships appear to be smaller than those found for the instructional practices. Contrary to the hypotheses stated earlier, these two factors did not moderate the relationships between the instructional practices and the test scores in a meaningful way (see the online supplement).
It is important to note that, although the subgroup analysis found that engaging instructional practice was generally negatively or insignificantly associated with the test scores, further investigation reported in the online supplement revealed that some practices were positively correlated with the test scores. For example, encouraging students to express their ideas in class and asking students to explain their ideas had positive associations with the test scores among the lowest SES students and Hispanic students. All of these findings were robust to the possibility that students might have been less serious about taking the TIMSS assessment because it was a low-stakes assessment (see the online supplement).
This study faces several limitations. First, although my analysis found wide test score gaps in the reasoning domain and identified instructional practices that were positively associated with the test scores among low-performing subgroups, the method was generally descriptive and did not make a causal inference. Although every effort was made, the estimated coefficients are still subject to potential bias. Second, all instructional practice variables were based on the frequencies, not the quality. Insignificant or negative findings on some instructional practices, therefore, do not necessarily mean that they are not recommended in the classroom. Third, data on instructional practices came from teacher self-reports. Some responses may not have accurately reflected their actual practices and contained measurement error, which might have attenuated the coefficients on some variables. Finally, although not the focus of this study, instructional time did not measure how much time was spent on HOT activities, and the distribution of time across the three cognitive domains was unknown. The coefficients on instructional time as well as its moderation effects (see the online supplement) should be interpreted with caution.
Even with these limitations, this study makes a significant contribution to the current literature and provides direction for future research. The study would benefit from future research that collects richer data on instructional practices that may be directly related to the instructional approaches described earlier. It would also benefit from studies that examine test score gaps in HOTS in reading and science in the same and different grades. Finally, research on whether and how HOTS acquired in K–12 systems lead to success in later adult outcomes would help practitioners seamlessly align educational standards and goals from PreK–12 to higher education to career.
Supplemental Material
Supplemental material, sj-docx-1-ero-10.1177_23328584211016470 for Test Score Gaps in Higher Order Thinking Skills: Exploring Instructional Practices to Improve the Skills and Narrow the Gaps by Hajime Mitani in AERA Open
Appendix
Subgroup Analysis of the Relationships – Applying
| | SES 1st Quintile | SES 5th Quintile | Black | Hispanic | White | Asian |
|---|---|---|---|---|---|---|
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
| Instructional strategies to improve HOTS | | | | | | |
| Engaging instructional practice | −0.06 | −0.37* | −0.17 | −0.22*** | −0.12** | −1.04* |
| | (0.05) | (0.20) | (0.12) | (0.04) | (0.05) | (0.63) |
| Small group work - same ability group* | | | | | | |
| Some lessons | 0.44*** | −0.34 | −0.99** | −0.11 | −0.09 | −3.72*** |
| | (0.07) | (0.23) | (0.43) | (0.09) | (0.06) | (0.86) |
| Almost half the lessons | 0.46*** | 0.09 | −0.88 | 0.32*** | −0.03 | −0.95 |
| | (0.11) | (0.13) | (0.76) | (0.03) | (0.09) | (0.88) |
| Every or almost every lesson | 1.72*** | 1.00*** | 0.25 | 1.47*** | 0.86*** | −4.79** |
| | (0.24) | (0.14) | (0.21) | (0.07) | (0.09) | (2.37) |
| Small group work - mixed ability group* | | | | | | |
| Some lessons | −0.37 | 0.02 | −1.06** | −0.01 | −0.24** | 1.21*** |
| | (0.27) | (0.20) | (0.47) | (0.26) | (0.10) | (0.13) |
| Almost half the lessons | −0.79** | −0.29* | −0.90 | −0.60*** | −0.52*** | −2.91** |
| | (0.36) | (0.16) | (0.64) | (0.23) | (0.10) | (1.21) |
| Every or almost every lesson | −0.73** | −0.45*** | −1.84*** | −0.53* | −0.56*** | 0.39 |
| | (0.36) | (0.17) | (0.62) | (0.29) | (0.10) | (0.57) |
| Work problems with teacher guidance* | | | | | | |
| Some lessons | −0.18 | Reference | Reference | Reference | 0.10 | Reference |
| | (0.81) | | | | (0.43) | |
| Almost half the lessons | 0.39 | 0.68*** | −0.26 | 0.45** | 0.69 | 3.22** |
| | (0.87) | (0.18) | (0.35) | (0.19) | (0.44) | (1.31) |
| Every or almost every lesson | 0.44 | 0.76*** | 0.38 | 0.50*** | 0.73* | −0.22 |
| | (0.86) | (0.08) | (0.25) | (0.15) | (0.44) | (0.49) |
| Work problems with no immediately obvious method of solution* | | | | | | |
| Some lessons | −0.21* | −0.18 | −0.31* | −0.12** | 0.12* | −0.12 |
| | (0.11) | (0.30) | (0.17) | (0.05) | (0.07) | (0.63) |
| Almost half the lessons | −0.17 | −0.22 | −0.26 | 0.11 | −0.02 | 3.62* |
| | (0.14) | (0.28) | (0.16) | (0.15) | (0.05) | (1.96) |
| Every or almost every lesson | 0.47*** | 1.20*** | −2.02*** | 0.38* | 0.65*** | 4.34** |
| | (0.13) | (0.34) | (0.73) | (0.21) | (0.10) | (2.15) |
| Instructional time | | | | | | |
| Total hours spent on math instruction per week | 0.02 | 0.12* | −0.10** | −0.19*** | 0.06*** | 0.23 |
| | (0.04) | (0.07) | (0.04) | (0.03) | (0.02) | (0.26) |
| Content coverage | | | | | | |
| Percent of content in the TIMSS assessment taught by this year | 0.03*** | 0.04*** | 0.01 | 0.02*** | 0.02*** | 0.15*** |
| | (0.00) | (0.00) | (0.00) | (0.00) | (0.00) | (0.05) |
| Constant | −5.85*** | −0.43 | −3.40*** | −2.09*** | −2.97*** | −9.56* |
| | (0.98) | (2.34) | (1.30) | (0.62) | (0.59) | (5.68) |
| Student controls | X | X | X | X | X | X |
| Teacher controls | X | X | X | X | X | X |
| School fixed effects | X | X | X | X | X | X |
| Observations | 1,130 | 1,370 | 740 | 1,700 | 3,130 | 280 |
Note. For instructional practices with an asterisk, the reference category was never, unless noted otherwise in the table. Mathematics teacher weights (linear transformation of total student weights) were used for all models. All plausible values were standardized. Standard errors were estimated by the Jackknife Repeated Replicate sampling variance estimation method combined with an unconditional approach (West et al., 2008). The unconditional approach uses the entire sample to estimate the standard errors. Sample sizes were rounded to the nearest ten due to NCES non-disclosure policies.
* p<0.10, ** p<0.05, *** p<0.01.
Source: U.S. Department of Education, National Center for Education Statistics, Trends in International Mathematics and Science Study (TIMSS) 2015.
Author Note
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author
HAJIME MITANI is an assistant professor of educational leadership at Rowan University. His research interests include test score gaps, critical thinking skills, leadership skill requirements, leadership skill development, educator and leader preparation, education policy analysis, and program evaluation.