Abstract
This study develops a systematic, theory-driven evaluation index system to address the critical gap in assessing the implementation effectiveness of China’s “Double Reduction” policy. Grounded in Policy Implementation Theory and Educational Policy Evaluation Frameworks, the study employed a mixed-methods approach involving policy text analysis of five core documents and a two-round Delphi consultation with 23 experts (authority coefficient Cr = 0.7928). Index weights were determined through comprehensive weighting (AHP and Entropy Weight Method), and the system was empirically validated via fuzzy comprehensive evaluation with data from 163 teachers across four Chinese provinces. The resulting framework comprises 2 first-grade, 5 second-grade, and 23 third-grade indexes. Empirical results indicate an overall “good” policy implementation effect (score = 72.96/90), with “Quality of Education and Teaching” (weight = 0.2856) and “Level of After-School Services” (weight = 0.2269) as the most influential indicators. However, weaker performance was observed in reducing homework burden and promoting balanced compulsory education. This research provides the first unified evaluation tool for the “Double Reduction” policy, offering both a practical instrument for local Chinese governance and a replicable model for other regions grappling with the challenge of balancing student burden reduction and educational quality improvement.
Plain Language Summary
China’s “Double Reduction” policy, started in 2021, aims to cut students’ heavy homework and unregulated after-school tutoring. This study created a clear system to check how well the policy works, using expert advice and surveys of teachers in four provinces (Jiangsu, Shandong, Shanxi, Anhui). Overall, the policy has done well (scoring 72.96 out of 90), with better school teaching quality and after-school services being the biggest wins. However, there are gaps: too much homework is still a problem, and compulsory education isn’t equally good across areas. Also, after-school services need better conditions and more options. This study helps Chinese local governments improve the policy and gives other countries (like those with big tutoring industries) a model to check their own student burden-cutting efforts. It uses simple, fair methods to measure success, making it easy to understand even for people not in education research.
Keywords
Introduction
In recent years, the excessive burden of homework and extracurricular academic tutoring on primary and secondary school students in China has distorted the goals of education reform, fostering short-sighted and utilitarian approaches to learning while triggering widespread social anxiety around education (Huang et al., 2021; Qi et al., 2023). As parenting concepts, methods, and behaviors converge around the world, the anxiety caused by excessive involvement in children’s education has spread globally, especially among middle-class parents (Ehrenreich, 1990). To address this crisis, in July 2021, the General Office of the Central Committee of the Communist Party of China (CCCPC) and the State Council issued the Opinions on Further Easing the Burden of Excessive Homework and Off-campus Tutoring for Students Undergoing Compulsory Education, widely known as the “Double Reduction” policy. The policy’s core objective is to reduce two key sources of student overload: excessive homework assigned by schools and unregulated off-campus academic tutoring (Zheng & Zhou, 2021). Its core values are student-centered, quality-based education and home-school cooperative education (Xue & Li, 2023). To achieve this, it employs targeted instruments: strengthening school-based after-school service programs (e.g., extended hours, skill-building activities) to replace off-campus tutoring, regulating the registration, fees, and advertising of tutoring institutions, and reforming classroom teaching to improve in-school efficiency (She et al., 2022). By 2024, China’s Ministry of Education reported that 3 years of policy implementation had achieved “Double Reduction and double increase” (i.e., reduced burden alongside improved educational quality and after-school services) and pledged to consolidate these outcomes.
General Secretary Xi Jinping further emphasized the need to sustain progress at the National Education Conference, underscoring the policy’s role as a cornerstone of China’s efforts to build a modern, Chinese-style education system.
Establishing a rigorous evaluation index system for the “Double Reduction” policy is of utmost importance for scientifically appraising its implementation, optimizing policy measures, and enhancing public confidence. Nevertheless, extant academic research on the policy’s effectiveness is plagued by two crucial limitations, which mirror more extensive deficiencies in global educational policy evaluation. At the domestic level, studies predominantly rely on “self-constructed evaluation scales” centered on single perspectives (e.g., parental satisfaction or homework quantity). There are no unified standards to elucidate the hierarchical relationships among indexes (L. Zhou, 2023). This disorganization frequently confounds process variables (e.g., “implementation organizational quality”) with outcome variables (e.g., “target group perception”), resulting in inconsistent and non-comparable evaluations.
On a global scale, existing frameworks for educational policy evaluation, such as the OECD’s emphasis on equity and student well-being in its Education Policy Outlook (OECD, 2022) and the U.S. Every Student Succeeds Act (ESSA)’s stress on standardized test performance and school accountability, present a significant shortcoming. These models are deficient in specialized instruments for evaluating “burden reduction” policies in non-Western, developing, or culturally diverse contexts (Z. Zhou et al., 2023). For example, OECD frameworks give precedence to overall educational quality rather than addressing the distinctive problem of “over-tutoring” (a prevalent phenomenon in East Asia, South Asia, and certain regions of Southeast Asia). Meanwhile, ESSA’s accountability metrics fail to consider the role of unregulated private tutoring in intensifying educational inequality. This global lacuna implies that policymakers in non-Western settings lack adaptable and culturally appropriate models to assess policies aimed at alleviating student overload, thereby impeding the replication of successful burden-reduction initiatives.
Against this backdrop, this study anchors its design in Policy Instrument Theory—a framework that examines how policy tools (e.g., regulation, service provision, collaboration) shape implementation outcomes (Howlett, 2009). By applying this theory, we move beyond descriptive evaluations of the “Double Reduction” policy to explain how its instruments (e.g., tutoring regulation, after-school services) interact with local contexts to influence effectiveness—a contribution to localized policy evaluation models that can inform non-Western contexts. Methodologically, we first code and analyze five core “Double Reduction” policy texts (see Table 1) to develop a systematic, theory-driven evaluation index system. We then use the fuzzy comprehensive evaluation method to empirically assess policy effectiveness across four Chinese regions (Jiangsu, Shandong, Shanxi, and Anhui), leveraging primary and secondary school teachers as key informants (they are direct implementers and witnesses to student-level changes).
Table 1. Selected Policy Texts.
The significance of this study is two-pronged. In practical terms, it furnishes Chinese local governments with a scientific instrument for the refinement of the “Double Reduction” policy. From a global perspective, it presents a replicable model for the evaluation of burden-reduction policies in non-Western contexts, thereby filling the void in existing international frameworks. Through these efforts, the study aims to make contributions to both China’s educational reform and the global endeavors to rebalance education in favor of students’ well-being and equitable development.
Literature Review
Policy Connotation of the “Double Reduction” Policy: Domestic Evolution and Global Context
Understanding the connotation of the “Double Reduction” policy is a fundamental prerequisite for constructing its evaluation index system. Domestically, research focus on this connotation has shifted from examining short-term implementation effects to advancing systematic policy evaluation. In the early stages post-policy launch (2021), scholars emphasized its strategic significance as an effort to reset educational paradigms—framing it as a breakthrough in addressing “burden reduction as a focal goal” while reflecting policy legitimacy and practical innovation (Wang, 2021; X. J. Zhang et al., 2023). For instance, Wang (2021) argued that the policy aligns with educational equity principles by curbing utilitarian tutoring, while X. J. Zhang et al. (2023) highlighted its role in rebalancing school, family, and social responsibilities in education. As implementation progressed, Zhu (2021) expanded this discourse by proposing a “symptom-root cause” evaluation lens: symptoms include reduced student burden and improved in-school efficiency, while root causes target structural issues like regional educational inequity—a distinction that guided early domestic evaluation attempts.
From a global perspective, the “Double Reduction” policy aligns with international efforts to mitigate educational overload, yet its design reflects unique contextual characteristics. For example, South Korea’s 2000s cram school (hagwon) regulations focused narrowly on limiting tutoring hours and standardizing fees to reduce student stress, but lacked China’s emphasis on strengthening in-school alternatives (e.g., after-school services) to replace off-campus tutoring (Choi & Choi, 2016). In contrast, Finland’s “less-is-more” reforms centered on curriculum streamlining and student-centered learning to reduce academic pressure, prioritizing qualitative educational improvement over direct regulation of private services (Sahlberg, 2021). These global cases highlight that burden-reduction policies typically prioritize either “supply-side regulation” (e.g., South Korea) or “in-school quality enhancement” (e.g., Finland). China’s “Double Reduction” is distinctive in integrating both: it targets both excessive homework (in-school) and unregulated tutoring (out-of-school) while linking burden reduction to compulsory education quality—filling a gap between single-focus global models.
Evaluation Dimensions of Educational Burden-Reduction Policies: International Models and Domestic Gaps
Clarifying effective evaluation dimensions is critical for assessing policy impact, yet both domestic and international research exhibit notable limitations. Internationally, classic evaluation models have guided educational policy assessment but struggle to address the unique demands of burden-reduction policies. The CIPP Model (Context-Input-Process-Product; Stufflebeam, 2003), a widely used framework for educational evaluation, emphasizes systematic assessment of policy context, resources, implementation processes, and outcomes. However, its process-oriented design prioritizes “whether policies are implemented as planned” over “whether burden reduction improves student well-being or equity”—a core goal of policies like China’s “Double Reduction.” Similarly, Kirkpatrick’s Four-Level Evaluation Model (Kirkpatrick & Kirkpatrick, 2016), which evaluates training programs via reaction, learning, behavior, and results, is ill-suited for large-scale educational policies: it focuses on individual participant outcomes (e.g., student test scores) rather than systemic changes (e.g., regional educational balance) that define burden-reduction success.
Domestic research on “Double Reduction” evaluation dimensions, while evolving, remains fragmented. Early studies focused on isolated indicators: Qi et al. (2023) measured homework burden reduction via student survey data, while Yang (2023) assessed tutoring regulation effectiveness through institutional compliance rates. Zhu (2021) expanded this to include “symptom-root cause” dimensions but did not integrate these into a unified framework. A key gap is that domestic studies rarely engage with international models to address their own limitations—for example, no domestic research has adapted the CIPP Model’s “product” dimension to measure holistic outcomes (e.g., parental anxiety reduction, student well-being) or addressed Kirkpatrick’s blind spot in systemic evaluation. This disconnect leaves domestic evaluations without a globally informed, comprehensive lens for assessing the “Double Reduction” policy’s multi-faceted impact.
Methodological Limitations in “Double Reduction” Evaluation: Domestic Shortcomings and International Misalignments
The methodological rigor of policy evaluation directly impacts result validity, yet both domestic and international research on educational burden reduction faces challenges. Domestically, studies rely heavily on qualitative methods or self-constructed quantitative scales with limited standardization. Ye (2023) used in-depth interviews with primary school teachers to analyze classroom teaching changes under “Double Reduction,” while Liu (2023) employed questionnaires to measure tutoring demand—but neither study validated their tools against international standards or unified metrics. L. Zhou (2023) criticized this approach, noting that conflating process variables (e.g., implementation form) with outcome variables (e.g., target group perception) creates chaotic, incomparable results. Additionally, domestic studies rarely use comprehensive weighting or fuzzy evaluation methods to handle subjective data (e.g., parental satisfaction), leading to oversimplified conclusions about policy effectiveness.
Internationally, methodological limitations in burden-reduction policy evaluation stem from a misalignment with non-Western contexts. Studies on South Korea’s hagwon regulations (Bae & Choi, 2024) used quantitative data (e.g., tutoring participation rates) to measure success but ignored cultural factors like parental pressure to pursue supplementary education—factors critical to understanding policy impact in East Asia. Evaluations of Finland’s reforms (Sahlberg, 2021) relied on qualitative classroom observations, which are difficult to replicate in large, diverse education systems like China’s. A global methodological gap is the lack of tools that balance quantitative rigor with sensitivity to cultural context—particularly for policies that, like China’s “Double Reduction,” target both structural regulation and cultural shifts in educational expectations. This gap underscores the need for mixed-method approaches that integrate international best practices (e.g., Delphi expert validation) with context-specific adaptations (e.g., fuzzy comprehensive evaluation for subjective data).
Methods
Selection of Texts
Three criteria guided the selection of “Double Reduction” policy texts: authority (central government-issued documents with national dissemination), professionalism (compiled by expert teams from the Ministry of Education), and symbolism (reflecting public opinion via democratic decision-making). Five core texts were selected (Table 1), with additional considerations: (1) alignment with the policy’s official 2021 launch timeline; (2) text 1 as the overarching framework, text 2 as a focus on tutoring governance, and texts 3 to 5 as post-implementation effectiveness references to inform index design.
The five policy texts were imported into NVivo 12 for systematic coding, following three sequential steps to ensure rigor: (1) open coding, which abstracted core concepts via keywords (e.g., “homework management,” “tutoring supervision”) to generate 47 initial concepts; (2) axial coding, which consolidated highly similar initial concepts into 23 categories (e.g., “homework amount,” “after-school service quality”) by identifying causal and associative relationships; and (3) selective coding, which integrated the 23 categories into five main categories (“total homework amount and duration,” “normative degree of tutoring,” “quality of education and teaching,” “degree of collaborative linkage,” and “level of after-school services”) and distilled two core categories (“degree of ‘Double Reduction’” and “effectiveness of supporting governance”) to form the three-tier index system (2 first-grade, 5 second-grade, 23 third-grade indexes; Table 3).
Coding consistency was verified using Cohen’s Kappa coefficient (Kappa = 0.87,
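For readers who wish to reproduce a coding-consistency check of this kind, the following is a minimal sketch of Cohen's Kappa for two coders. It assumes that two coders independently assigned one category to each text segment; the number of coders, the category labels, and the example data are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same text segments."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of segments")
    n = len(coder_a)
    # Observed agreement: share of segments given the same category by both coders.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over categories.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten coded segments (illustrative only, not the study's data).
coder_a = ["homework", "tutoring", "teaching", "homework", "service",
           "tutoring", "teaching", "service", "homework", "tutoring"]
coder_b = ["homework", "tutoring", "teaching", "homework", "service",
           "tutoring", "service", "service", "homework", "tutoring"]
kappa = cohens_kappa(coder_a, coder_b)
```

Values above roughly 0.8 are conventionally read as strong agreement, which is the range in which the reported coefficient falls.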
Basic Information on Experts
According to the requirements of this study and considering the interests associated with the “Double Reduction” policy, education scholars, heads of education management departments, school administrators, and parents were selected as expert sources for the research. The evaluation of the policy indexes is highly specialized; therefore, the study developed a detailed plan for expert selection to ensure the authority, scientific rigor, and validity of the results. Education scholars were chosen based on their senior academic titles, and all possess doctoral degrees. Additionally, several experts have led national key projects as well as provincial and ministerial initiatives related to the “Double Reduction” policy and have published academic findings on relevant topics. Other experts specialize in family education, school governance, and educational evaluation, all of whom can contribute valuable insights to the assessment of the “Double Reduction” policy. The heads of city and county-level education bureaus and the directors of education supervision offices were selected from the education management sector. Grassroots education departments are the primary entities responsible for implementing and evaluating the policy, so their assessments serve as significant references. Schools are the main implementers of the “Double Reduction” policy and are directly involved in assessing its effectiveness. Parents are the primary beneficiaries of the “Double Reduction” policy, and selecting highly educated parents can provide a more informed perspective for evaluating the policy. A total of 23 expert consultation questionnaires were distributed online, and all 23 were returned as valid responses (100% effective response rate). Table 2 presents the basic information of the experts.
The high response rate is attributable to: (1) pre-consultation communication to confirm expert availability; (2) a concise policy brief that reduced response burden; and (3) 2-week response windows with one reminder. Expert authority was validated via Cr = 0.7928 (exceeding the acceptable threshold of Cr ≥ 0.70 under Zeng Guang’s standard), supporting index validity.
Table 2. Basic Information on Experts.
Empirical Measurement Sample Selection
To assess the accuracy and feasibility of the evaluation index system for the implementation effectiveness of the “Double Reduction” policy, this study uses Jiangsu (high economic development, top-tier compulsory education quality), Shandong (large education scale, balanced urban-rural development), Shanxi (mid-income province, ongoing educational resource optimization), and Anhui (rapidly developing, with regional disparities in education access) as case studies. Primary and secondary school teachers from these four regions were surveyed, and the fuzzy comprehensive evaluation method (C. Zhang et al., 2020) was applied to their responses to measure the effectiveness of the policy’s implementation. As the direct implementers of the policy and the primary contacts with students, these teachers possess firsthand insights into the realities of implementation and the resulting changes in students, which makes them appropriate survey subjects. A total of 166 teachers were surveyed online, yielding 163 valid responses (98.19% effective rate). Pre-survey reliability was tested via Cronbach’s α = .82, indicating high internal consistency of questionnaire items (Nunnally, 1978).
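The reliability and response-rate figures above can be checked with a short calculation. The sketch below computes Cronbach's α from a respondents-by-items table using the standard variance-based formula and reproduces the effective rate from the reported counts; the pilot responses are hypothetical and the questionnaire's actual item structure is not assumed.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items table of Likert ratings."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) and of each respondent's total score.
    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical pilot responses: 5 teachers x 3 items (illustrative only).
pilot = [[4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2], [4, 4, 5]]
alpha = cronbach_alpha(pilot)

# Effective rate from the counts reported in the text: valid / surveyed, in percent.
effective_rate = round(163 / 166 * 100, 2)
```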
Ethical Approval and Procedures
This study was approved by the Institutional Review Board of the author’s University. All procedures were performed in accordance with the 1964 Helsinki Declaration and its later amendments. The study design minimized risks to participants by ensuring anonymity (no personally identifiable information was collected), allowing withdrawal at any time without penalty, and avoiding sensitive topics. The potential benefits of developing a systematic policy evaluation framework for the “Double Reduction” policy were deemed to outweigh the minimal risks to participants. Informed consent was obtained from all participants prior to their involvement. For experts, educational management staff, and teachers, consent was obtained electronically. For parent participants, verbal consent was obtained and documented, with all participants being fully informed of the study’s purpose, procedures, and data usage.
Results
Construction of the Evaluation Index System
Through NVivo 12 coding (open → axial → selective) and Delphi expert validation, a three-tier evaluation index system was finalized. The hierarchical structure diagram illustrates the relationships between core categories (first-grade indexes), main categories (second-grade indexes), and subcategories (third-grade indexes), replacing Table 3 for clearer visualization. The system comprises 2 first-grade indexes (“Degree of ‘Double Reduction’” and “Effectiveness of Supporting Governance”), 5 second-grade indexes, and 23 third-grade indexes. Coding consistency was confirmed via Cohen’s Kappa = 0.87 (
Final Coding Results.
Results of Expert Advice
In this study, the expert positivity coefficient was 1 (a 100% response rate), indicating a high level of concern and a positive attitude among experts in this field. Based on the expert authority coefficient assignment method proposed by Prof. Zeng Guang, the expert authority coefficient Cr was calculated to be 0.7928. Since the acceptable value is Cr ≥ 0.70, the expert authority of this consultation is relatively high, and the experts are well placed to evaluate the index system.
In this research, a five-point Likert scale, ranging from 5 to 1, was used to rate the importance of each index. This approach enabled the calculation of both the concentration of expert opinions (mean importance and full-score ratio) and the degree of coordination among them (coefficient of variation). Moreover, open-ended questions were incorporated into the consultation questionnaire, allowing experts to propose modifications to irrelevant existing indexes and to comment on missing indexes.
Following the consultation with experts, the analysis of the questionnaire data demonstrated that, according to expert opinions, the mean importance of the analyzed indexes exceeded 3.5. Additionally, over 95% of the indexes had a full-score ratio greater than 20%. The coefficient of variation (CV) of the analyzed indexes showed that more than 90% of the indexes had a CV of less than 0.25. Furthermore, the
Expert Open-Ended Comments.
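The screening statistics reported for the Delphi consultation (mean importance, full-score ratio, and coefficient of variation) can be sketched as follows. The thresholds mirror the figures quoted in the text (mean above 3.5, full-score ratio above 20%, CV below 0.25); treating them jointly as a retention rule is an assumption made for illustration, and the ratings shown are hypothetical.

```python
from statistics import mean, stdev


def screening_stats(ratings, full_score=5):
    """Mean importance, full-score ratio, and coefficient of variation for one index."""
    m = mean(ratings)
    full_ratio = sum(r == full_score for r in ratings) / len(ratings)
    cv = stdev(ratings) / m  # sample standard deviation divided by the mean
    return m, full_ratio, cv


def passes_screening(ratings, min_mean=3.5, min_full_ratio=0.20, max_cv=0.25):
    """Assumed retention rule: all three thresholds must be met."""
    m, full_ratio, cv = screening_stats(ratings)
    return m > min_mean and full_ratio > min_full_ratio and cv < max_cv


# Hypothetical ratings from 23 experts for two candidate indexes (illustrative only).
agreed_index = [5] * 10 + [4] * 10 + [3] * 3
contested_index = [5, 1, 5, 1, 3] * 4 + [2, 2, 2]

agreed_mean, agreed_full_ratio, agreed_cv = screening_stats(agreed_index)
```

A low coefficient of variation indicates that experts' ratings cluster tightly, so an index with a high mean but a high CV would still warrant review.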
Weighting Results
Determining Subjective Weights
This study adopts the analytic hierarchy process (AHP; Saaty, 1980) to calculate the subjective weights of indexes at all levels. AHP decomposes the elements that influence a decision into hierarchical levels and combines qualitative and quantitative analysis on that basis. The process of subjective weighting of the second-grade indexes is illustrated as follows:
First, based on the evaluation index system constructed above and the results of the expert consultation, the relative importance of the indexes at each level was compared pairwise, and a judgment matrix was constructed for each level using the ninefold scaling method (Table 5).
Table 5. Ninefold Scaling Method.
Experts’ ratings of the importance of the indexes at each level were collected on a five-point Likert scale, and the 5-point range was divided into 10 intervals of width 0.5 for pairwise comparison. If the ratings of two indexes are equal, both are scaled as 1; if the absolute rating difference lies in (0, 0.5], the more important index is scaled as 2 and the other as 1/2; if the absolute rating difference lies in (0.5, 1], the more important index is scaled as 3 and the other as 1/3; and so on, to construct the subjective weight judgment matrix
Where
The second is the consistency test. Using Matlab 2021, the matrix
From the formula
Conduct consistency tests:
Where
Table 6. RI Values.
The calculation shows that the judgment matrix of subjective weights for the second-grade indexes has good consistency and meets the consistency requirement. Table 7 presents the resulting subjective weights of the second-grade indexes. Following the same method, a subjective weight judgment matrix was constructed for each second-grade index and its subordinate third-grade indexes to obtain the subjective weights of the third-grade indexes.
Table 7. Results of Subjective Weighting of Second Grade Indexes.
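The subjective weighting procedure described in this subsection can be sketched in a few functions. The sketch maps differences in mean expert ratings onto the 1 to 9 scale in steps of 0.5, builds the reciprocal judgment matrix, approximates the principal eigenvector by power iteration, and computes the consistency ratio CR = CI / RI with CI = (λmax − n) / (n − 1). The study used Matlab; the Python version, the RI values (the commonly cited Saaty values, assumed to match the RI table), and the mean ratings are assumptions for illustration only.

```python
import math

# Commonly cited random-index values by matrix order (assumed, not copied from the paper).
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def scale_from_difference(diff):
    """Map a rating difference to the 1-9 scale: 0 -> 1, (0, 0.5] -> 2, (0.5, 1] -> 3, ..."""
    d = round(abs(diff), 6)  # rounding guards against floating-point noise at bin edges
    return 1 if d == 0 else min(9, 1 + math.ceil(d / 0.5))


def judgment_matrix(mean_ratings):
    """Reciprocal pairwise-comparison matrix built from mean importance ratings."""
    n = len(mean_ratings)
    a = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = scale_from_difference(mean_ratings[i] - mean_ratings[j])
            a[i][j] = float(s) if mean_ratings[i] >= mean_ratings[j] else 1.0 / s
    return a


def ahp_weights(a, iterations=200):
    """Principal-eigenvector weights by power iteration, plus the consistency ratio."""
    n = len(a)
    w = [1.0 / n] * n
    for _ in range(iterations):
        aw = [sum(a[i][j] * w[j] for j in range(n)) for i in range(n)]
        total = sum(aw)
        w = [x / total for x in aw]
    aw = [sum(a[i][j] * w[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / w[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    cr = ci / RI[n] if RI[n] > 0 else 0.0
    return w, cr


# Hypothetical mean ratings for the five second-grade indexes (illustrative only).
mean_ratings = [4.3, 4.1, 4.8, 4.0, 4.6]
matrix = judgment_matrix(mean_ratings)
weights, cr = ahp_weights(matrix)
```

A judgment matrix is conventionally accepted when CR is below 0.10; otherwise the pairwise comparisons are revisited.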
Determination of Objective Weights
The entropy weight method is used to determine objective weights. It assigns weights according to the entropy value of each index; because the weights depend only on the data set, the method avoids subjective influence and objectively reflects the characteristics of the data. The process of objective weighting of the second-grade indexes is explained as follows:
First, the 23 expert ratings were transformed into a scoring matrix
Second, according to the formula of information entropy
Finally, according to the
Table 8. Results of Objective Weighting of Second Grade Indexes.
Similarly, the objective weights of each second-grade index and its subordinate third grade indexes can be calculated according to the above methodology.
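The objective weighting steps can be illustrated with a minimal sketch of one common variant of the entropy weight method: each column of the scoring matrix is converted to proportions, its information entropy is scaled by 1 / ln(m) for m experts, and the weights are the normalized divergences 1 − e_j. Whether the study standardized the ratings before this step is not stated, so the absence of a standardization step, like the example ratings, is an assumption.

```python
import math


def entropy_weights(scores):
    """scores[i][j]: rating given by expert i to index j (ratings must be positive)."""
    m, n = len(scores), len(scores[0])
    k = 1.0 / math.log(m)
    divergences = []
    for j in range(n):
        column = [row[j] for row in scores]
        total = sum(column)
        # Share of each expert's rating in the column total.
        p = [x / total for x in column]
        # Information entropy, scaled to [0, 1] by k = 1 / ln(m).
        e_j = -k * sum(pi * math.log(pi) for pi in p if pi > 0)
        # Clamp at zero to absorb floating-point noise when a column is uniform.
        divergences.append(max(0.0, 1.0 - e_j))
    total_div = sum(divergences)
    if total_div <= 1e-12:
        raise ValueError("all indexes have uniform ratings; entropy weights are undefined")
    return [d / total_div for d in divergences]


# Hypothetical ratings: 4 experts x 3 indexes (illustrative only).
ratings = [[5, 4, 5],
           [5, 3, 4],
           [5, 5, 2],
           [5, 4, 3]]
objective = entropy_weights(ratings)
```

An index on which all experts agree carries almost no information under this method and therefore receives a weight close to zero, which is one reason the study pairs it with AHP.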
Determination of Combined Weights
Combining the analytic hierarchy process (AHP) and the entropy weight method to allocate comprehensive weights to evaluation indexes can effectively circumvent the limitations of a single weighting method (Jiang et al., 2024). The Lagrange multiplier method is used to determine the comprehensive weight by combining the subjective weights α and the objective weights β, so that the result reflects the importance of the indexes while making the weight of each index more objective and reasonable. Taking the second-grade indexes as an example, the steps of comprehensive weighting are as follows:
One is from the Lagrange multiplier formula
By the same formula, the comprehensive weights of the third-grade indexes can be obtained.
Second, based on the comprehensive weights of the second-grade indexes and the formula
Third, based on the above results and the formula
Finally, according to the comprehensive weights of the second grade indexes, the formula
The comprehensive weights of the second- and third-grade indexes are summarized and ranked, yielding the final weights of the index system for evaluating the implementation effectiveness of the “Double Reduction” policy (Table 9).
Table 9. Index Weights by Level.
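As an illustration of the combination step, the sketch below uses the closed form commonly reported for Lagrange-multiplier weight combination, in which each combined weight is the normalized geometric mean of the subjective weight α and the objective weight β. That this is the exact formula applied in the study is an assumption, and the weight vectors are hypothetical rather than the values in the weighting tables.

```python
import math


def combine_weights(subjective, objective):
    """Normalized geometric mean of AHP weights (alpha) and entropy weights (beta)."""
    if len(subjective) != len(objective):
        raise ValueError("weight vectors must have the same length")
    roots = [math.sqrt(a * b) for a, b in zip(subjective, objective)]
    total = sum(roots)
    return [r / total for r in roots]


# Hypothetical weights for five second-grade indexes (illustrative only).
alpha = [0.18, 0.12, 0.36, 0.09, 0.25]
beta = [0.22, 0.18, 0.24, 0.16, 0.20]
combined = combine_weights(alpha, beta)
```

Because the geometric mean penalizes disagreement between the two sources, an index ranks highly only when both the expert judgment and the data dispersion support it.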
Empirical Measurement Results
Calculate the Comprehensive Evaluation Vector
The results of the questionnaires were analyzed by considering the four regions as a whole, determining the set of fuzzy comprehensive evaluation indexes according to the constructed evaluation index system, and designing five levels of evaluation for each index, that is,
In the above equation, the
Similarly, the affiliation vectors of other third grade indexes under the second grade index “total homework amount and duration” are obtained as follows
Where
Therefore, according to the formula
Where
Since the affiliation vector of the second-grade indexes is the comprehensive evaluation vector of the second-grade indexes, the above evaluation vectors are combined to form the affiliation matrix (
Therefore, according to the formula
Using the same methodology, a composite evaluation vector
Because the first-grade indexes affiliation vector is the comprehensive evaluation vector of the first-grade indexes, the affiliation matrix (
Therefore, according to the formula
Calculation of a Comprehensive Evaluation Score
After the comprehensive evaluation vector
Then, according to the principle of weighted average, by the formula
Third Grade Indexes Scores.
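The fuzzy comprehensive evaluation described in this section can be sketched as three steps: membership vectors are the shares of respondents choosing each of the five grades, the evaluation vector is the weighted average B = W · R, and the composite score is the grade-weighted average of B. The grade scores, index weights, and response counts below are hypothetical, since the study's grade values and membership matrices are not reproduced here.

```python
def membership_vector(counts):
    """Share of respondents choosing each of the five evaluation grades."""
    total = sum(counts)
    return [c / total for c in counts]


def fuzzy_evaluate(weights, membership_matrix):
    """Weighted-average operator B = W . R, giving one value per evaluation grade."""
    n_grades = len(membership_matrix[0])
    return [sum(w * row[g] for w, row in zip(weights, membership_matrix))
            for g in range(n_grades)]


def composite_score(evaluation_vector, grade_scores):
    """Weighted average of the grade scores under the evaluation vector."""
    return sum(b * v for b, v in zip(evaluation_vector, grade_scores))


# Hypothetical response counts from 163 teachers for three third-grade indexes.
counts = [[40, 60, 40, 15, 8],
          [30, 55, 50, 20, 8],
          [50, 65, 30, 12, 6]]
R = [membership_vector(c) for c in counts]
W = [0.40, 0.35, 0.25]           # hypothetical weights, summing to 1
V = [90, 80, 70, 60, 50]         # hypothetical scores for the five grades

B = fuzzy_evaluate(W, R)         # evaluation vector of the parent second-grade index
score = composite_score(B, V)
```

Repeating the same operator one level up, with the second-grade evaluation vectors stacked as the membership matrix, gives the first-grade and overall evaluation vectors.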
Discussion
Empirical results from four Chinese provinces (Jiangsu, Shandong, Shanxi, Anhui) show the “Double Reduction” policy’s overall implementation effect is “good” (comprehensive score = 72.96/90, ≥70 threshold with 95% CI: 71.23–74.69). Two second-grade indexes—“quality of education and teaching” (weight = 0.2856) and “level of after-school services” (weight = 0.2269)—are the most influential. However, critical gaps persist: only 38% of respondents agreed with homework amount reductions, 36% with balanced compulsory education development, and 35% with diversified after-school service channels (scores < 70).
The study’s focus on “after-school program quality” aligns with the OECD’s (2022) Education Policy Outlook, which identifies “student well-being” as a core metric for educational policy success. China’s emphasis on integrating after-school services with burden reduction extends this framework by linking service quality to reduced reliance on private tutoring, a challenge understudied in OECD contexts. Additionally, the index “degree of balanced high-quality development of compulsory education” resonates with UNESCO’s “Education for All” goals (UNESCO, 1990), but the study addresses a global gap: while UNESCO prioritizes equity in access, this research quantifies equity in burden reduction, a distinct dimension for non-Western countries with heavy tutoring cultures (e.g., India, South Korea).
This study advances policy evaluation theory by proposing a three-dimensional evaluation framework (burden reduction degree, governance effectiveness, sustainability readiness)—a novel adaptation for non-Western contexts. Unlike Western models (e.g., CIPP Model) that prioritize process compliance, this framework centers on “effect-oriented” assessment (e.g., linking tutoring regulation to actual student burden reduction) and embeds “sustainability” (via indexes like “home-school collaboration”). This contributes to localizing Policy Instrument Theory, as it demonstrates how policy tools (e.g., after-school services, homework inspection) interact with cultural contexts (e.g., parental expectations of tutoring) to shape outcomes—filling a gap in global literature on non-Western policy evaluation.
Beyond guiding Chinese local governments, the findings offer actionable insights for emerging economies balancing burden reduction and quality improvement. Countries with large tutoring markets (e.g., South Korea, Turkey) could prioritize the “normative degree of tutoring” index (weight = 0.1469) and link regulation to school-based alternatives (e.g., extended after-school programs). Economies with regional educational disparities (e.g., Brazil, Indonesia) could replicate the “balanced high-quality development” index to monitor equity in burden reduction, supported by digital tools (e.g., big data education monitoring platforms).
When interpreting these findings, several methodological limitations should be considered. First, the primary reliance on teachers’ self-reported data poses a risk of social desirability and recall biases, potentially affecting the accuracy of ratings on issues like after-school service quality or homework duration. Second, the fuzzy comprehensive evaluation rests on subjective teacher ratings, with no objective verification indicators (e.g., official school records on homework, tutoring institution registration data) and no triangulation with data from students and parents, so the findings reflect subjective perceptions only and may underrepresent parent and student perspectives. Third, while the four-province sample offers valuable regional insights, it may not fully capture urban-rural disparities (e.g., rural schools’ limited after-school service resources) or regional variations in tutoring cultures across China, which limits the immediate generalizability of the results. Future research should therefore prioritize cross-cultural comparative studies (e.g., comparing China’s “Double Reduction” with tutoring regulations in countries like India or South Korea) to test the framework’s generalizability. Longitudinal designs (3–5 years) and mixed-methods approaches that incorporate objective data and the perspectives of students and parents would also help validate these initial findings and assess policy sustainability, in particular whether short-term burden reduction translates into long-term educational quality improvement.
Conclusion
This study has developed and preliminarily validated a unified, theory-driven evaluation system for China’s “Double Reduction” policy. Although the findings are constrained by the reliance on self-reported data and a regional sample discussed above, the study nonetheless makes three contributions to the field of educational policy evaluation.
First, it constructs the first systematic evaluation index system for “Double Reduction” implementation effectiveness, comprising 2 first-grade, 5 second-grade, and 23 third-grade indexes. This system resolves the issue of fragmented “self-constructed scales” in domestic research and provides a structured tool for holistically measuring both burden reduction and the effectiveness of supporting governance mechanisms.
Second, it demonstrates the practical utility of the fuzzy comprehensive evaluation method in the context of educational policy assessment. By integrating subjective (AHP) and objective (entropy weight) weighting techniques, the method effectively handles the inherent fuzziness of educational outcomes, offering a replicable approach for evaluating policies with complex, multi-stakeholder impacts.
Third, the study provides a valuable cross-cultural reference for global burden-reduction policy evaluation. The proposed three-dimensional framework (burden reduction, governance, sustainability), grounded in the empirical context of China—an emerging economy with a massive shadow education sector—helps to fill the gap in non-Western policy evaluation models.
In light of its contributions and acknowledged limitations, this research serves as a foundation for future inquiry. It advances the localization of policy evaluation theories in non-Western contexts and underscores the need for more robust, longitudinal, and cross-cultural research to refine the tools for assessing educational burden-reduction policies worldwide.
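As a rough illustration of the second contribution, the sketch below shows one way AHP weights and entropy weights could be combined and passed to a fuzzy comprehensive evaluation. It is not the authors’ implementation: the multiplicative combination rule, the weighted-average fuzzy operator, the function names, the grade scores, and every number in the usage lines are assumptions chosen only for illustration.

```python
import math


def entropy_weights(ratings):
    """Objective weights from rating dispersion.

    ratings[i][j] is respondent i's positive score on indicator j.
    Indicators whose scores vary more across respondents get more weight.
    """
    n, m = len(ratings), len(ratings[0])
    k = 1.0 / math.log(n)  # scales entropy into [0, 1]; requires n > 1
    divergence = []
    for j in range(m):
        column = [row[j] for row in ratings]
        total = sum(column)
        entropy = 0.0
        for x in column:
            p = x / total
            if p > 0:
                entropy -= k * p * math.log(p)
        # Clamp tiny negative values caused by floating-point rounding.
        divergence.append(max(0.0, 1.0 - entropy))
    s = sum(divergence)
    if s == 0:  # every indicator is perfectly uniform: fall back to equal weights
        return [1.0 / m] * m
    return [d / s for d in divergence]


def combine_weights(ahp, entropy):
    """Multiplicative synthesis of subjective and objective weights (assumed rule)."""
    products = [a * e for a, e in zip(ahp, entropy)]
    total = sum(products)
    return [p / total for p in products]


def fuzzy_evaluate(weights, membership):
    """Weighted-average operator B = W . R.

    membership[j][g] is the share of respondents placing indicator j in grade g.
    """
    grades = len(membership[0])
    return [
        sum(w * row[g] for w, row in zip(weights, membership))
        for g in range(grades)
    ]


def composite_score(grade_vector, grade_scores):
    """Collapse the grade-membership vector into a single score."""
    return sum(b * s for b, s in zip(grade_vector, grade_scores))


# Hypothetical usage: 4 respondents, 3 indicators, 4 evaluation grades.
ratings = [[4, 5, 2], [4, 3, 5], [5, 4, 1], [4, 2, 4]]
ahp = [0.5, 0.3, 0.2]             # assumed AHP weights
membership = [                     # assumed membership matrix R
    [0.4, 0.4, 0.1, 0.1],
    [0.3, 0.4, 0.2, 0.1],
    [0.2, 0.3, 0.3, 0.2],
]
grade_scores = [90, 70, 50, 30]    # assumed score attached to each grade

weights = combine_weights(ahp, entropy_weights(ratings))
grade_vector = fuzzy_evaluate(weights, membership)
print(weights, grade_vector, composite_score(grade_vector, grade_scores))
```

The weighted-average operator is used here because it lets every indicator contribute to the result; other fuzzy operators (for example max-min) would behave differently, and the source does not state which one the study applied.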
Footnotes
Ethical Considerations
This study was approved by the Institutional Review Board of the authors’ university. All procedures complied with the ethical standards of the 1964 Helsinki Declaration and its subsequent amendments, as well as guidelines for educational research involving human participants. Participant privacy and data security were strictly maintained, with no collection of personally identifiable information.
Consent to Participate
Informed consent was obtained from all participants prior to their involvement. Participants were fully informed of the study’s purpose, methods, duration, and data usage. They were also advised of their right to withdraw at any time without consequence. Consent was obtained electronically for expert and teacher participants, and verbally confirmed for parent participants.
Author Contributions
Yadong Ding contributed to the overall conception and design of the manuscript, the literature search, and the writing of the article. Bing Zhou contributed to data collection, processing, and analysis. Jing Li contributed to the planning, design, and implementation of the entire study. All authors contributed to the article and approved the submitted version.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Education Sciences Planning of China (Grant No. CKA250315).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
