Sage Journals: Discover world-class research

Abstract

This study examines the development of the ChatGPT-4o model developed by OpenAI to solve university-level probability problems over 8 months. The study is a descriptive study based on the general screening model. Within the scope of the study, 38 probability problems with four different grades and different difficulty levels [easy, medium, and hard] were modeled, and the answers of ChatGPT-4o were evaluated in terms of accuracy and consistency of the process steps at 8-month intervals. The difference in success between the problem-solving measurements of ChatGPT-4o was not only statistically significant but also practically remarkable. The problem-solving skill of ChatGPT-4o exhibited a statistically significant performance increase at different difficulty levels over time. Furthermore, when all levels were considered collectively, the general success rate improved substantially. The findings reveal that the model significantly increased performance, especially in problems with conditional probability and union/intersection events. At the same time, improvements were observed in the model’s ability to present the solution process in a more explanatory and systematic way. The results reveal ChatGPT-4o’s potential for reasoning during problem-solving processes and demonstrate how this capability has evolved over time. Overall, the integration of ChatGPT-4o models into mathematics education may support students’ fundamental problem-solving skills and enable teachers to allocate classroom time to more complex tasks. However, the model’s limitations should not be overlooked. In this context, further comprehensive research is needed to examine the effects of artificial intelligence–based systems on instructional processes, ensuring their effective, reliable, and pedagogically appropriate integration into mathematics teaching.

Keywords

ChatGPT-4o artificial intelligence mathematics education probability problem-solving processes

Introduction

Nowadays, artificial intelligence (AI) is increasingly coming to the forefront with its potential to exert significant impacts across various disciplines, particularly in the field of education (Sposato, 2025). In the context of mathematics education, AI technologies bring with them various advantages, such as personalized teaching, interactive learning environments, and real-time feedback, but also important problems such as being misleading, weakening creative thinking and problem-solving skills, opacity of reasoning processes, algorithmic biases, lack of emotional intelligence, and data privacy and security (O. A. Opesemowo & Ndlovu, 2024). However, over the last year, ChatGPT-4o, a model developed by OpenAI, has emerged as an artificial intelligence tool that continuously improves its mathematical problem-solving skills and can provide more efficient solutions in new areas (O. A. G. Opesemowo et al., 2024).

The concept of “intelligence” in AI refers to a machine that can solve mathematical problems. This means that the machine can produce its own solution instead of having pre-coded solutions. Therefore, AI investigates methods that enable computers to acquire the thinking, learning, and self-development abilities that can be achieved by human intelligence, aiming to design machines that can solve problems considered difficult by humans (Noster et al., 2024). One of these purposes, mathematical thinking, involves considering the world from a mathematical perspective, which entails modeling, using symbols, and applying mathematical concepts to various situations (Yazgan, 2007). For mathematicians, mathematics is a unique way of thinking that leads us to truth and definitive knowledge. For thought to occur, a problem must first be defined. To solve the problem, connections are established between concepts, a solution is sought, and thinking plays a role in this process. The ability to think mathematically develops in the problem-solving process (Ersoy & Guner, 2015). In other words, problems should be a starting point for mathematical thinking.

Artificial intelligence can support students’ problem-solving and mathematical thinking skills in mathematics education (Pinheiro & Santos, 2025). In understanding complex topics, it can provide personalized feedback that guides students’ cognitive processes and helps them identify their learning gaps. In this way, students gain opportunities to apply mathematical concepts in different contexts and to develop logical solutions to multi-step problems (Luong et al., 2025). Artificial intelligence tools, such as ChatGPT-4o, can support these processes and enhance the learning environment in mathematics education (İnci Kuzu, 2026).

Identifying the development of problem-solving skills supported by artificial intelligence is crucial for understanding the effectiveness of AI-assisted tools that can enhance students’ learning processes in mathematics education. Such an evaluation reveals the role artificial intelligence plays in students’ problem-solving processes across problems of varying difficulty levels and enables a more informed and effective use of AI in education. Therefore, monitoring progress in problem-solving skills facilitated by artificial intelligence is crucial for enhancing the effectiveness of teaching and learning processes in mathematics education.

Theoretical and Conceptual Framework

With the acceleration of digital transformation, artificial intelligence–based applications are becoming more visible and influential in the field of education (Kuzu & Erdoğan Kayabaşı, 2025). While the capacity of these technologies to transform teaching and learning processes is noteworthy, claims suggesting that the possibilities offered by artificial intelligence are unlimited should be approached with caution. Although AI tools present significant opportunities, their potential should be evaluated within realistic boundaries and in alignment with pedagogical requirements (Zhang & Aslan, 2021).

The development of artificial intelligence began in the 1960s, when Skinner’s “teaching machines” project introduced programmed lessons and automatic feedback, marking a significant milestone in the application of artificial intelligence in education (Watters, 2023). The 1970s saw a focus on research and development in artificial intelligence, student modeling, and adaptive learning systems (Kolhatin, 2025). In the 1980s, expert systems began to be utilized in education, providing information and guidance in specific areas (Gülleroğlu & Coşkun, 2021). The rapid development of the Internet, wireless network communications, and computer technologies has also accelerated the use of AI-supported classroom applications (Koukaras et al., 2025). The rise of the Internet in the 1990s provided new access channels and multimedia content for learning. In comparison, virtual learning environments and online courses gained widespread adoption in the 2000s, offering students greater flexibility in terms of time and space (Zhai, 2022). In the 2010s, big data and learning analytics began to be used for personalized learning experiences and adaptive assessments. Additionally, 2020 is viewed as a period during which the use of artificial intelligence-supported chatbots and virtual assistants in education increased, and research on student interaction and participation was also conducted (İşler & Kılıç, 2021). Released in its third version in 2022, ChatGPT-4o has paved the way for a new era in mathematics education, necessitating a reconsideration of instructional approaches in this field. This artificial intelligence-based tool has influenced the way mathematical problems are addressed, thereby demonstrating significant transformative potential in educational processes (Weßels, 2023).

Nowadays, AI applications are frequently used in education studies (Arslan, 2020). ChatGPT-4o, one of the most widely used applications, has attracted considerable attention, particularly in education. ChatGPT-4o is an AI-based chatbot developed by OpenAI (Birer, 2023). Cooper stated that ChatGPT-4o has numerous applications in education, including the potential to assist teachers in creating exams and evaluations (Cooper, 2023). In addition, algorithms can calculate the probability of students failing or dropping out of the course (Bahadir, 2016) or help continuously analyze student success (Zawacki-Richter et al., 2019). Such skills of artificial intelligence have enabled it to become an integral part of the educational vision in many countries in recent years (Wang et al., 2024). Artificial intelligence has been widely utilized to develop educational content, intelligent tutoring methods and systems, assess students, and enhance teacher-student interaction (Suárez, 2025). Artificial intelligence, which is likely to be discussed in the 21st Century, can be defined as an intelligent tutor that differs from other educational technologies. In addition, instead of providing learners with the same content within the program, it will continue to develop a model that enables all individuals to perform the learning function appropriately (Karabıyık, 2024).

Luckin and Holmes grouped AI applications in education into three categories: personal tutors, intelligent support for collaborative learning, and intelligent virtual reality (Luckin & Holmes, 2016). Baker and Smith classified AI tools for students, teachers, and systems (Baker & Smith, 2019). Alam stated that AI can help in education in terms of the educational process (assistance to routine tasks in the educational process and changes in this process) and the educational environment and content (what kind of education is needed for a subject; Alam, 2021). Today, education has evolved into a comprehensive process that incorporates teaching, measurable goals, and results both in and out of the classroom, rather than merely having students memorize specific pre-prepared texts (Kurtuluş & Gece, 2025). These reasons have led us to follow the development of AI.

In addition, examining the procedural development of problem-solving skills in recent years through AI-based modeling and systems such as ChatGPT-4o is important for revealing the potential increase in AI’s success rates over time, particularly in problems of varying difficulty levels. This perspective supports the study’s aims and provides a necessary framework for understanding the effects of artificial intelligence applications in education on learning processes and reasoning.

Review of Literature

The literature reviews conducted indicate that the number of studies addressing the problem-solving capacity of artificial intelligence has been steadily increasing. In particular, the competencies of ChatGPT-4o -based models in mathematical and scientific domains have been thoroughly examined across various research contexts. These studies demonstrate that, when provided with appropriate prompts, ChatGPT-4o is capable of performing advanced mathematical tasks such as integration, executing linear algebraic operations, and solving partial differential equations (Helfrich-Schkarbanenko, 2023). However, such achievements do not imply that the model can independently, consistently, and flawlessly solve mathematical problems under all conditions. It has been reported that ChatGPT-4o can solve many questions included in nationally administered mathematics examinations; nevertheless, this performance is often dependent on explicit instructions, and correct results are not always produced on the first attempt (Korkmaz Guler et al., 2024). In a study comparing the problem-solving performances of ChatGPT-4o and Google Bard across different levels of difficulty, it was reported that both systems yielded more reliable results, particularly in tasks involving simple arithmetic operations and basic logical reasoning (Plevris et al., 2023). Although ChatGPT-4o can be considered a powerful problem-solving tool, its success has been shown to largely vary depending on the complexity of the problem and the specific mathematical domain to which it belongs (Dao & Le, 2023).

Frieder et al. examined ChatGPT-4o’s performance on problems drawn from advanced mathematics textbooks used at elite universities, its ability to detect errors in mathematical proofs, and its performance on Mathematics Olympiad-level problems. Although the findings were noteworthy in certain respects, the researchers concluded that ChatGPT-4o can produce inconsistent and unreliable results when confronted with tasks requiring advanced mathematical reasoning (Frieder et al., 2023).

Overall, it is observed that ChatGPT-4o’s mathematical performance is highly sensitive to variables such as problem type, difficulty level, and mathematical domain. However, a notable limitation of much of the existing literature is that the types of tasks examined have not been systematically classified.

Within the datasets on which ChatGPT-4o is trained, the extent to which specific mathematical domains (e.g., algebra, analysis, or topology), problem types, and levels of difficulty are represented can directly influence the model’s knowledge structure and problem-solving proficiency. Moreover, this data composition may shape ChatGPT-4o’s problem-solving strategies, such as favoring particular numerical methods for differential equations or directing users toward specific digital tools during the solution process. However, the scope and effects of such tendencies have not been clearly established in the literature (Spreitzer et al., 2024).

In conclusion, ChatGPT-4o can be considered a tool capable of solving mathematical problems at different levels with a certain degree of success. Studies comparing ChatGPT-4o versions generally indicate that ChatGPT-4.o outperforms GPT-3.5. This suggests that, when used within appropriate pedagogical frameworks, ChatGPT-4o can serve as a powerful resource to support students’ mathematical understanding. Nevertheless, as ChatGPT-4o has become an integral part of classroom environments, this technology must be integrated into mathematics education, not in an unplanned manner, but through a conscious, systematic, and critical approach. As previously noted, evaluations of ChatGPT-4o’s mathematical performance vary depending on the mathematical level, problem type, and target audience. Therefore, it is of great importance for both teachers and students to accurately comprehend the opportunities and limitations offered by ChatGPT-4o across different mathematical contexts.

Purpose and Importance of the Study

In mathematics education, probability is one of the areas where conceptual difficulties and misconceptions are most frequently encountered (Memnun, 2008). Understanding probability requires deeper and more careful thinking than many other mathematical topics, as well as critical and intuitive reasoning, making logical estimations, and the effective use of a strong mathematical language (İnci Kuzu & Uras, 2018). Probability knowledge, which is widely applied in everyday life and across various professional fields, plays a crucial role in enabling individuals to make sound decisions in situations involving uncertainty (Konold et al., 2011). Despite its importance, probability instruction remains one of the most challenging topics in mathematics for both students and teachers (İnci Kuzu, 2021).

In recent years, with the widespread adoption of artificial intelligence–based large language models in the field of education, the role of tools such as ChatGPT-4o in mathematical problem-solving processes has attracted increasing scholarly attention. Existing studies indicate that ChatGPT-4o demonstrates high performance on basic and routine mathematical problems; however, as problem complexity increases, particularly in multi-step tasks and questions requiring a high cognitive load, its accuracy rates tend to decline (Karabıyık, 2024). Nevertheless, a review of the literature reveals that longitudinal studies examining how the problem-solving abilities of ChatGPT-4o or similar large language models change and develop over time are quite limited. The majority of current research evaluates model performance at a single point in time, thereby overlooking process-oriented development and changes in cognitive consistency.

The primary aim of this study is to address the gap in the literature by examining the process-oriented development of the problem-solving ability of the ChatGPT-4o model developed by OpenAI over an 8-month period. Within the scope of the study, changes over time in the model’s abilities to understand fundamental rules, execute multi-step operations, draw accurate inferences under conditions of uncertainty, and generate consistent solutions across problems of varying difficulty levels were analyzed. In this respect, the research not only describes the current performance of artificial intelligence but also provides empirical evidence on how the mathematical reasoning capabilities of large language models evolve.

In this context, the main research problem was formulated as follows:

How does the problem-solving ability of the ChatGPT-4o model developed by OpenAI change over an eight-month period with respect to questions of varying difficulty levels?

In line with this overarching problem, the following research questions were addressed:

How did ChatGPT-4o’s problem-solving performance across different difficulty levels change over 8 months?

How do the types of errors made by ChatGPT-4o in incorrectly solved questions, and the underlying causes of these errors, change over the 8-month period?

How does development over time affect the model’s consistency and cognitive performance, particularly in multi-step problems and those with a high level of difficulty?

In this respect, the study makes an original contribution to the literature by addressing the mathematical problem-solving processes of artificial intelligence-based large language models from a temporal development perspective, thereby enabling more robust and informed evaluations of the potential use of such systems in mathematics education.

Methodology

Research Model

This study aims to investigate the development of ChatGPT-4o’s performance in solving university-level probability-related questions of varying difficulty levels over an 8-month period. Accordingly, questions categorized as easy, medium, and difficult were presented to ChatGPT-4o, and the accuracy of the responses was evaluated by comparing the results obtained at two different time points separated by 8 months. This research is a descriptive study based on the general screening model. The research described the success of an artificial intelligence-based language model (ChatGPT-4o) in solving questions with different difficulty levels. The difficulty levels of the questions were predetermined. The answers given by the model to these questions were evaluated in terms of accuracy, and the success rates were statistically analyzed.

Study Group and Data Collection

The research study group consists of 38 questions, divided into four grades with varying difficulty levels (easy, medium, and hard) at each level. The questions are coded as L1E1 (Level 1, Easy, 1st question), L2H3 (Level 2, hard 3rd question) according to their grade and difficulty levels.

The relatively small number of questions within subgroups (e.g., specific grade levels and difficulty categories) may be considered a factor limiting statistical power. Nevertheless, in this study, each question was treated as an independent unit of measurement, and the analyses were conducted by administering the same set of questions to ChatGPT-4o at two different time points. In this way, the potential variance effects arising from small subgroup sizes were partially mitigated through the use of a repeated-measures design and non-parametric statistical methods, specifically the Wilcoxon signed-rank test.

Probability Problems—Classification in Four Different Levels

Level 1: Basic Concepts

Goal: Understand the basic concept of probability.

Easy: A die is rolled. What is the probability of rolling a 6?

Medium: What is the probability of rolling an odd number when rolling a die?

Hard: A die is rolled twice. What is the probability of rolling a five on at least one of the rolls?

Required Knowledge: Have knowledge about simple fractions and ratios.

Level 2: Joint Events

Goal: Independent events and joint probability.

Easy: A coin is rolled twice. What is the probability that both roll heads?

Medium: A die and a coin are rolled together. What is the probability that the die rolls a four and the coin rolls tails?

Hard: What is the probability that the sum of the numbers rolled when two dice are rolled is 8?

Required Knowledge: To have knowledge about independence, multiplication, and addition rules.

Level 3: Conditional Probability Objective

Goal: The probability of one event occurring after another event has occurred.

Easy: There are three red and two blue balls in a bag. If the first one is red, what is the probability that the second one is blue?

Medium: Three red and two blue balls are drawn from the bag, one after the other, without replacement. What is the probability that the second ball drawn is red?

Hard: If person A tests positive, what is the probability that they are sick? (A preliminary probability is given.)

Required Information: Conditional probability formula, prior knowledge

Level 4: Combinatorial Probability Objective

Goal: Calculating probability with permutations and combinations.

Easy: In how many different ways can three different books be arranged on a shelf?

Medium: Six different people are to be lined up side by side. In how many different arrangements are possible such that people A and B do not sit next to each other?

Hard: Seven people are to be arranged side by side in a row. People A and B will sit next to each other, while person C will not sit next to either of them. How many different seating arrangements are possible?

Required Information: Combinations, permutations, and regular grouping.

The first data of the study were obtained on 15.10.2024, and the last data were collected on 15.06.2025 from the paid version of ChatGPT-4o. These questions were classified based on expert opinions, ensuring content validity. The questions were directed to ChatGPT-4o at two different time points (first application and re-application 8 months later). The answers provided by the model were scored within the framework of predetermined true/false criteria, and the solution’s success was evaluated accordingly. In this way, the differences that may occur between the answers given by the model to the same questions in the 8 months were made statistically comparable (Table 1).

Table 1.

Level, Difficulty, and Success Distributions of Questions.

Grade	Difficulty level	Number of questions	Success rate (1st measurement)	Success rate (2nd measurement)
1	Easy	1	1.00	1.00
	Medium	1	1.00	1.00
	Hard	3	0.67	1.00
2	Easy	1	1.00	1.00
	Medium	1	1.00	1.00
	Hard	2	0.50	1.00
3	Easy	1	1.00	1.00
	Medium	7	0.71	1.00
	Hard	1	0.00	0.00
4	Easy	1	0.00	1.00
	Medium	5	0.20	1.00
	Hard	14	0.14	0.50

In the study, the correct answer rates of 38 questions classified at different difficulty levels were compared by ChatGPT-4o at two different time points (first and second measurements). The general success rate average in the first measurement was 0.60, and its standard deviation (Esdeira, 2017) was calculated as 0.42. In the second measurement, the success rate average was found to have increased significantly to 0.88. The SD value of the second measurement was 0.31. When Table 1 is examined, it is noted that the success rates generally reached 100% in the second measurement for all easy and medium level questions, and that there was a significant increase in the success rate of especially hard questions. For example, while the success rate in the 4th level, which included 14 hard questions, was only 0.14 in the first measurement, this rate increased to 0.50 in the second. Similarly, the success rate of the seven medium difficulty questions in the 3rd level was 0.71 in the first and reached 1.00 in the second. These findings suggest that ChatGPT-4o can provide more accurate answers to the same questions over time, indicating consistency and improved cognitive outcomes.

Data Analysis

Descriptive statistics were used to analyze the data obtained within the scope of the research. The success rates in the first and second measurements of 38 questions classified by difficulty level using ChatGPT-4o were calculated, and the mean and standard deviation values were determined.

The Wilcoxon signed-rank test was applied to determine whether the success rates between the two measurements contained a statistically significant difference, as the data did not meet the assumption of normal distribution. All analyses were performed using the SPSS 25.0 package program. Then, the solutions to each incorrectly made problem were subjected to content analysis by the researchers, and the reasons for the errors were examined.

Results

Within the scope of the research, difference-in-difference analyses were conducted for each grade and question with different difficulty levels at these grades, as well as for the total number of questions. First, difference-in-difference analyses were conducted on measurements of different grades and the total number of questions. To evaluate the success of ChatGPT-4o in answering questions classified at different difficulty grades over time, statistical comparisons were made between the two measurements. In this direction, the non-parametric Wilcoxon signed-rank test was applied to determine whether the success rates changed significantly over time. Analysis results are presented in Table 2.

Table 2.

Comparison of Success Rates by Grade.

Grades	Measurement	Average rank	Z	p	η²
Grade 1	1	0.00	−1.00	.317	—
Grade 1	2	1.00	−1.00	.317	—
Grade 2	1	0.00	−1.00	.317	—
Grade 2	2	1.00	−1.00	.317	—
Grade 3	1	0.00	−1.00	.317	—
Grade 3	2	1.00	−1.00	.317	—
Grade 4	1	0.00	−1.61	.109	—
Grade 4	2	6.00	−1.61	.109	—
Total	1	0.00	−2.20	.028	.13
Total	2	21.00	−2.20	.028	.13

According to Table 2, no statistically significant difference was found in the comparisons made for the questions at the first, second, and third grades (Grade 1: Z = −1.00, p = .317; Grade 2: Z = −1.00, p = .317; Grade 3: Z = −1.00, p = .317). These findings reveal that the responses given by ChatGPT-4o to the questions at these levels did not show any significant difference over time. Despite the increase in success rates at the fourth grade, the statistical significance limit was not reached (Z = −1.61, p = .109). However, as a result of the Wilcoxon test conducted at the total level (when all questions were considered together), the difference between the first and second measurements was found to be significant (Z = −2.20, p = .028). This situation reveals that ChatGPT-4o showed a statistically significant development in the general success level over time. Additionally, the effect size calculated to evaluate the impact of this difference was η² = .13. According to Cohen’s classification, this value indicates a medium effect size (Cohen, 2013). In other words, the difference in success between the measurements is statistically significant and practically remarkable. These findings show that large language models can give more accurate and consistent answers to the same questions over time, and that learning processes or model updates contribute to increased performance. Figure 1 summarizes the change in ChatGPT-4o’s success in answering questions classified at different levels over 8 months.

Figure 1.

Success rates by grades.

After examining the difference between the two measurements according to different grades, the success rates of both measurements according to different difficulty levels were compared, and the findings are given in Table 3. Table 3 shows no significant difference between the success rates in the first and second measurements for easy questions (Z = −1.00, p = .317). Similarly, no statistically significant difference was observed between the measurements for medium (Z = −1.34, p = .180) and hard (Z = −1.61, p = .109) questions. These results show that the model’s performance remained consistent in each category, regardless of the difficulty level, and did not significantly improve.

Table 3.

Comparison of Success Rates Across Difficulty Levels.

Levels	Measurement	Average rank	Z	p	η²
Easy	1	0.00	−1.00	.317	—
Easy	2	1.00	−1.00	.317	—
Medium	1	0.00	−1.34	.180	—
Medium	2	3.00	−1.34	.180	—
Hard	1	0.00	−1.61	.109	—
Hard	2	6.00	−1.61	.109	—
Total	1	0.00	−2.20	.028	.13
Total	2	21.00	−2.20	.028	.13

Figure 2 summarizes the change in ChatGPT-4o’s success in answering classified questions of different difficulty levels over 8 months. On the other hand, when all questions were considered together, when compared at the general success rate level, a significant difference emerged (Z = −2.20, p = .028). It was determined that the success level of ChatGPT-4o in the second measurement was significantly higher than in the first. This finding shows that the model can provide more successful answers to the same questions at a general level over time. As a result, it can be said that ChatGPT-4o showed a performance increase over time at different difficulty levels, although not statistically significant, and that the general success rate improved significantly, especially when all levels were considered collectively.

Figure 2.

Success rates according to difficulty levels.

The example presented in Figure 3 provides clear evidence of a temporal improvement in ChatGPT-4o’s problem-solving performance, particularly in terms of accurately interpreting probability questions and employing more coherent and effective solution strategies over time. When Figure 3 is examined, ChatGPT-4o, in its first solution, perceived the odd dice as 1-1, 3-3, and 5-5. However, there are also situations of 1-3, 1-5, 3-1, 3-5, 5-1, 5-3. Therefore, the solution was incorrect. However, in the second solution, it understood the question correctly. First, it calculated the probabilities of the dice not being odd and removed them from the probability calculation, thus arriving at the correct result.

Figure 3.

Comparison of ChatGPT-4o’s solutions to problem G1H1 8 months before and after.

Qualitative Findings on ChatGPT-4o’s Problem-Solving Performance: A Comparison of Initial and Final Assessments

Table 4 summarizes the qualitative analysis results regarding ChatGPT-4o’s performance in solving probability problems, presenting a comparison between the initial and follow-up measurements conducted over an 8-month period and highlighting the observed developments in terms of accuracy, reasoning, solution strategies, and problem-solving consistency.

Table 4.

Qualitative Findings on ChatGPT-4o’s Problem-Solving Performance Over 8 Months.

Dimension examined	First measurement	Second measurement (8 months later)
Understanding of basic concepts and standard problems	High rate of correctly solving problems that were correctly understood.	This relationship remained consistent, with improved reliability.
Use of standard formulas	Able to solve some non-routine problem types correctly, but performance was limited.	The number of correctly solved non-routine and novel problem scenarios increased.
Accuracy with increasing difficulty	Applied the elimination method and listed all possible cases in some problems.	Strategies were used more frequently and effectively.
Explanation of the solution process	Able to reach correct solutions in some high-distractor problems.	The frequency of correct solutions in such problems increased.
Correct understanding–correct solution relationship	Higher error rates were observed, including reasoning mistakes.	Correct solution rate increased, and reasoning errors decreased.
Non-routine problem scenarios	High tendency for incorrect inferences in long and multi-step problems.	The rate of incorrect inferences decreased, and logical consistency improved.
Problem-solving strategies	Errors observed in decimal multiplication, combination calculations, and simplifications.	Computational errors were significantly reduced.
High-distractor questions	Missed some possible outcomes and produced incomplete solutions.	Reduced the rate of missed critical points and expanded solution coverage.
Multi-step problems	High error rate in problems involving dependent events.	Accuracy in dependent event problems increased.
Long and complex problems	Made serious errors in some complex problems, e.g., calculating probabilities greater than 1.	Serious logical errors decreased, and the rate of correct solutions increased.
Computational errors	High rate of correctly solving problems that were correctly understood.	This relationship remained consistent, with improved reliability.
Multi-outcome or multi-step reasoning	Able to solve some non-routine problem types correctly, but performance was limited.	The number of correctly solved non-routine and novel problem scenarios increased.
Dependent event problems	Applied the elimination method and listed all possible cases in some problems.	Strategies were used more frequently and effectively.

The results indicate that ChatGPT-4o demonstrated a marked improvement in problem-solving performance over the 8-month period, particularly in probability problems involving dependent and conditional events. While the initial measurement revealed high error rates and incomplete reasoning in such problems, the final measurement showed that ChatGPT-4o more accurately identified dependencies between events, consistently established conditional probability relationships, and consequently achieved higher rates of correct solutions. Furthermore, whereas the initial measurement exhibited various types of errors, including computational errors (e.g., decimal multiplication, combination calculations, and simplifications), logical and reasoning errors, incomplete solutions or omitted possible cases, incorrect inferences in multi-step problems, and errors arising from susceptibility to distracting elements, these error types were substantially reduced in the final measurement. In addition, solution strategies were applied in a more effective and systematic manner, and the explanations provided became more coherent and pedagogically robust.

Discussion

In this study, the ChatGPT-4o model was analyzed over an 8-month period with respect to its capability to solve probability problems, and variations in its performance were systematically examined. The results indicate a gradual improvement in the model’s performance over time. According to the first research question, increases in correct response rates were observed for problems at medium and great difficulty levels. Nevertheless, the model continues to exhibit limitations in tasks involving conditional probability, dependent events, and multi-step reasoning processes. Occasional logical inconsistencies and inaccurate inferences were also identified. These findings suggest that while artificial intelligence–based language models may demonstrate performance improvements over time, they still face challenges when addressing highly complex and cognitively demanding problem scenarios. Comparable results have been reported in earlier studies (Sabzalieva & Valentini, 2023; Wang et al., 2024).

According to the second research question, it was determined that ChatGPT-4o makes errors when multiplying decimal numbers; it also makes mistakes in combination calculations in some problems and, in some cases, simple simplification errors. These findings indicate that ChatGPT-4o is prone to making operational or arithmetic errors. Furthermore, a one-way examination of the solution steps provided by ChatGPT-4o was insufficient to verify their accuracy. This shows that each step in ChatGPT-4o’s problem-solving process should be evaluated not only for whether the problem is correctly interpreted, but also for whether the mathematical operations are performed correctly. Similarly, Wang and colleagues also determined that ChatGPT-4o produces computational errors (Wang et al., 2024).

In problems involving dependent events, where the outcome of one event affects the occurrence of another, ChatGPT-4o produced incorrect solutions in multiple instances. This issue suggests that ChatGPT-4o’s performance in such problem types warrants more in-depth investigation. In a study examining ChatGPT-4o’s problem-solving abilities, Marinosyan (2024) reported that although ChatGPT-4o demonstrates a relatively strong level of conceptual understanding, it falls short in areas requiring rigorous mathematical proofs and consistent logical reasoning. While the model exhibits high performance on questions that can be solved using standard formulas, it still requires further development in tasks that demand the interpretation of special cases and multi-step logical reasoning (Alibrahim, 2024; Wardat et al., 2023).

According to the third research question, in some complex or high-difficulty problems, ChatGPT-4o attempted to generate alternative solution approaches but failed to arrive at the correct result. In certain cases, its logical reasoning process appeared to break down to such an extent that the model produced probability values greater than one. ChatGPT-4o primarily relies on algebraic equations and formulas to obtain correct solutions and is not particularly effective in employing other problem-solving strategies, such as estimation, verification, and refinement, resulting in inconsistent accuracy in its responses (Getenet, 2024; Marinosyan, 2024). These findings are consistent with the work of Frieder et al., who argue that mathematical and contextual complexity constitute a critical determinant of ChatGPT-4o’s performance (Frieder et al., 2023). Similarly, other studies indicate that mathematical and contextual complexity may significantly influence outcomes and that performance may remain limited in complex tasks (Dao & Le, 2023; Wardat et al., 2023).

The study found that ChatGPT-4o produced more consistent and accurate results in probability problems (Ray, 2023). These observed performance improvements are related to developer updates, expansion of training data, and improvements in model architecture; they are not the result of organic learning in a pedagogical sense and cannot be directly compared to human learning. In mathematics education, students can use ChatGPT-4o as a supplementary tool in basic modeling and standard problem-solving tasks; the model provides reliable results for simple calculations and basic logical reasoning, allowing teachers to focus on more complex problems. The possibility of incomplete or incorrect answers in intermediate or multi-step problems highlights the importance of critical evaluation and teacher supervision. It draws attention to the risk of student over-reliance (İnci Kuzu, 2026). Analyzing ChatGPT-4o’s solution errors presents a pedagogical opportunity, contributing to the development of both the use of AI-based tools and the ability to identify and correct incorrect solutions. Therefore, the model should be used as a supplementary educational tool, not a primary problem-solving tool, and traditional teaching methods and teacher supervision remain critical in complex tasks.

In summary, the rapid development exhibited by ChatGPT-4o models over a relatively short period, along with improvements in their problem-solving capabilities, offers significant potential for the integration of artificial intelligence into mathematics education. Furthermore, while supporting students’ basic problem-solving skills, it allows teachers to allocate classroom time more effectively to tasks that require greater complexity and interpretation. However, given the model’s current limitations, students need to continue developing their critical thinking and independent problem-solving skills. Maintaining this balance while leveraging ChatGPT-4o’s strengths is critical to students’ academic and professional success.

Conclusion and Recommendations

To enable more effective use of ChatGPT-4o in mathematics education, certain precautions and enhancements are required. Teacher supervision is of critical importance in the use of the model, particularly for problems that are open to interpretation, multi-step in nature, or of great difficulty. Enhancing ChatGPT-4o’s performance necessitates training processes enriched with mathematical data and the further development of its ability to generate step-by-step solutions. Artificial intelligence tools should be employed as complementary resources that support students’ fundamental problem-solving skills, and excessive student dependence on such tools should be avoided. Moreover, analyzing the model’s solution errors can be leveraged as a pedagogical strategy to foster students’ logical reasoning and problem-solving abilities. In the future, artificial intelligence systems are expected to improve their performance by minimizing problem-solving errors. However, from a pedagogical perspective, whether such systems can independently manage students’ learning processes remains a distinct and important topic for further research.

Limitations

The 38-item question set used in this study was developed within the framework of a descriptive research design aimed at characterizing ChatGPT-4o’s performance in solving probability problems across different levels of difficulty, rather than enabling statistical generalization. This limited dataset may be statistically insufficient for drawing broad conclusions. In addition, the exclusive focus on probability limits the extent to which a general evaluation of artificial intelligence performance can be made. External factors such as the timing and context in which the questions were administered, system load, internet connectivity, or the operational state of the platform at the time may have influenced the model’s performance. The findings obtained in this study are therefore likely specific to the particular model examined and the questions posed. Similar results may not necessarily be observed with different artificial intelligence models or across other disciplinary domains.

During the design phase of the study, the use of the existing artificial intelligence model (ChatGPT-4o) was planned. However, when evaluating the performance of ChatGPT-4o, the rapid advancement of artificial intelligence models, particularly the emergence of next-generation systems such as ChatGPT-5, may influence the interpretation of the findings. It is anticipated that ChatGPT-5, supported by more advanced algorithms and substantially larger datasets, may achieve higher levels of accuracy. This advancement could lead to a pronounced performance difference compared to ChatGPT-4o, especially in the context of complex probability problems. Future research may therefore be required to examine in greater detail the potential educational impacts of newer models such as ChatGPT-5. Accordingly, the findings of the present study should be interpreted with the understanding that they may change as more advanced artificial intelligence models become available.

Footnotes

ORCID iD

Çiğdem İnci Kuzu

Author Contributions

The authors have contributed equally throughout all stages of this work.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Karabuk University Scientific Research Project Unit supported this study within the scope of project number KBÜBAP-25-YL-033.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

There are no datasets associated with this research.

References

Alam

(2021, December 3–4). Should robots replace teachers? Mobilisation of AI and learning analytics in education [Conference session]. 2021 International Conference on Advances in Computing, Communication, and Control (ICAC3), Mumbai.

Alibrahim

A. A.

(2024). Is ChatGPT reliable in education? South African Journal of Education, 44(4), 1–9. https://doi.org/10.15700/saje.v44n4a2557

Arslan

A. K.

(2020). A design framework for auditing AI. Journal of Multidisciplinary Engineering Science and Technology (JMEST), 7(10), 12768–12776. https://www.jmest.org/wp-content/uploads/JMESTN42353353.pdf

Bahadir

(2016). Using neural network and logistic regression analysis to predict prospective mathematics teachers’ academic success upon entering graduate education. Educational Sciences: Theory and Practice, 16(3), 943–964. https://doi.org/10.12738/estp.2016.3.0214

Baker

Smith

(2019). Educ-AI-tion Rebooted? Exploring the future of artificialintelligence in schools and colleges. Nesta.

Birer

(2023). ChatGPT. TÜBİTAK Bilim ve Teknik Dergisi, 56(662), 36–37.

Cohen

(2013). Statistical power analysis for the behavioral sciences. Routledge.

Cooper

(2023). Examining science education in ChatGPT: An exploratory study of generative artificial intelligence. Journal of Science Education and Technology, 32(3), 444–452. https://doi.org/10.1007/s10956-023-10039-y

Dao

X.-Q.

N.-B.

(2023). Investigating the effectiveness of ChatGPT in mathematical reasoning and problem solving: Evidence from the Vietnamese national high school graduation examination. arXiv. https://doi.org/10.48550/arXiv.2306.06331

10.

Ersoy

Guner

(2015). The place of problem solving and mathematical thinking in the mathematical teaching. The Online Journal of New Horizons in Education-January, 5(1), 120–130. https://www.tojdel.net/journals/tojned/articles/v05i01/v05i01-12.pdf

11.

Esdeira

F. A. A.

(2017). The investigation of semantic learning environments for eknowledge management [Master’s thesis, Kastamonu University].

12.

Frieder

Pinchetti

Chevalier

Griffiths

R.-R.

Salvatori

Lukasiewicz

Petersen

Berner

(2023). Mathematical capabilities of ChatGPT [Conference session]. Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA.

13.

Getenet

(2024). Pre-service teachers and ChatGPT in multistrategy problem-solving: Implications for mathematics teaching in primary schools. International Electronic Journal of Mathematics Education, 19(1), Article em0766. https://doi.org/10.29333/iejme/14141

14.

Gülleroğlu

H. D.

Coşkun

(2021). Development of artificial intelligence in history and its usage in education. Ankara University Journal of Faculty of Educational Sciences (JFES), 54(3), 947–966. https://doi.org/10.30964/auebfd.916220

15.

Helfrich-Schkarbanenko

(2023). Mathematics and ChatGPT: A rendezvous at the foot of the technological singularity. Springer. https://doi.org/10.1007/978-3-662-68209-8

16.

İnci Kuzu

. (2021). Views of mathematics teacher candidates on the use of Geogebra in probability teaching. Asian Journal of Contemporary Education, 5(1), 45–56. https://doi.org/10.18488/journal.137.2021.51.45.56

17.

İnci Kuzu

. (2026). Mathematics teachers’ AI literacy, anxiety, and perceptions of AI integration in mathematics education: A mixed-methods study. BMC Psychology, 14(1), 42. https://doi.org/10.1186/s40359-025-03836-0

18.

İnci Kuzu

Ç.

Uras

M. C

. (2018). The subjects that the pre-service classroom teachers perceive as difficult in elementary mathematics curriculum. Universal Journal of Educational Research, 6(10), 2153–2159. https://doi.org/10.13189/ujer.2018.061013

19.

İşler

Kılıç

(2021). The use and development of artificial intelligence in education. e-Journal of New Media, 5(1), 1–11. https://dergipark.org.tr/en/pub/ejnm/issue/58097/738221

20.

Karabıyık

. (2024). An innovative approach to mathematics education: The role of ChatGPT. Uşak Üniversitesi Eğitim Araştırmaları Dergisi, 10(1), 26–46. https://doi.org/10.29065/usakead.1393487

21.

Kolhatin

A. O.

(2025, May 13). From automation to augmentation: A human-centered framework for generative AI in adaptive educational content creation [Conference session]. CEUR Workshop Proceedings, Kryvyi Rih.

22.

Konold

Madden

Pollatsek

Pfannkuch

Wild

Ziedins

Finzer

Horton

N. J.

Kazak

(2011). Conceptual challenges in coordinating theoretical and data-centered estimates of probability. Mathematical Thinking and Learning, 13(1–2), 68–86. https://doi.org/10.1080/10986065.2011.538299

23.

Korkmaz Guler

Dertli

Z. G.

Boran

Yildiz

. (2024). An artificial intelligence application in mathematics education: Evaluating ChatGPT’s academic achievement in a mathematics exam. Pedagogical Research, 9(2), Article em0188. https://doi.org/10.29333/pr/14145

24.

Koukaras

Ioannidis

Stavrinides

S. G.

(2025). AI-driven telecommunications for smart classrooms: Transforming education through personalized learning and secure networks. Telecom, 6(2), 21. https://doi.org/10.3390/telecom6020021

25.

Kurtuluş

A. N.

Gece

. (2025). A comparison of jean Jacques Rousseau’s naturalist educational philosophy and Friedrich Nietzsche’s cultural educational philosophy [A Comparison of Jean Jacques Rousseau’s Naturalist Educational Philosophy and Friedrich Nietzsche’s Cultural Educational Philosophy]. Kalemname, 10(20), 260–271. https://dergipark.org.tr/en/pub/kalemname/article/1823272

26.

Kuzu

Ç. I.

Erdoğan Kayabaşı

. (2025). Determining the digital literacy levels of mathematics pre-service teachers. Problems of Education in the 21st Century, 83(4), 563–578. https://doi.org/10.33225/pec/25.83.563

27.

Luckin

Holmes

(2016). Intelligence unleashed: An argument for AI in education. UCL Knowledge Lab. https://www.pearson.com/content/dam/corporate/global/pearson-dot-com/files/innovation/Intelligence-Unleashed-Publication.pdf

28.

Luong

V. T.

Long

T. T.

Linh

N. T. T.

(2025). Personalized mathematics teaching with the support of AI Chatbots to improve mathematical problem-solving competence for high school students in Vietnam. European Journal of Educational Research, 14(1), 323–333. https://doi.org/10.12973/eu-jer.14.1.323

29.

Marinosyan

A. K.

(2024). LLMs in physics and mathematics education and problem solving: Assessment of ChatGPT-4 Level and suggestions for improvement. https://cyberinka.ru/article/article/n/chtgpt-4-obucheni-i-mumezin-i-vozmohno-ograniciya-i-perpektivy-sversashestvovaniya

30.

Memnun

D. S.

(2008). Difficulties of learning probability concepts the reasons why these concepts cannot be learned and suggestions for solution. İnönüÜniversitesi Eğitim Fakültesi Dergisi, 9(15), 89–101. https://dergipark.org.tr/en/pub/inuefd/issue/8708/108730

31.

Noster

Gerber

Siller

H.-S.

(2024). Pre-service teachers’ approaches in solving mathematics tasks with ChatGPT. Digital Experiences in Mathematics Education, 10(3), 543–567. https://doi.org/10.1007/s40751-024-00155-8

32.

Opesemowo

O. A.

Ndlovu

(2024). Artificial intelligence in mathematics education: The good, the bad, and the ugly. Journal of Pedagogical Research, 8(3), 333–346. https://doi.org/10.33902/JPR.202426428

33.

Opesemowo

O. A. G.

Abanikannda

M. O.

Iwintolu

R. O.

(2024). Exploring the potentials of ChatGPT for instructional assessment: Lecturers’ attitude and perception. Interdisciplinary Journal of Education Research, 6, 1–12. https://doi.org/10.38140/ijer-2024.vol6.21

34.

Pinheiro

M. M.

Santos

(2025, August 22). Integrating computational thinking and artificial intelligence into mathematics education: A pathway for the 21st-Century. In Reis

Cravino

J. P.

Hadjileontiadis

Martins

Dias

S. B.

Hadjileontiadou

Mikropoulos

(Eds.), Technology and innovation in learning, teaching and education (Vol. 2479, pp. 82–92). Springer.

35.

Plevris

Papazafeiropoulos

Jiménez Rios

(2023). Chatbots put to the test in math and logic problems: A comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard. AI, 4(4), 949–969. https://doi.org/10.3390/ai4040048

36.

Ray

P. P.

(2023). ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003

37.

Sabzalieva

Valentini

(2023). ChatGPT and artificial intelligence in higher education: Quick start guide. https://eduq.info/xmlui/handle/11515/38828

38.

Sposato

(2025). Artificial intelligence in educational leadership: A comprehensive taxonomy and future directions. International Journal of Educational Technology in Higher Education, 22(1), Article 20. https://doi.org/10.1186/s41239-025-00517-1

39.

Spreitzer

Straser

Zehetmeier

Maaß

(2024). Mathematical modelling abilities of artificial intelligence tools: The case of ChatGPT. Education Sciences, 14(7), Article 698. https://doi.org/10.3390/educsci14070698

40.

Suárez

L. C.

(2025, March, 3–5). Teaching with artificial intelligence in virtual institutions: A Comparative study on student-teacher interaction [Conference session]. INTED2025 Proceedings, Valencia, Spain.

41.

Wang

K. D.

Burkholder

Wieman

Salehi

Haber

(2024). Examining the potential and pitfalls of ChatGPT in science and engineering problem-solving [Original Research]. Frontiers in Education, 8, Article 1330486. https://doi.org/10.3389/feduc.2023.1330486

42.

Wardat

Tashtoush

M. A.

AlAli

Jarrah

A. M.

(2023). ChatGPT: A revolutionary tool for teaching and learning mathematics. Eurasia Journal of Mathematics, Science and Technology Education, 19(7), Article em2286. https://doi.org/10.29333/ejmste/13272

43.

Watters

(2023). Teaching machines: The history of personalized learning. MIT Press.

44.

Weßels

(2023). ChatGPT – a milestone in AI development. Communications of the German Mathematical Society, 31(1), 17–19. https://doi.org/10.1515/dmvm-2023-0007

45.

Yazgan

(2007). Observations about fourth and fifth grade students’ strategies to solve non-routine problems [Observations about fourth and fifth grade students’ strategies to solve non-routine problems]. Elementary Education Online, 6(2), 249–263. https://dergipark.org.tr/en/pub/ilkonline/issue/8604/107172

46.

Zawacki-Richter

Marín

V. I.

Bond

Gouverneur

(2019). Systematic review of research on artificial intelligence applications in higher education – where are the educators? International Journal of Educational Technology in Higher Education, 16(39), 1–27. https://doi.org/10.1186/s41239-019-0171-0

47.

Zhai

(2022). ChatGPT user experience: Implications for education. SSRN. https://doi.org/10.2139/ssrn.4312418

48.

Zhang

Aslan

A. B.

(2021). AI technologies for education: Recent research & future directions. Computers and Education: Artificial Intelligence, 2, Article 100025. https://doi.org/10.1016/j.caeai.2021.100025

The Evolution of Artificial Intelligence in Mathematical Problem-Solving Processes: A Mathematics Education Perspective

Abstract

Keywords

Introduction

Theoretical and Conceptual Framework

Review of Literature

Purpose and Importance of the Study

Methodology

Research Model

Study Group and Data Collection

Probability Problems—Classification in Four Different Levels

Data Analysis

Results

Qualitative Findings on ChatGPT-4o’s Problem-Solving Performance: A Comparison of Initial and Final Assessments

Discussion

Conclusion and Recommendations

Limitations

Footnotes

ORCID iD

Author Contributions

Funding

Declaration of Conflicting Interests

Data Availability Statement

References