
Abstract

Most existing arithmetic word problem (AWP) solvers focus on solving simple examples. Transfer case-AWPs (TC-AWPs) involve scenarios where objects are transferred between agents. The widely used AWP datasets mainly consist of simple TC-AWPs (problems that involve a single object transfer). Current large language models (LLMs) are capable of solving most of these simple TC-AWPs effectively. In this work, we focus on assessing the solving capability of LLMs (ChatGPT and Gemini) for complex TC-AWPs (where multiple types of objects are transferred or more than one transfer of an object is performed). Since the popular AWP datasets contain only simple TC-AWPs, we first generate complex TC-AWPs using an ontological approach. We utilize these complex examples to assess LLMs’ word-problem-solving capabilities. We observe that the accuracy of LLMs drops rapidly as the number of object transfers increases to 3 or 4. An approach for solving TC-AWPs using ontologies and machine learning (ML) exists in the literature. We propose an extension of this approach that can handle complex TC-AWPs and find that, compared to the current LLMs, the proposed solution gives better accuracy for complex TC-AWPs. We analyze the failed cases of the LLM approach and find that the reasoning capabilities of LLMs need substantial improvement.

Keywords

Arithmetic Word Problems, Ontology, Large Language Models, Reasoning, Semantic Web Rule Language (SWRL)

1. Introduction

Arithmetic word problems (AWPs) are elementary math problems in which numbers are dispersed in the problem text, and they can be solved by combining these numbers with basic math operations (addition, subtraction, multiplication, and division). Transfer case-AWPs (TC-AWPs), a subset of AWPs, are those word problems where problem texts involve object transfers among agents. The popular AWP datasets such as AllArith (Roy & Roth, 2017), MAWPS (Koncel-Kedziorski et al., 2016b), and Dolphin (Huang et al., 2016) contain simple TC-AWPs (i.e., word problems involving a single object transfer). Kumar and Kumar (2024) focused on solving simple TC-AWPs by proposing a knowledge and learning-based approach. They developed TC-Ontology to encode domain knowledge and utilized it in the solution (i.e., automatic solver). Furthermore, Kumar and Sreenivasa Kumar (2024) extended the TC-Ontology and leveraged it for checking the mathematical validity of the machine-generated TC-AWPs. Our work extends the ideas proposed in these two approaches. The proposed work focuses on only the AWP-Tr domain (i.e., both simple and complex TC-AWPs expressed in English). Since the existing datasets do not contain complex TC-AWPs, we first generate such AWPs. We consider a TC-AWP complex when it involves more than one object transfer of either a single type or multiple types of objects. The examples of simple and complex TC-AWPs are given in Figure 1.

Figure 1.

Examples of simple and complex transfer case-arithmetic word problems (TC-AWPs).

In the last decade, AWP solving has been widely attempted, and the state-of-the-art (SOTA) approaches have evolved around the following ideas: rule-based solution, statistical modeling, tree-based modeling, template-based solution, incorporating domain knowledge, neural-based models, etc. (Zhang et al., 2018). With the arrival of large language models (LLMs), all these models became less popular as LLMs could solve AWPs more effectively. Therefore, the proposed work focuses on assessing the AWP-solving capabilities of LLMs (ChatGPT-3.5¹ and Gemini²). We focus on the SOTA language models that provide user interfaces to interact with and are also openly available. Therefore, we exclude the SOTA LLM models, such as WizardMath (Luo et al., 2023), MAmmoTH (Yue et al., 2023), and LLaMa-2 (Touvron et al., 2023), and non-LLM models, such as Text2Math (Zou & Lu, 2019). In this context, incorrect answers are dangerous and can mislead the users. It is assumed that a general user does not have an idea about how to use prompts. Therefore, we assess LLMs as they are. Our assessment includes only complex TC-AWPs. We observed that LLMs could not solve a large proportion of these examples.³ A few example TC-AWPs that LLMs could not solve are given in Appendix 9. The proposed ontology-based approach performs better than LLMs while solving complex TC-AWPs.

Concerning the TC-AWP domain, the existing works that adopted ontology-based modeling, the solver (Kumar & Kumar, 2024), and the validity-checker (Kumar & Sreenivasa Kumar, 2024), focused on processing word problems sentence-wise. They identified the following four sentence categories: before transfer (BT; e.g., Stephen has 17 books), transfer (TR; e.g., Stephen gave 5 books to Daniel), after transfer (AT; e.g., Now Daniel has 15 books), and question (QS; e.g., How many books does Stephen have now?). These four sentence categories are represented as four ontology classes (details are in Section 3). The above-mentioned systems incorporated domain knowledge by developing an ontology (namely TC-Ontology). Since the proposed work leverages TC-Ontology, we include a summary in Section 3 and discuss the required extension.

In summary, the proposed work shows the importance of incorporating domain knowledge in the tasks related to the AWP domain, such as generation and solving. Our contributions are:

We extend the TC-Ontology to demonstrate the generation of complex TC-AWPs from the ontological representations of simple TC-AWPs. This process required enhancements to the ontology to accommodate the creation of more complex problems.

We utilize complex TC-AWPs to evaluate the performance of LLMs in solving these problems. Furthermore, we propose an ontological approach for solving complex TC-AWPs, which showcases enhanced reasoning capabilities and proves to be more effective in solving these complex word problems.

The remainder of the article is organized as follows. Section 2 details the related work. Section 3 briefly discusses the background and TC-Ontology. Section 4 presents the proposed approach. Section 5 discusses the experimental setup and results. Limitations of the proposed approach and conclusions of the work are given in Sections 6 and 7, respectively.

2. Related Work

In the proposed work, we focus on both the generation and solving aspects of TC-AWPs.⁴ Various approaches have been proposed in the literature for solving word problems; however, the generation aspect of word problems is not widely attempted.

2.1. Approaches for AWP Generation

Williams proposes a prototype to convert Web Ontology Language (OWL) ontologies into word problems. The approach uses the SWAT tool (Robert et al., 2011) to convert the lexical entries into English statements. For example, the property “hasType” is converted to “is a kind of” natural language text. The generated sentences are then grouped together to form a word problem text. Polozov et al. (2015) use answer set programming to generate word problem text from student and teacher requirements. Koncel-Kedziorski et al. (2016a) proposed a theme-rewriting approach for generating algebra math word problems. The drawback of the approach is that it requires another word problem as an input, resulting in the new problem having a similar template to the input.

The neural network-based approach of Zhou and Huang (2019) generates word problems from equations and topics. Two recurrent neural network encoders map equations and topics to hidden vectors and word representations, respectively. The outputs of these two encoders are then concatenated and fed to a decoder to generate word problems. The method requires a large amount of annotated data. The authors state that the following two types of errors are observed in the generated examples: (a) problem soundness: the generated examples lack semantic coherence; and (b) equation matchness: the template of the input equation only partially correlates with the output. However, they do not mention the proportion of such examples. Given an equation template $z = x - y$ and topic words {has, apples, gave}, the system may generate the following word problem: “Stephen has 5 apples. He gave 10 apples to Mike. How many apples does Stephen have now?” In the KLAUS-Tr system (Kumar & Kumar, 2024), the authors define the above problem as invalid, as one cannot give more than one owns. The neural network-based work does not mention how appropriate domain knowledge is provided to the system.

Attempts have been made toward how to utilize commonsense knowledge (Liu et al., 2021; Qin et al., 2023) and improve mathematical validity (Wang et al., 2021) during generation. Multilingual language models for word problem generation are also being explored (Niyarepola et al., 2022).

Our proposed work generates word problems related to a specific topic, that is, transfer-type AWPs. It takes simple TC-AWPs as input and generates complex TC-AWPs. Unlike learning-based approaches, our approach does not require additional training examples and annotations. Since the generation process is backed by domain knowledge, the generated word problems are always valid.

2.2. Approaches for AWP Solving

Before the LLM era, the following were the popular SOTA AWP solvers (Roy & Roth, 2017, 2018; Wang et al., 2018b; Zou & Lu, 2019). We skip the discussion on these solvers as we primarily focus on analyzing the AWP-solving capabilities of LLM-based solvers.

Wei et al. (2022) adopt and test the idea of chain-of-thought prompting in LaMDA (Thoppilan et al., 2022), generative pretrained transformer-3 (GPT-3; Brown et al., 2020), PaLM (Chowdhery et al., 2022), UL2 20B (Tay et al., 2023), and Codex (Chen et al., 2021) LLMs to improve arithmetic reasoning. The word-problem-solving ability of LLMs is highly dependent on how well the prompts are designed. Since we assess the Gemini and ChatGPT LLMs as they are, we do not discuss the design of the prompts and also exclude a detailed discussion of prompt-based solutions. However, it appears that the current versions of the LLMs we have tested already incorporate chain-of-thought prompt-based training: in the outputs we obtained on sample cases, the model generated intermediate statements before giving the final answer. In spite of this, the results on complex problems are not very good. The proposed approach provides a simple solution to this challenging problem (i.e., performing a series of intermediate reasoning steps effectively).

Zong and Krishnamachari (2023) focused on solving math word problems using the GPT-3 model. They focused on analyzing the following three tasks using the GPT-3 model: classifying word problems, extracting equations from the problem text, and generating similar word problems using one given example. The work shows promising results for all three tasks mentioned above. However, the paper states that directly applying commonsense knowledge to improve word-problem solving remains an issue. Also, scaling up the model size alone seems not to be sufficient for achieving high performance on the reasoning tasks (i.e., arithmetic, commonsense, and symbolic; Rae et al., 2021). Therefore, the domain knowledge-based solutions should be explored further. In the proposed work, we show how to encode and utilize domain knowledge while solving word problems.

3. Background and TC-Ontology

In this section, we discuss the background of ontologies. Since our proposed approach extends and utilizes the TC-Ontology developed in previous work, we also provide a summary of that ontology.

3.1. Background

Ontology is a formal framework for representing knowledge within an application domain. It provides a structured way of modeling concepts (classes), properties (roles) that detail the attributes and relationships of these concepts, and constraints on these properties. This structured representation facilitates consistent understanding, interoperability, and reasoning across systems, making it crucial in fields such as artificial intelligence, the semantic web, and information science. The Resource Description Framework Schema (RDFS; Brickley & Guha, 2004) and the OWL (Bechhofer et al., 2004) are two widely used frameworks for computationally processing ontologies, each differing in their levels of expressive capabilities. The Resource Description Framework (RDF; Klyne & Carroll, 2004) is a data modeling standard that facilitates effective information exchange across the web with reasoning capabilities. It serves as the foundation for building RDFS and OWL technologies. In the following, we discuss RDF, RDFS, and OWL in brief.

(a)
Resource Description Framework (RDF): RDF (Klyne & Carroll, 2004) serves as a foundational framework for encoding, reusing, and exchanging structured metadata. It enables the representation of various resources within a domain through statements structured as triplets (subject, predicate, and object). Each element of these triplets is uniquely identified by a uniform resource identifier or an internationalized resource identifier. RDF serves as the foundation for RDFS and OWL technologies, which are discussed in the following.
(b)
Resource Description Framework Schema (RDFS): RDFS (Brickley & Guha, 2004) is a language for describing vocabularies and adding schema to RDF. Here, a vocabulary is a set of classes with specific properties that use the RDF data model to provide the essential elements for modeling a domain. RDFS also specifies entailment rules/axioms to infer new triples. In essence, RDFS provides a way to create a domain-specific vocabulary, often referred to as a minimal ontology, and can be used to make and infer domain statements.
(c)
Web Ontology Language (OWL): OWL (Bechhofer et al., 2004) is a family of semantic web languages designed to formally represent complex knowledge of an application domain. Since it is based on computational logic, computer programs can utilize the knowledge expressed in OWL. It expands upon RDFS by introducing additional constructs for defining classes and properties, such as cardinality constraints, disjoint classes, etc. The World Wide Web Consortium (W3C) has standardized three variants of OWL with increasing expressive power: OWL-Lite, OWL-DL, and OWL-Full. OWL-DL, which offers maximal expressiveness while ensuring computational completeness and decidability, is particularly suited for our modeling purposes. Description logics (DLs; Baader et al., 2010) are decidable fragments of first-order logic, and provide a logical foundation to OWL. Since the expressive power of OWL-DL is sufficient to model the information and the constraints required in our proposed approach, we focus on the $\mathcal{SHOIN}^{(D)}$ DL and OWL-DL. Certainly, the ontology can also be represented in OWL 2 DL format.

An OWL-DL ontology can be viewed as a pair ($T$, $A$), where $T$ represents the terminological box (TBox) containing definitions of concepts and properties using vocabulary terminology. The assertional box (ABox), represented by $A$, is used to provide the membership assertions, either as concepts or as properties, and is the place where the details of a given concrete situation (extracted from the word-problem text) in the domain are presented.
(d)
Semantic Web Rule Language (SWRL): SWRL (Horrocks et al., 2005) is a rule language for the Semantic Web, which was developed to address the limitations of OWL in making assertions using properties, which are important for practical applications (Horrocks et al., 2005). While OWL has significant expressive power, it lacks the capability to represent complex logic involving properties. SWRL extends OWL with Horn-clause rules, enabling more advanced reasoning and overcoming these expressive constraints. In the context of OWL-DL, DL-safe SWRL rules (Motik et al., 2005) provide feasible reasoning. A DL-safe SWRL rule takes the form $a_{1} \land a_{2} \land \dots \land a_{k} \to a_{k + 1} \land a_{k + 2} \land \dots \land a_{n}$ , where each $a_{i}$ is an atomic unit. Theoretically, each atom represents either $C (a)$ or $P (b, c)$ , where C denotes a class, P denotes a property, and $a, b,$ and $c$ are either individuals or variables. In the TC domain, we develop and utilize SWRL rules for several purposes, including checking the feasibility of object transfers, updating ownership of quantities after an object transfer, and other related tasks.
(e)
SPARQL Protocol and RDF Query Language (SPARQL): The W3C has endorsed SPARQL (Prud’hommeaux & Seaborne, 2008) as a query language for querying RDF graphs. SPARQL allows users to express queries across diverse data sources, whether the data are available as a native RDF graph or accessed through middleware. SPARQL supports four types of queries: SELECT, CONSTRUCT, ASK, and DESCRIBE. In the TC domain, we focus on SELECT queries, which are used to retrieve specific information from the RDF graph of a TC word problem. Consequently, our discussion here is limited to SELECT queries. A SELECT query has two main components: a list of variables to be retrieved and a WHERE clause that specifies the triple patterns to match. In the query, variables are denoted by a “?” followed by the variable name, such as “?x.” The triple pattern itself consists of three placeholders, each of which can be either a variable or a specific keyword, used to define the search criteria. For example, the pattern ex:Stephen foaf:knows ?person in a SPARQL query can be used to search an RDF graph to find all the people that Stephen knows. In the TC domain, we use SELECT-type SPARQL queries to retrieve the information sought in the QS of a TC word problem.
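To make the SELECT mechanics concrete, the following is a minimal pure-Python sketch of triple-pattern matching over a small graph. It is not a SPARQL engine and not the system described here; the graph contents, the `tc:` individual names (`tc:q1`, `tc:q2`), and the helper names `match_pattern` and `select` are assumptions introduced only for illustration. Terms starting with "?" are treated as variables, and the WHERE clause is a list of triple patterns that must all match under one consistent binding.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical ABox fragment for "Stephen has 12 books. Daniel has 15 books."
GRAPH: List[Triple] = [
    ("tc:Stephen", "tc:hasQuant", "tc:q1"),
    ("tc:q1", "tc:quantValue", "12"),
    ("tc:q1", "tc:quantType", "books"),
    ("tc:Daniel", "tc:hasQuant", "tc:q2"),
    ("tc:q2", "tc:quantValue", "15"),
    ("tc:q2", "tc:quantType", "books"),
]


def match_pattern(pattern: Triple, triple: Triple,
                  binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):           # variable position
            if result.get(p, t) != t:   # already bound to a different term
                return None
            result[p] = t
        elif p != t:                    # constant position must match exactly
            return None
    return result


def select(variables: List[str], where: List[Triple],
           graph: List[Triple]) -> List[Tuple[str, ...]]:
    """Return the values of `variables` for every binding satisfying all patterns."""
    bindings: List[Dict[str, str]] = [{}]
    for pattern in where:
        extended = []
        for b in bindings:
            for t in graph:
                b2 = match_pattern(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
    return [tuple(b[v] for v in variables) for b in bindings]


# "How many books does Stephen have?"
answer = select(
    ["?v"],
    [("tc:Stephen", "tc:hasQuant", "?q"),
     ("?q", "tc:quantType", "books"),
     ("?q", "tc:quantValue", "?v")],
    GRAPH,
)
```

In this sketch the shared variable `?q` joins the three patterns, which is the same role a shared variable plays in the WHERE clause of a SELECT query.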

3.2. TC-Ontology

TC-Ontology, developed using OWL-DL, was initially proposed in the KLAUS-Tr system (Kumar & Kumar, 2024) and was reused (after appropriate extension) in the validity-checker system (Kumar & Sreenivasa Kumar, 2024). In the proposed work, we reuse the extended TC-Ontology (Kumar & Sreenivasa Kumar, 2024) after making the appropriate changes. Kumar and Kumar (2024) analyzed the various TC-AWPs and first devised the vocabulary of the TC-Ontology. The summary is as follows:

Concepts/Classes: The authors treat each AWP as an individual belonging to the class Word-Problem. They propose a categorization of the word-problem sentences into individuals belonging to the following ontology classes: BT, TR, AT, and QS. The BT class contains sentences that carry the agent quantity and associated information before the object transfer. Similarly, the AT class contains the sentences that carry the posttransfer information. The TR class has sentences that contain object transfer information. The QS class contains the query sentences that seek the information from the following: BT facts, AT facts, or TR being carried out. To represent the knowledge present in a sentence, the following concepts are devised: Agent, TC-Quantity, PositiveQuantity, NegativeQuantity, etc. The specific agent, such as Mike, and a specific number in the problem text become individual members of the classes Agent and TC-Quantity, respectively.

Properties: Table 1 presents the essential properties (along with domain and range information) available in the TC-Ontology. The domain of a property is the concept/class to which the subject of an RDF⁵ statement using that property belongs, while the range is the class of the statement’s object (value).

In addition to concepts and properties, TC-Ontology is equipped with axioms for inferring the quantities involved in the math operations (in the TC-AWP domain, subtraction and addition). Appendix 8 presents the essential axioms of the TC-Ontology. To solve a given AWP, the KLAUS-Tr system (Kumar & Kumar, 2024) utilizes ontology inferences (made by axioms) and SWRL (Horrocks et al., 2005) rules that capture the knowledge about how an object transfer should affect the RDF graph of the word problem being solved.

Table 1.
Important Properties Devised in TC-Ontology

Property	Domain	Range
hasBT	Word-Problem	BT
hasTR	Word-Problem	TR
hasAT	Word-Problem	AT
hasQuestion	Word-Problem	QS
fromAgent	TR	Agent
toAgent	TR	Agent
hasQuant	Agent	TC-Quantity
hasLost	Agent	TC-Quantity
hasGained	Agent	TC-Quantity
quantValue	TC-Quantity	Literal
quantType	TC-Quantity	Literal

Note. BT = before transfer; TR = transfer; AT = after transfer; TC = transfer case.
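As a reading aid for Table 1, the sketch below encodes the listed domain and range pairs in a Python dictionary and checks whether a statement uses a property with the expected subject and object classes. This is a simplification under stated assumptions: in OWL, domain and range axioms drive inference rather than act as validation constraints, and the helper name `check_statement` is hypothetical.

```python
# Domain and range pairs transcribed from Table 1.
PROPERTY_SIGNATURES = {
    "hasBT": ("Word-Problem", "BT"),
    "hasTR": ("Word-Problem", "TR"),
    "hasAT": ("Word-Problem", "AT"),
    "hasQuestion": ("Word-Problem", "QS"),
    "fromAgent": ("TR", "Agent"),
    "toAgent": ("TR", "Agent"),
    "hasQuant": ("Agent", "TC-Quantity"),
    "hasLost": ("Agent", "TC-Quantity"),
    "hasGained": ("Agent", "TC-Quantity"),
    "quantValue": ("TC-Quantity", "Literal"),
    "quantType": ("TC-Quantity", "Literal"),
}


def check_statement(subject_class: str, prop: str, object_class: str) -> bool:
    """Return True if the statement agrees with the Table 1 signature of `prop`."""
    domain, range_ = PROPERTY_SIGNATURES[prop]
    return subject_class == domain and object_class == range_


# A transfer sentence points to its giving agent via fromAgent.
ok = check_statement("TR", "fromAgent", "Agent")
# Swapping subject and object classes violates the signature.
bad = check_statement("Agent", "fromAgent", "TR")
```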

Note that a source-to-destination edge in the RDF graph can also be viewed as a triple of the form (source-individual, edge-label, destination-individual). In the discussion below, we also use triples for ease of understanding. The RDF graph is also referred to as the ABox (refer to Appendix 8) in this paper. Figure 4 shows the RDF representation of a simple TC-AWP given in Figure 1. This RDF representation utilizes the vocabulary described above. In Section 4.1, we show how to use the graphical representations of the simple TC-AWPs and add more edges to the graph to obtain the representations of the complex TC-AWPs. This section provides only the necessary details of the TC-Ontology. Appendix 8 provides more details on TC-Ontology and its use in the KLAUS-Tr (Kumar & Kumar, 2024) and OLGA (Kumar & Sreenivasa Kumar, 2024) systems.

Why do we need to extend the TC-Ontology: During the generation of complex TC-AWPs, we need to maintain the order of the sentences; therefore, we add a new data property hasSequenceNumber. The property takes $BT \sqcup TR \sqcup AT \sqcup QS$ and Integer as its domain and range, respectively. Section 4.1 details how the generation module utilizes the hasSequenceNumber property.

Since the KLAUS-Tr system focused on solving simple TC-AWPs (generally triggering one subtraction and one addition), the ontology and the SWRL rules developed were sufficient. However, to solve complex TC-AWPs, the proposed system needs to perform a sequence of reasoning steps. If the SWRL rule shown in Figure 2 (developed using logic similar to that proposed in KLAUS-Tr) is used to perform this sequence of reasoning, it makes the ontology inconsistent: the rule keeps firing and attempts to assign an unbounded number of values to the quantValue property.

Figure 2.

This Semantic Web Rule Language (SWRL) rule will make the ontology inconsistent.
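The following toy sketch illustrates why a rule that rewrites the same property it reads from does not settle under monotonic reasoning. It does not reproduce the rule in Figure 2; it assumes a hypothetical rule of the shape quantValue(q, v) ∧ amount(n) → quantValue(q, v − n), and the function name `naive_fixpoint` is an illustration only. Because derived facts are never retracted, every newly derived value satisfies the rule body again, so the set of values keeps growing until an artificial cap stops it.

```python
def naive_fixpoint(initial_value: int, amount: int, max_rounds: int) -> set:
    """Repeatedly apply v -> v - amount to every known value, keeping old values."""
    values = {initial_value}
    for _ in range(max_rounds):
        derived = {v - amount for v in values}
        if derived <= values:
            break  # a fixpoint would stop here; it is never reached for amount != 0
        values |= derived  # monotonic: earlier values are never retracted
    return values


# Stephen starts with 17 books and gives away 5; the cap of 4 rounds is arbitrary.
values_after_cap = naive_fixpoint(17, 5, 4)
```

With the cap removed, the loop would never terminate, and a single quantity would accumulate many distinct values, which matches the behavior described for the quantValue property.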

Therefore, to perform the sequence of reasoning, we use two additional properties and the Owlready2⁶ Python library. In the following, we explain the required additional properties.

We add the hasTransferSequence data property to capture the sequence of the reasoning (i.e., the order in which the object transfers take place). Its domain and range are TR and Integer, respectively.

We add the hasUpdatedValue data property to capture the effect of the sequential reasoning. A sequential reasoning situation arises when the value of a quantity is updated by an SWRL rule and the updated value is required in the transfer that follows. Its domain and range are TC-Quantity and Literal, respectively.

Using Owlready2, we write the updated quantity value to the ontology and rerun the reasoner to perform the reasoning required by the next object transfer. Section 4.2 details how these properties are used in the solution design.

4. The Proposed Approach

This section presents our proposed approach, which consists of two components: a generator and a solver. Our work focuses on a specific subset of TC-AWPs. Figure 3 shows the overall system architecture of the proposed approach. The preprocessed text of simple TC word problems is transformed into ontological representations through three steps: sentence classification, a bidirectional encoder representations from transformers (BERT)-based language model, and a custom script that populates the ontology ABox. On the generation side, we address the issue of invalid machine-generated problems. We introduce a methodology that constructs complex TC problems from the ontological representations of simple problems, ensuring that the generated outputs are always valid. On the solving side, we use these ontology-validated complex problems to evaluate the problem-solving capabilities of two LLMs, ChatGPT and Gemini. We further demonstrate that our hybrid ontology- and machine learning (ML)-based solver achieves higher accuracy than both LLMs when solving these complex TC problems. We discuss the generation and solving aspects of our approach in the following sections.

Figure 3.

Overall system diagram of the proposed approach.

Figure 4.

RDF representation of the simple TC-AWP (Agent1: Stephen; Agent2: Daniel). The red circle represents the factual details of the transfer. The term tc represents the namespace of the TC-Ontology. Note. RDF = resource description framework; TC-AWP = transfer case-arithmetic word problem.

4.1. Generating Complex TC-AWPs

As mentioned earlier, the existing AWP datasets contain simple TC-AWPs. Therefore, we generate complex TC-AWPs as we plan to assess the solving capabilities of LLMs over these examples. The generation process makes use of the TC-Ontology.

Kumar and Sreenivasa Kumar (2024) extended the TC-Ontology to check the mathematical validity of the machine-generated TC-AWPs. Additionally, they showed a way to convert the single-transfer TC-AWPs into two-transfer TC-AWPs. In this work, we adopt a similar idea, use the RDF representations of the simple TC-AWPs, and generate complex TC-AWPs (up to four object transfers).

Generation Process: Since TC-Ontology models transfer (of objects) as a concept, it is possible to add more individuals of this kind. For example, in the simple TC-AWP given in Figure 1, the transfer-type sentence “Stephen gave 12 books to Daniel” is an individual of the concept transfer. To generate complex TC-AWPs, first, we add more hasTR-type and other associated edges (equivalent to adding more individuals of type “Transfer”) to the graphical representations of the simple TC-AWPs. For example, we show a graphical representation of a complex TC-AWP in Figure 5, which is obtained after adding edges to the graph shown in Figure 4.

Figure 5.

RDF representation of the complex TC-AWP (Agent1: Stephen; Agent2: Daniel; Agent3: Mike). To make the diagram simple and understandable, we do not show the edges representing factual details about the object transfers and the quantities owned by the agents. Note. RDF = resource description framework; TC-AWP = transfer case-arithmetic word problem.

The generation module takes as input the ontology representation ($O_{si}$) of a simple TC-AWP and the number of object transfers expected ($N_{tr}$) in the generated word problem, where $si$ denotes the $i^{th}$ simple TC-AWP. First, we generate triples representing additional BT-type sentences, as they are required to create an appropriate context for multiple object transfers. These additional BT-type sentences introduce a new agent and possibly a new object type (if the existing scenario has only one object type). We generate up to two new BT-type sentences (depending on how many BT-type sentences exist in the simple TC-AWP), ensuring that each generated problem contains three BT-type sentences. Based on the value of $N_{tr}$, the proposed system generates triples representing the additional object transfer(s). The additional triples ($T_{A}$), that is, the triples of the new BT-type and TR-type sentences, are then combined with the triples of the simple TC-AWP ($T(O_{si})$) to form $T(O_{ci})$, the triple representation of the complex TC-AWP, where $ci$ denotes the $i^{th}$ complex TC-AWP.

Generating triples of BT-type sentence: A BT-type sentence (e.g., Stephen has 12 books) requires one agent-name string, a number, and one object-type string. We extract the agent-name and object-type information from the problem texts of simple TC-AWPs. The agent-name string and the number can be randomly initialized. However, semantic similarity checking (w.r.t. the existing object-type strings) is required to assign the object-type string. We utilize the spaCy⁷ library and choose the string that is most semantically similar to the existing object-type strings (those that appear in the problem text of the simple TC-AWP). For instance, a complex TC-AWP generated from a simple TC-AWP that talks about books and pens should not introduce unrelated object types such as cars or bikes.

Generating triples of TR-type sentence(s): Our approach utilizes the structural information obtained from the triples of the first transfer and ABox information (such as agent-names and object-type) available in the ontology and generates triples of the additional transfer. Note that the information about the number of quantities held by the agents (involved in the additional transfer) is available in the existing graph; therefore, the quantity for the additional TR-type sentence is generated appropriately.
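A minimal sketch of how a feasible additional transfer could be drawn from the current holdings is given below. The data layout (a dictionary keyed by agent and object type) and the function names `generate_transfer` and `apply_transfer` are assumptions for illustration, not the authors' implementation. The key point it shows is that the transferred amount is bounded by what the giving agent currently holds, so the new transfer is always feasible.

```python
import random
from typing import Dict, Tuple

Holdings = Dict[Tuple[str, str], int]        # (agent, object_type) -> count
Transfer = Tuple[str, str, int, str]         # (giver, receiver, amount, object_type)


def generate_transfer(holdings: Holdings, rng: random.Random) -> Transfer:
    """Pick a giver that owns something and an amount no larger than its holding."""
    candidates = sorted(key for key, count in holdings.items() if count > 0)
    giver, obj = rng.choice(candidates)
    receivers = sorted({agent for (agent, _) in holdings if agent != giver})
    receiver = rng.choice(receivers)
    amount = rng.randint(1, holdings[(giver, obj)])  # feasibility bound
    return giver, receiver, amount, obj


def apply_transfer(holdings: Holdings, transfer: Transfer) -> Holdings:
    """Return the holdings after the transfer, leaving the input unchanged."""
    giver, receiver, amount, obj = transfer
    updated = dict(holdings)
    updated[(giver, obj)] -= amount
    updated[(receiver, obj)] = updated.get((receiver, obj), 0) + amount
    return updated


# Hypothetical starting context with three agents.
start: Holdings = {("Stephen", "books"): 17, ("Daniel", "books"): 10, ("Mike", "pens"): 4}
rng = random.Random(0)
first = generate_transfer(start, rng)
after_first = apply_transfer(start, first)
```

Generating each further transfer from the holdings returned by `apply_transfer` keeps a chain of transfers feasible, since every amount is checked against the already updated quantities.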

The sentences belonging to the simple TC-AWPs are available as annotations in the ontology. We convert the additionally generated triples into sentences using a template-based program script. We use one template for each sentence category. Note that the sequence numbers of the sentences belonging to the “simple TC-AWPs” are learned at the preprocessing stage and maintained in the ontology. However, these sequence numbers are adjusted once new sentences are generated. The newly generated BT-type sentence is placed at the beginning, and TR-type sentence(s) is/are placed after the existing TR-type sentence. Finally, we arrange all the sentences using the value of hasSequenceNumber property and form the complex TC-AWPs.
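The template step and the final ordering can be sketched as follows. The template strings are modeled on the example sentences used in this article, while the dictionary layout, the slot names, and the functions `render` and `assemble` are assumptions made for this illustration. Each sentence carries a sequence number playing the role of hasSequenceNumber, and the problem text is formed by sorting on it.

```python
from typing import Dict, List

# One template per sentence category (wording modeled on the article's examples).
TEMPLATES: Dict[str, str] = {
    "BT": "{agent} has {n} {obj}.",
    "TR": "{agent} gave {n} {obj} to {other}.",
    "AT": "Now {agent} has {n} {obj}.",
    "QS": "How many {obj} does {agent} have now?",
}


def render(sentence: dict) -> str:
    """Fill the template of the sentence's category with its slot values."""
    return TEMPLATES[sentence["type"]].format(**sentence["slots"])


def assemble(sentences: List[dict]) -> str:
    """Order sentences by sequence number and join them into a problem text."""
    ordered = sorted(sentences, key=lambda s: s["seq"])
    return " ".join(render(s) for s in ordered)


# Sentences deliberately listed out of order; `seq` restores the intended order.
sentences = [
    {"type": "QS", "seq": 4, "slots": {"agent": "Stephen", "obj": "books"}},
    {"type": "BT", "seq": 1, "slots": {"agent": "Stephen", "n": 17, "obj": "books"}},
    {"type": "TR", "seq": 3,
     "slots": {"agent": "Daniel", "n": 2, "obj": "books", "other": "Mike"}},
    {"type": "TR", "seq": 2,
     "slots": {"agent": "Stephen", "n": 5, "obj": "books", "other": "Daniel"}},
]
problem_text = assemble(sentences)
```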

Quality of the generated examples: The existing approaches use the following measures to assess the quality of the generated examples: (a) Language Quality Measures: Deep learning-based approaches primarily follow the idea of “predict the next word given the previous few words,” and for assessing the quality of the generated text, they adopt metrics such as BLEU-4 (Papineni et al., 2002), METEOR (Lavie & Agarwal, 2007), and ROUGE-L (Lin, 2004). Since the proposed approach is ontology-based and does not generate text word by word, we do not use these metrics. Instead, we focus on another quality measure, namely mathematical validity. (b) Mathematical Validity Measure: As mentioned in the introduction section, a generated problem is considered valid if it contains sufficient information to answer the posed QS. We primarily focus on the mathematical validity aspect, which is discussed in the following.

The generated problems are valid: As previously discussed, the information extracted from the simple TC word problems (which are all valid) is represented as an RDF graph (i.e., the ABox of the ontology). To generate complex TC word problems, the ontological representation of a word problem needs to be expanded. This expansion involves adding more triples to represent additional BT and TR types of sentences. The generated problems are grammatically correct because the newly created sentences are template-driven. Additionally, the structure of existing triples is used to generate new, similar triples. Our generation process is supported by a domain ontology, ensuring that every generated example is valid by leveraging the structural information from the existing triples. To ensure the problem is also mathematically correct, we generate appropriate quantities for BT and TR types of sentences. Consequently, the new edges (i.e., triples) representing an additional object transfer involve only a feasible object transfer scenario. We show a conceptual illustration in Figure 6. Also, in Figure 7, we show how to generate triples denoting an additional object transfer by leveraging information from existing triples.

Figure 6.

How the generated transfer case (TC)-word-problems can become invalid.

Figure 7.

Semantic Web Rule Language (SWRL) rule showing the generation of additional object transfer.

4.2. Solving Complex TC-AWPs

Algorithm 1 shows the pseudo-code of the proposed solver. It takes as input the TC-Ontology $O$ and the $i^{th}$ complex TC-AWP (WP$_{i}$). The Split-Sentences() function (line 4) splits the word problem into sentences and stores them in the set S. The classifier (line 6) takes each word problem sentence and predicts its class label (i.e., BT, TR, AT, or QS). The text classification module achieves 100% accuracy for two reasons: (a) we use the OpenAI API, which provides integration of powerful LLMs with the Scikit-learn library; specifically, we leveraged the zero-shot text classifier with the OpenAI model gpt-3.5-turbo; and (b) sentences belonging to the four categories mentioned above have different structures, and sufficient examples are available for training the classifier. Based on the type (label) of the sentence, the system first extracts relevant information from the sentences and populates it into the RDF graph (also called the ABox). This information involves the details about the agents and the quantities they own, the fine details about a transfer (such as who is losing, who is gaining, the exact amount of the quantity, etc.), and what is asked in the QS.

Pop-Onto() is a BERT-based language model trained to pick sentence parts (agent names, etc.) from the word problem text. For ABox extraction from simple TC-AWPs, OLGA (Kumar & Sreenivasa Kumar, 2024) trained and deployed BERT-based language models (one for each sentence category). Since the number of sentence categories is the same in the simple and complex problems, we adopt the BERT-based model proposed in OLGA and leverage it for extracting ABox information from the problem texts of complex TC-AWPs. Compound sentences (such as Stephen has 5 books and 10 pens) are converted into two simple sentences using subject distribution at the preprocessing stage.
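The subject-distribution step mentioned above can be sketched with a regular expression. The pattern below is an assumption that covers only ownership sentences of the form "X has A and B"; the paper does not specify how its preprocessing is implemented, and the function name `distribute_subject` is hypothetical.

```python
import re

# Matches "<subject> has|had <first item> and <second item>", optional final period.
_COMPOUND = re.compile(r"^(\w+) (has|had) (.+?) and (.+?)\.?$")

def distribute_subject(sentence):
    """Split 'S has A and B' into ['S has A.', 'S has B.'].

    Sentences that do not match the pattern are returned unchanged.
    """
    text = sentence.strip()
    match = _COMPOUND.match(text)
    if match is None:
        return [text]
    subject, verb, first, second = match.groups()
    return [f"{subject} {verb} {first}.", f"{subject} {verb} {second}."]

simple_sentences = distribute_subject("Stephen has 5 books and 10 pens")
```

Here `simple_sentences` holds the two simple sentences "Stephen has 5 books." and "Stephen has 10 pens.", each of which can then be classified and processed on its own.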

We develop an SWRL rule to perform reasoning in multiple object transfer situations. The value of the hasTransferSequence property (i.e., $l$) represents the position of the object transfer in the sequence. Sync-Reasoner (Algorithm 1, Step 10) is a core function in the Owlready2 Python library that triggers the ontology’s reasoning process. Since SWRL rules are integrated within the ontology, the reasoning process also considers the constructs defined by these rules. At each iteration, the hasUpdatedValue(q, v) atoms (refer to Figure 8) are used to update the corresponding quantValue(q, v) atoms before the SWRL rules are reapplied. Figure 8 presents the SWRL rule and its explanation. $O_{il}$ represents the ontology state after the $l^{th}$ transfer. We leverage the Pellet ontology reasoner (Sirin et al., 2007) in our solution. We issue a SPARQL query (based on what the QS sentence asks) once $O_{il}$ has been updated after the $n^{th}$ object transfer.
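The update cycle described above can be emulated in plain Python to show why the transfer sequence matters. This sketch is not the SWRL rule of Figure 8 and does not call Owlready2 or Pellet; the `Transfer` class, the `apply_transfers` function, and the sample agents are hypothetical stand-ins for the hasTransferSequence, hasUpdatedValue, and quantValue atoms.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    sequence: int   # plays the role of hasTransferSequence (the index l)
    loser: str
    gainer: str
    obj: str
    amount: int

def apply_transfers(quant_value, transfers):
    """Apply transfers in sequence order and return the final state.

    quant_value maps (agent, object type) -> count, mirroring quantValue(q, v).
    """
    state = dict(quant_value)
    for t in sorted(transfers, key=lambda tr: tr.sequence):
        lost = state.get((t.loser, t.obj), 0) - t.amount
        gained = state.get((t.gainer, t.obj), 0) + t.amount
        if lost < 0:
            raise ValueError(f"transfer {t.sequence} is infeasible for {t.loser}")
        # The computed values stand in for hasUpdatedValue(q, v); writing them
        # back corresponds to refreshing quantValue before the next iteration.
        state[(t.loser, t.obj)] = lost
        state[(t.gainer, t.obj)] = gained
    return state

initial = {("Ann", "pens"): 10, ("Bob", "pens"): 3}
# Listed out of order on purpose: Bob can give 4 pens only after receiving 5.
moves = [Transfer(2, "Bob", "Ann", "pens", 4), Transfer(1, "Ann", "Bob", "pens", 5)]
final = apply_transfers(initial, moves)
```

Sorting by the sequence value before applying each update is what keeps the intermediate states consistent; applying the second transfer first would leave Bob with a negative count.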

Figure 8.

Semantic Web Rule Language (SWRL) rule to affect object transfer.

5. Experimental Setup and Results

The key experimental subtasks in this work are: (i) sentence classification and (ii) information extraction for ontology population (ABox construction). Both components are essential for generating and solving complex TC-AWPs. Sentence classification is discussed in detail in Section 4.2. In this section, we present the results of ABox extraction and the overall word-problem-solving performance. All experiments were conducted on a macOS system equipped with 16 GB RAM and an Apple M1 processor. For integrating LLMs such as GPT-3.5 into the Scikit-learn workflow to support the sentence-classification task, we used Google Colab.⁸ ABox extraction was performed using BERT-based language models. Ontology development and editing were carried out using the Python Owlready2 library and the Protégé⁹ tool.

Since the proposed approach is different from a typical ML/DL system, we provide details about reproducing the results in Appendix C. The datasets, code, and all related resources are available at the following GitHub repository: https://github.com/projects-by-sk/phd-rp3.1/.

5.1. Dataset

We use the dataset AllArith-Tr (Kumar & Kumar, 2024), which contains simple TC-AWPs, and generate the complex TC-AWPs using the proposed ontological approach. We used only 200 problems (50 for each $N_{tr}$ value, see Figure 9) for the assessment, as responses from the LLMs were manually inspected to check for natural language understanding (NLU) and reasoning failures (see Section 5.3). The average word counts are as follows: 44 (when $N_{tr} = 2$), 56 (when $N_{tr} = 3$), and 63 (when $N_{tr} = 4$). The proposed system can convert a simple TC-AWP into a complex TC-AWP. Therefore, the total number of complex TC-AWPs the proposed system can generate depends on the number of simple TC-AWPs given as input.

Figure 9.

Comparing the solving accuracy of the proposed system with that of the LLMs. $N_{tr}$ represents the number of object transfers in TC-AWPs. Note. LLM = large language model; TC-AWP = transfer case-arithmetic word problem.

5.2. ABox Extraction

As mentioned in Section 4.2, we use BERT-based language models for extracting the ABox information for TC-Ontology (the idea was proposed in OLGA; Kumar & Sreenivasa Kumar, 2024). OLGA used 1K and 2K TC-AWP sentences for training (these word problems included a single object transfer only) and achieved 76% ABox prediction accuracy (i.e., for 76% of TC-AWPs, the ABox was extracted correctly; we refer to this as joint accuracy). However, increasing the number of object transfers results in a longer problem text, which lowers the joint accuracy. In Algorithm 1, we name the ABox prediction task Pop-Onto(), as the proposed system needs to populate the ABox into the ontology once it has extracted the information from the word problem text. On complex TC-AWPs, we achieve the following accuracy: 72% (#object-transfers = 2), 66% (#object-transfers = 3), and 58% (#object-transfers = 4).
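Joint accuracy, as defined above, counts a problem as correct only when its whole ABox is extracted correctly. A minimal sketch of that metric follows, assuming each ABox is represented as a collection of (subject, property, value) triples; the function name `joint_accuracy` and the sample triples are hypothetical.

```python
def joint_accuracy(predicted, gold):
    """Fraction of problems whose predicted ABox matches the gold ABox exactly.

    predicted and gold are parallel lists; each item is an iterable of
    (subject, property, value) triples for one word problem.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same problems")
    if not gold:
        return 0.0
    exact = sum(1 for p, g in zip(predicted, gold) if set(p) == set(g))
    return exact / len(gold)

gold = [
    [("Q1", "quantValue", 10), ("Q1", "isOwnedBy", "John")],
    [("Q2", "quantValue", 5), ("Q2", "isOwnedBy", "Jane")],
]
predicted = [
    [("Q1", "isOwnedBy", "John"), ("Q1", "quantValue", 10)],  # same triples, other order
    [("Q2", "quantValue", 5), ("Q2", "isOwnedBy", "John")],   # one wrong triple
]
score = joint_accuracy(predicted, gold)
```

A single wrong triple makes the whole problem count as a miss, which is consistent with the observation that longer problem texts, having more triples to extract, lower the joint accuracy.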

Since ontology-based reasoning is built on deterministic constructs and SWRL rules, once sentence parts are correctly mapped to the ontology classes (i.e., successful ABox construction), the system deterministically produces the correct solution. Note that, given a correct ABox, an ontology-based solver always does the correct reasoning, provided domain knowledge is also encoded appropriately.

5.3. Solving Complex TC-AWPs

The results (Figure 9) indicate that as the complexity of the examples increases, the solving accuracy of LLMs declines sharply. In contrast, the proposed approach maintains strong performance across all complexity levels. This robustness arises from its use of explicit domain knowledge during the solving process, enabling it to handle complex reasoning more reliably. Next, we discuss the failed cases of the LLMs.

Analyzing Gemini and ChatGPT Results: We evaluated the systems along two dimensions: NLU and mathematical reasoning. For NLU, we examined the semantic and pragmatic adequacy of the problem interpretations produced by the LLMs, and for reasoning, we assessed whether the models performed the correct mathematical inference over the interpreted representations. Our manual analysis of the failed cases revealed that ChatGPT consistently processed the problem text correctly but failed primarily at the reasoning stage. In contrast, Gemini exhibited notable NLU deficiencies: on average (across $N_{tr}$ = 2, 3, and 4), it misinterpreted the problem text in approximately 34% of the failed cases, which subsequently led to incorrect reasoning. Additionally, both LLMs produced inconsistent answers across multiple runs for the same failed examples, indicating instability in their internal reasoning pathways. Table 2 reflects this breakdown; for instance, for $N_{tr} = 2$, Gemini’s failures were attributable to misunderstanding the problem text in 22.22% of cases, while the remaining 77.77% resulted solely from incorrect reasoning.

Table 2.
Gemini Versus ChatGPT: We Analyze the Failed Cases (Examples for Which LLMs Gave Wrong Answers) and Report What Percentage of These Examples Failed Due to NLU and Reasoning.

	Failing at NLU		Failing at reasoning
$N_{tr}$	Gemini	ChatGPT	Gemini	ChatGPT
2	22.22	0	77.77	100
3	35.48	0	64.51	100
4	42.10	0	57.89	100

Note. LLM = large language model; NLU = natural language understanding.

6. Limitations of the Proposed Approach

While encoding and utilizing domain knowledge offer several advantages, they also present some limitations. During the generation process, the encoded domain knowledge ensures the creation of mathematically valid examples. However, because the proposed system converts ontology triples (as part of the solution) into English text using predefined templates, language diversity may be diminished. Addressing this limitation will be a focus of our future work. Another potential limitation concerns extending the proposed approach to other types of AWPs. This would require the development of a separate ontology for the target domain. Nonetheless, the ML and deep learning (DL) based submodules of the solution are adaptable and can be extended to other types of AWPs.

7. Conclusions

We investigate the effectiveness of LLMs in solving complex TC-AWPs. Although the latest SOTA LLMs, such as ChatGPT and Gemini, demonstrate remarkable proficiency in comprehending natural language, they encounter significant difficulties when it comes to solving these complex TC word problems. To address these challenges, we employ a strategy that incorporates domain knowledge, which is encoded through a domain ontology and SWRL rules. This approach not only aids in generating valid complex word problems (from the ontological representation of simple problems) but also helps in effectively solving these complex examples. The ontology-based modeling proved effective in tackling complex word problems. The idea of utilizing domain knowledge can be extended to more sophisticated and challenging domains.

Footnotes

Acknowledgement

The authors thank IIT Madras for permitting the use of contingency funds to cover the APC charges.

Authors’ Note

Most of the work was carried out at IIT Madras.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

ORCID iD

Suresh Kumar

Notes

Appendix A. Ontology: Details and Its Use in KLAUS-Tr and OLGA Systems

An ontology is a structured representation of knowledge that defines the concepts within a specific domain and the relationships between them. Ontologies typically consist of classes (representing concepts or categories), properties (depicting relationships between classes), and instances (individual members of classes). By organizing information in a hierarchical and interconnected manner, ontologies facilitate better knowledge management, interoperability, and reasoning about the entities and their interactions within a given domain.

Two major components of an ontology are the TBox and the ABox. The TBox defines the vocabulary, concepts, and their relationships within a specific domain, establishing a structured framework for understanding and representing knowledge. The ABox, on the other hand, captures instances and specific data related to those instances, providing a means to describe individual objects and their properties within the defined concepts. These components together form a comprehensive ontology, facilitating effective knowledge representation and sharing.

Axioms are logical statements of the TBox that say what is true in an application domain. In the following, we mention the important axioms devised for the TC-AWP domain. Here, A.01 to A.04 are concept inclusion axioms, whereas A.05 and A.06 are concept equivalence axioms.

A.01: ∃ hasQuant. TC-Quantity ⊑ Agent

(Anyone who owns a TC-Quantity is an agent)

A.02: TC-Quantity ⊑ PositiveQuantity

(Every TC-Quantity is a positive quantity)

A.03: MinuendQuantity ⊑ TC-Quantity

A.04: SubtrahendQuantity ⊑ TC-Quantity

A.05: MinuendQuantity ≡ (TC-Quantity ⊓ ∃ isOwnedBy. Agent)

(A.05 expresses “Minuend quantity is a TC quantity that is owned by an agent”)

A.06: SubtrahendQuantity ≡ (TC-Quantity ⊓ ∃ isGainedBy. Agent ⊓ ∃ isLostBy. Agent)

(A.06 expresses “Subtrahend quantity is a TC quantity that is gained by an agent and lost by an agent”)

Note that the properties isOwnedBy, isGainedBy, and isLostBy are inverse properties of the object properties hasQuant, hasGained, and hasLost, respectively. Axioms A.05 and A.06 infer the minuend and subtrahend quantities of a subtraction operation, respectively.
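The effect of axioms A.05 and A.06 can be mimicked with simple membership checks. The sketch below is a closed-world emulation in plain Python, not a description-logic reasoner, so it only approximates what Pellet infers from the TBox; the function `infer_quantity_roles`, the dictionary-based property assertions, and the individuals `q_owned` and `q_moved` are hypothetical.

```python
def infer_quantity_roles(quantity, tc_quantities, is_owned_by, is_gained_by, is_lost_by):
    """Return the role names a quantity would receive under A.05 and A.06.

    is_owned_by, is_gained_by, and is_lost_by map a quantity to an agent,
    standing in for the corresponding object property assertions.
    """
    roles = set()
    if quantity not in tc_quantities:
        return roles  # both axioms require the individual to be a TC-Quantity
    if quantity in is_owned_by:
        roles.add("MinuendQuantity")       # A.05: owned by some agent
    if quantity in is_gained_by and quantity in is_lost_by:
        roles.add("SubtrahendQuantity")    # A.06: gained by and lost by some agent
    return roles

tc_quantities = {"q_owned", "q_moved"}
is_owned_by = {"q_owned": "John"}
is_gained_by = {"q_moved": "Jane"}
is_lost_by = {"q_moved": "John"}

owned_roles = infer_quantity_roles("q_owned", tc_quantities, is_owned_by, is_gained_by, is_lost_by)
moved_roles = infer_quantity_roles("q_moved", tc_quantities, is_owned_by, is_gained_by, is_lost_by)
```

In this toy ABox, the owned quantity is labeled as the minuend and the transferred quantity as the subtrahend, which are the two operands of the subtraction the solver performs for the losing agent.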

In Section 3, we presented a summary of the vocabulary of the TC-Ontology, and some important axioms are mentioned above. The detailed information of the TC-Ontology is available in KLAUS-Tr (Kumar & Kumar, 2024). Both KLAUS-Tr and OLGA systems process word problems sentence-wise. Based on the type of sentence, these systems extract important information and populate the ontology ABox. An example ABox extracted from a word problem is given in Figure A.1. Here, P1 represents an example word problem, Q1, Q2, and Q3 represent quantities, and T1 represents the transfer. The data property assertions for quantities Q1, Q2, and Q3 (using the quantValue and quantType data properties) are straightforward and thus not shown.

Appendix B. Solving Complex TC-AWPs With ChatGPT and Gemini

The results on an example TC-AWP involving three object transfers are given below:

Q1. Mike has 14 cars and 22 bikes. John has 10 bikes and 2 cars. Jane has only 5 bikes. John gave 2 bikes to Jane and took 5 cars from Mike. John gave 2 bikes to Jane. How many bikes does John have now?

Correct Answer: 6

(Both ChatGPT and Gemini gave the answer as 8)
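The stated correct answer can be checked by hand: only the two transfers of bikes from John to Jane change John's bike count, while the car transfer leaves it untouched.

$$
\text{bikes}_{\text{John}} = 10 - 2 - 2 = 6
$$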

Response by ChatGPT:

Response by Gemini:

The results on an example TC-AWP involving four object transfers are given below:

Q2. Jill picked 34 seashells from the beach, and Jane picked 78 seashells. They visited Tim’s house, and Jill gave 13 seashells to Tim. Initially, Tim had 45 seashells. Jane gave 45 seashells to Tim. The next day, Jill took 23 seashells from Tim, and Jane gave 12 seashells to Tim. How many more seashells does Tim have than Jane?

Correct Answer: 71

(Gemini and ChatGPT gave answers 14 and 21, respectively)
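The stated correct answer can be verified by tracking Tim's and Jane's seashells through the four transfers given in the problem text:

$$
\begin{aligned}
\text{Tim} &= 45 + 13 + 45 - 23 + 12 = 92 \\
\text{Jane} &= 78 - 45 - 12 = 21 \\
\text{Tim} - \text{Jane} &= 92 - 21 = 71
\end{aligned}
$$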

Response by Gemini:

Response by ChatGPT:

Appendix C. Reproducing the Results

Note that the implementation of the proposed system is different from a typical ML/DL model. There are three key components of the proposed system: (A) sentence classification, for which we use zero-shot text classification with the OpenAI GPT-3.5-turbo model; (B) extraction of important information from the word problem sentences, for which we use BERT-based language models; and (C) ontology editing, for which we use the Protégé tool and the Owlready2 Python library.

The overall process flow of the proposed solver is as follows. The sentences of the word problem at hand are labeled using the sentence-classification module. We deploy BERT-based language models (we use the architecture proposed by the OLGA system) to extract important information from the sentences (based on the labels). This information is populated into the ontology using the Owlready2 Python library (populating the ABox of the ontology). A domain ontology has two components: TBox and ABox. Using the Protégé tool, we encode the domain knowledge about transfer-type word problems (TBox of the ontology). To solve a given word problem, we utilize the encoded domain knowledge, ABox information, and SWRL rules (we develop these rules and make them available inside the ontology, under the SWRL tab) that update the state of the ontology (i.e., compute the effects of the object transfer). Therefore, building two splits (train and test) is relevant to the components/modules (A) and (B) only, which are similar to the modules used in the existing systems KLAUS-Tr (Kumar & Kumar, 2024) and OLGA (Kumar & Sreenivasa Kumar, 2024), respectively. Therefore, we skip these details.

Appendix D. Abbreviations

(A) Key abbreviations in the AWP domain or introduced in this work are:

(B) Key ontology and ML abbreviations used in this work are:

References

Baader, Calvanese, McGuinness, D. L., Nardi, & Patel-Schneider, P. F. (2010). The description logic handbook: Theory, implementation and applications (2nd ed.). Cambridge University Press.

Bechhofer, van Harmelen, Hendler, Horrocks, McGuinness, Patel-Schneider, & Stein, L. A. (2004). OWL Web Ontology Language Reference. World Wide Web Consortium (W3C) Recommendation. http://www.w3.org/TR/owl-ref/.

Brickley, & Guha (2004). RDF Vocabulary Description Language 1.0: RDF Schema. World Wide Web Consortium (W3C) Recommendation. http://www.w3.org/TR/2004/REC-rdf-schema-20040210/.

Brown, Mann, Ryder, Subbiah, Kaplan, J. D., Dhariwal, Neelakantan, Shyam, Sastry, Askell, Agarwal, Herbert-Voss, Krueger, Henighan, Child, Ramesh, Ziegler, Winter, Chen, & Amodei (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 1877–1901). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Chen, Tworek, Jun, Yuan, de Oliveira Pinto, H. P., Kaplan, Edwards, Burda, Joseph, Brockman, Ray, Puri, Krueger, Petrov, Khlaaf, Sastry, Mishkin, Chan, Gray, Pavlov, & Zaremba (2021). Evaluating large language models trained on code.

Chowdhery, Narang, Devlin, Bosma, Mishra, Roberts, Barham, Chung, H. W., Sutton, Gehrmann, Schuh, Shi, Tsvyashchenko, Maynez, Rao, Barnes, Tay, Shazeer, Prabhakaran, & Fiedel (2022). PaLM: Scaling language modeling with pathways.

Horrocks, Patel-Schneider, P. F., Bechhofer, & Tsarkov (2005). OWL rules: A proposal and prototype implementation. Journal of Web Semantics, 3(1), 23–40. https://doi.org/10.1016/j.websem.2005.05.003. http://www.sciencedirect.com/science/article/pii/S1570826805000053.

Huang, Shi, Lin, C. Y., Yin, & Ma, W. Y. (2016). How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th annual meeting of the association for computational linguistics (Volume 1: Long papers) (pp. 887–896). Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1084. https://www.aclweb.org/anthology/P16-1084.

Klyne, & Carroll, J. J. (2004). Resource description framework (RDF): Concepts and abstract syntax. W3C Recommendation. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.

Koncel-Kedziorski, Konstas, Zettlemoyer, & Hajishirzi (2016a). A theme-rewriting approach for generating algebra word problems. In Proceedings of the 2016 conference on empirical methods in natural language processing (pp. 1617–1628). Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1168. https://aclanthology.org/D16-1168.

Koncel-Kedziorski, Roy, Amini, Kushman, & Hajishirzi (2016b). MAWPS: A math word problem repository. In Proceedings of the 2016 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (pp. 1152–1157). Association for Computational Linguistics. https://doi.org/10.18653/v1/N16-1136. https://www.aclweb.org/anthology/N16-1136.

Kumar, & Kumar, P. S. (2024). KLAUS-Tr: Knowledge & learning-based unit focused arithmetic word problem solver for transfer cases. Natural Language Engineering, 30(1), 96–131. https://doi.org/10.1017/S1351324922000511.

Kumar, & Sreenivasa Kumar (2024). Validity checking and repairing of machine generated transfer-type word problems. Applied Ontology, 19(4), 368–388. https://doi.org/10.1177/15705838241303829.

Lavie, & Agarwal (2007). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the second workshop on statistical machine translation (pp. 228–231). Association for Computational Linguistics. https://aclanthology.org/W07-0734.

Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out (pp. 74–81). Association for Computational Linguistics. https://aclanthology.org/W04-1013.

Liu, Fang, Ding, & Liu (2021). Mathematical word problem generation from commonsense knowledge graph and equations. In M. F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 4225–4240). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.348. https://aclanthology.org/2021.emnlp-main.348.

Luo, Sun, Zhao, Lou, Tao, Geng, Lin, Chen, & Zhang (2023). WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.

Motik, Sattler, & Studer (2005). Query answering for OWL-DL with rules. Web Semantics, 3(1), 41–60. https://doi.org/10.1016/j.websem.2005.05.001.

Niyarepola, Athapaththu, Ekanayake, & Ranathunga (2022). Math word problem generation with multilingual language models. In S. Shaikh, T. Ferreira, & A. Stent (Eds.), Proceedings of the 15th international conference on natural language generation (pp. 144–155). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.inlg-main.12. https://aclanthology.org/2022.inlg-main.12.

Papineni, Roukos, Ward, & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the association for computational linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135. https://aclanthology.org/P02-1040.

Polozov, O’Rourke, Smith, A. M., Zettlemoyer, Gulwani, & Popovic (2015). Personalized mathematical word problem generation. In Proceedings of the 24th international conference on artificial intelligence, IJCAI’15 (pp. 381–388). AAAI Press.

Prud’hommeaux, & Seaborne (2008). SPARQL query language for RDF. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query/.

Qin, Liu, Huang, Zhang, Liu, Jin, & Chen (2023). A mathematical word problem generator with structure planning and knowledge enhancement. In Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval, SIGIR’23 (pp. 1750–1754). Association for Computing Machinery. https://doi.org/10.1145/3539618.3591937.

Rae, J. W., Borgeaud, Cai, Millican, Hoffmann, Song, H. F., Aslanides, Henderson, Ring, Young, Rutherford, Hennigan, Menick, Cassirer, Powell, van den Driessche, Hendricks, L. A., Rauh, Huang, Welbl, & Irving (2021). Scaling language models: Methods, analysis & insights from training Gopher. CoRR abs/2112.11446. https://arxiv.org/abs/2112.11446.

Robert, James, Sandra, Richard, & Allan (2011). Automating generation of textual class definitions from OWL to English. Journal of Biomedical Semantics, 2(Suppl. 2), S5. https://doi.org/10.1186/2041-1480-2-S2-S5.

Roy, & Roth (2017). Unit dependency graph and its application to arithmetic word problem solving. In Proceedings of the thirty-first AAAI Conference on Artificial Intelligence, AAAI’17 (pp. 3082–3088). AAAI Press.

Roy, & Roth (2018). Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics, 6, 159–172. https://transacl.org/ojs/index.php/tacl/article/view/1319.

Sirin, Parsia, Grau, B. C., Kalyanpur, & Katz (2007). Pellet: A practical OWL-DL reasoner. Web Semantics, 5(2), 51–53. https://doi.org/10.1016/j.websem.2007.03.004.

Tay, Dehghani, Tran, V. Q., Garcia, Wei, Wang, Chung, H. W., Shakeri, Bahri, Schuster, Zheng, H. S., Zhou, Houlsby, & Metzler (2023). UL2: Unifying language learning paradigms.

Thoppilan, Freitas, D. D., Hall, Shazeer, Kulshreshtha, Cheng, Jin, Bos, Baker, Lee, Zheng, H. S., Ghafouri, Menegali, Huang, Krikun, Lepikhin, & Qin (2022). LaMDA: Language models for dialog applications. CoRR abs/2201.08239. https://arxiv.org/abs/2201.08239.

Touvron, Martin, Stone, Albert, Almahairi, Babaei, Bashlykov, Batra, Bhargava, Bhosale, Bikel, Blecher, Ferrer, C. C., Chen, Cucurull, Esiobu, Fernandes, Gao, & Scialom (2023). Llama 2: Open foundation and fine-tuned chat models.

Wang, Zhang, Gao, Song, Guo, & Shen, H. T. (2018b). MathDQN: Solving arithmetic word problems via deep reinforcement learning. In S. A. McIlraith & K. Q. Weinberger (Eds.), Proceedings of the thirty-second AAAI conference on artificial intelligence (AAAI-18), the 30th innovative applications of artificial intelligence (IAAI-18), and the 8th AAAI symposium on educational advances in artificial intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018 (pp. 5545–5552). AAAI Press. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16749.

Wang, Lan, & Baraniuk (2021). Math word problem generation with mathematical consistency and problem context constraints. In M. F. Moens, X. Huang, L. Specia, & S. W. Yih (Eds.), Proceedings of the 2021 conference on empirical methods in natural language processing (pp. 5986–5999). Association for Computational Linguistics. https://aclanthology.org/2021.emnlp-main.484.

Wei, Wang, Schuurmans, Bosma, Ichter, Xia, Chi, Le, Q. V., & Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in neural information processing systems (Vol. 35, pp. 24824–24837). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf.

Yue, Zhang, Huang, Sun, & Chen (2023). MAmmoTH: Building math generalist models through hybrid instruction tuning.

Zhang, Wang, Dai, B. T., & Shen, H. T. (2018). The gap of semantic parsing: A survey on automatic math word problem solvers. CoRR abs/1808.07290. http://arxiv.org/abs/1808.07290.

Zhou, & Huang (2019). Towards generating math word problems from equations and topics. In Proceedings of the 12th international conference on natural language generation (pp. 494–503). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-8661. https://aclanthology.org/W19-8661.

Zong, & Krishnamachari (2023). Solving math word problems concerning systems of equations with GPT-3. Proceedings of the AAAI Conference on Artificial Intelligence, 37(13), 15972–15979. https://ojs.aaai.org/index.php/AAAI/article/view/26896.

Zou (2019). Text2Math: End-to-end parsing text into math expressions. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP–IJCNLP) (pp. 5327–5337). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1536. https://www.aclweb.org/anthology/D19-1536.


Generating and Solving Complex Transfer Type Arithmetic Word Problems
