Abstract
In this article, we consider the problem of understanding the physical properties of unseen objects through interactions between the objects and a robot. Handling unseen objects with special properties such as deformability is challenging for traditional task and motion planning approaches, which typically rely on the closed-world assumption. Recent results in large-language model (LLM)-based task planning have shown the ability to reason about unseen objects. However, most studies assume rigid objects and overlook their physical properties. We propose an LLM-based method for probing the physical properties of unseen deformable objects for the purpose of task planning. For a given set of object properties (e.g. foldability, bendability), our method determines the properties through robot actions that interact with the objects. Based on the properties examined by the LLM and robot actions, the LLM generates a task plan for a specific domain such as object packing. In the experiments, we show that the proposed method can identify the properties of deformable objects, which are then used in a bin-packing task where these properties play a crucial role in success.
Keywords
Introduction
For robots operating in unseen and unstructured environments, the ability to understand their surroundings is crucial for autonomous execution. One representative approach for achieving autonomy is task planning. Classical artificial intelligence planning aims to create a sequence of transitions to achieve a predefined goal by abstracting the semantic information of objects and actions into states. Such planners often use formalized languages such as the Stanford Research Institute Problem Solver (STRIPS) 1 or the Planning Domain Definition Language (PDDL). 2 Many recent works use large-language models (LLMs) for task planning, which offer the advantages of commonsense knowledge and broad comprehension.3–5 LLMs have shown remarkable performance in understanding the physical relationships of objects in context,6,7 which allows them to be used in robotic task and motion planning. 8
LLM-based task planning is particularly effective for robots operating in unstructured real-world environments such as domestic settings. However, existing task planning methods (e.g. Huang et al., 5 Shirai et al., 9 Zhao et al. 10 ) assume known objects. While there have been a few recent works on LLM-based task planning with unseen objects,10,11 they do not consider the deformability of objects. Since understanding deformability requires a large amount of data to learn12–14 or an analytic model of object dynamics,15–17 existing methods restrict their coverage to a limited set of objects18,19 or learn how to manipulate deformable objects in an end-to-end fashion 20 at the control level. To the best of our knowledge, no existing work can reason about the properties of unseen deformable objects and use them in long-horizon task planning.
In this article, we propose an LLM-based method to infer the physical properties of unseen objects (e.g. compressibility, bendability, plastic deformability 21 ) based on simple interactions between an object and a robot. In addition, we show that the discovered properties can be used to generate high-level task plans for a downstream task such as object bin-packing.
As shown in Figure 1, our method consists of (i) the Property Reasoner that detects objects and understands their properties using a visual-language model (VLM) 22 along with an LLM (i.e. GPT-4o), (ii) the Domain Generator which produces actions for task planning and predicates describing object properties, (iii) the Instance Descriptor which generates a problem instance including the initial and goal states, and (iv) the Task Planner that generates a task plan to perform a downstream task where errors in the plan are detected by the Plan Validator.

An overview of the proposed method for understanding physical properties of unseen objects by using the commonsense knowledge of an LLM and the interactability of robots. LLM: large-language model.
The main contributions include:
- We propose a method that leverages the commonsense knowledge of LLMs to understand the physical properties of unseen deformable objects by observing a robot interacting with the objects.
- We present a protocol outlining how the robot interacts with objects, along with LLM prompts that help identify their properties.
- We develop a pipeline for fully automated task planning that adapts to the discovered properties of previously unseen objects.
Related work
Understanding and planning with previously unseen deformable objects require combining two areas: (i) LLM-based robotic task planning and (ii) physical property inference through perception or interaction. While LLMs excel at reasoning about object relations, existing approaches do not acquire the physical attributes required for long-horizon manipulation. Accordingly, this article reviews work on LLM-based robotics, deformable-object manipulation, and domain-knowledge generation. In addition, we clarify why current approaches cannot support task planning for unseen deformable objects.
LLM-based robotic tasks
Research on leveraging LLMs for robotics spans several major directions, including perception and grounding, interactive perception, policy and skill generation, code and tool use, safety and verification, and planning. Perception and grounding research examines how robots link language to objects and scenes,23,24 while interactive perception explores physical interaction to reveal latent object properties that cannot be inferred from passive observation.25,26 Code and tool use focuses on generating structured function calls or task scripts to extend robot abilities.27,28 Safety and verification evaluate whether LLM-guided actions remain safe and constraint-compliant.8,29 Finally, LLM-based planning investigates how language models generate high-level task structures and symbolic action sequences.30,31
Perception and grounding research converts sensory data into scene representations using VLMs to interpret objects, relations, and task-relevant scene elements. Several approaches combine language knowledge with pretrained robotic skills or value functions to ensure that high-level commands remain feasible,32,33 and other methods train models that jointly handle observations, instructions, and action tokens in a unified format to improve generalization and long-horizon reasoning.34,35 Another direction uses hierarchical control where a high-level LLM coordinates lower-level modules and incorporates feedback from their execution results to refine subsequent steps. 36 However, reliance on static visual inputs and category-level supervision restricts the ability to infer detailed physical properties. This limitation motivates interactive perception, where robots actively probe objects to obtain richer physical signals.
To overcome these limitations, recent interactive perception research integrates LLMs with robot interaction to acquire physical information that static images cannot reveal. For example, the work of Sripada et al. 25 makes use of LLMs to detect physical properties such as surface stiffness, material type, and contact responses through robotic interactions with visual, tactile, and auditory sensing. Another method employs an LLM to explore three-dimensional (3D) environments, where the model selects the targets and the controller performs interactive actions to reveal shapes, object boundaries, and orientations. 37 Interactive manipulation frameworks leverage LLM-based reasoning to identify affordance regions and probing strategies that reveal object geometry, poses, and properties that influence manipulation outcomes.38,39 Despite these advances, most approaches estimate properties for short-horizon control. Unlike prior work, we focus on interaction-derived physical probing to support long-horizon symbolic reasoning.
Skill and code generation research investigates how LLMs produce executable robot behaviors through pretrained skills and code synthesis. Some approaches leverage LLMs as planners that select pretrained skills, 4 while others train vision-language action (VLA) models with action tokenization so that a single policy can convert language commands into action outputs. 35 Additional studies prompt LLMs to derive structured subtasks that a lower-level controller executes through learned policies.3,40 Another direction generates code or solver calls for external tools to handle optimization, constraint checking, or motion planning.27,28 Some methods define reward functions through LLMs to guide model-predictive controllers without manually engineered primitives. 41
LLM-based research on safety and verification aims to ensure that robot actions generated from language models remain safe, consistent, and aligned with user-defined constraints. Recent work explores ways to augment LLM planners with auxiliary safety modules that predict risks, filter unsafe behaviors, or revise plans before execution. For example, several approaches introduce safety predictors or parallel safety agents that supervise LLM-generated plans and suppress unsafe actions.42,43 Other methods translate natural-language instructions into formal safety constraints, such as temporal logic formulas, to guarantee that generated task plans satisfy user-specified rules.8,44 Additional lines of research focus on detecting execution failures, revising actions based on environmental feedback, or synthesizing guardrails that intervene during runtime to prevent safety violations.29,45–47
LLM-based planning research investigates how LLMs generate high-level task plans from natural-language instructions by leveraging their pretrained world knowledge. Earlier work demonstrates that LLMs can translate user goals into structured action sequences. 48 Subsequent approaches further refine the planning process by incorporating consistency checks or iterative reasoning mechanisms to improve reliability in long-horizon tasks. 30 Another method adopts a neuro-symbolic perspective, decomposing complex tasks into subgoals using an LLM and assigning each subgoal to an LLM-based search procedure. 31 Despite these advances, existing LLM-based planners largely rely on a closed-world assumption and cannot robustly adapt their plans to previously unseen objects or novel physical attributes. To overcome this limitation, our framework uses an LLM to convert interaction-derived physical properties into symbolic knowledge, and the planner directly uses the expanded knowledge base to produce updated task plans without any additional design effort.
Deformable object manipulation
Research on manipulating deformable objects has traditionally been conducted at the motion level,12,20 which requires prior knowledge of the physical properties of objects. Based on this knowledge, the robot is controlled to guide the objects toward the goal state. Learning-based methods14,49–51 have been proposed that capture the physical properties of deformable objects from data. For example, Fu et al. 52 propose a method for unfolding clothes that learns state classification and region segmentation of folded objects to guide the unfolding process. However, none of the aforementioned methods are easily applicable to unseen objects as they require prior knowledge or laborious data collection and training. While a few methods (e.g. Shridhar et al., 53 Bartsch and Farimani 54 ) leverage foundation models to manipulate unseen deformable objects, they work in an end-to-end manner without understanding the physical properties of the objects, which could otherwise be used for long-horizon task planning.
Domain generation
Research on the automated generation of knowledge bases enables an autonomous agent to explore unseen environments. Generally, the agent expands a separate knowledge base by reasoning about its environment. For example, Hanheide et al. 55 propose a knowledge hierarchy which allows assumption-based reasoning in uncertain environments. On the other hand, recent works leverage foundation models to recognize unseen objects from visual observations.56–58 In Liu et al., 30 an LLM is used to generate domain knowledge for a given task from natural-language instructions, and the knowledge is further used for task planning. Although these methods demonstrate the ability to generalize to unseen environments and objects, they may inaccurately predict the properties of objects because they rely on predefined object sets or the built-in commonsense knowledge of LLMs. Consequently, their ability to understand the physical properties of unseen deformable objects is limited because they cannot interact with and reason about their surroundings.
Recently, there has been growing interest in identifying the physical properties of objects. Using fine-tuned LLMs trained on curated datasets, one line of research focuses on estimating object properties by mapping target objects to known categories and predicting their properties based on similar examples. For instance, Gao et al. 59 demonstrate the limitations of VLMs in capturing physical characteristics and introduce a dataset specifically designed to address this gap. Similarly, Xie et al. 60 propose a grasping strategy to estimate properties such as friction, mass, and spring constant. These approaches require additional training for high-level reasoning and often struggle to generalize across diverse object categories. Another line of research focuses on inferring physical properties through direct robot–object interactions. For example, Zhao et al. 26 leverage LLMs in conjunction with multimodal sensors including sound, torque, and tactile data to reason about materials. Likewise, Lai et al. 61 propose a multimodal reasoning method that integrates visual and haptic signals, using a robotic shaking action to identify liquid-containing objects. Despite their promise, these methods often depend on prior knowledge of specific physical properties and require additional training, which limits their adaptability to diverse and unstructured environments.
Our goal is to address the challenge of understanding the physical properties of unseen deformable objects without fine-tuning, enabling their use in long-horizon task planning. We aim to combine the commonsense reasoning of LLMs with physical interactions between a robot and objects.
Problem formulation
Our goal is to understand physical properties of unseen objects to generate a grounded robot task plan for
To generate
We want to construct a knowledge base
To determine
Method
As shown in Figure 2, we develop a framework which consists of the property reasoner, domain generator, instance descriptor, task planner, and plan validator. The first three components generate a planning instance and domain knowledge, whereas the last two generate and validate task plans.
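To make the data flow between these components concrete, the sketch below shows one plausible way to chain them. The class and method names (e.g. probe, generate, describe, plan, check) are assumptions made for exposition, not the actual implementation.

```python
# Hedged sketch of the five-component pipeline; interfaces are assumed,
# not the authors' actual implementation.

class PropertyPlanningPipeline:
    def __init__(self, reasoner, domain_gen, instance_desc, planner, validator):
        self.reasoner = reasoner            # property reasoner (VLM + LLM + robot probing)
        self.domain_gen = domain_gen        # domain generator (predicates and actions)
        self.instance_desc = instance_desc  # instance descriptor (initial/goal states)
        self.planner = planner              # task planner (LLM)
        self.validator = validator          # plan validator (LLM + script execution)

    def run(self, scene_images, instruction):
        # 1. Detect objects and probe their physical properties through interaction.
        objects = self.reasoner.probe(scene_images)
        # 2. Build the planning domain from the discovered properties.
        domain = self.domain_gen.generate(objects)
        # 3. Describe the problem instance (initial and goal states) from the instruction.
        problem = self.instance_desc.describe(objects, instruction)
        # 4. Generate a task plan; the validator checks it and triggers replanning
        #    on errors (see Section "Task planner and plan validator").
        plan = self.planner.plan(domain, problem)
        return plan if self.validator.check(domain, problem, plan) else None
```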

The overall procedure of our method for the bin-packing domain involves autonomously investigating object properties to expand domain knowledge using a robot. The property reasoner gathers visual information about objects (e.g. color, dimensions, shape). If the knowledge base lacks the physical properties of the objects, the reasoner uses images of the objects interacting with the robot to probe these properties. Using the object information and a language instruction, the domain generator and instance descriptor create structured data for task planning. The generated task plan is then validated by the plan validator to check for syntax and semantic errors.
Property reasoner
Object detection and naming
The property reasoner detects
Property probing
The physical properties of detected objects are reasoned about by the LLM as the objects physically interact with a robot. If the number of candidate properties is excessive, the LLM is likely to hallucinate. Thus, we limit the set of object properties depending on the task domain. As a running example, we use the bin-packing domain where a set of objects are packed into a bin. In this domain, we focus on the deformability of objects so that they can be packed while preventing plastic deformation. To do so, we introduce a set of five properties: rigidity, bendability, foldability, compressibility, and irreversible (plastic) deformability. An object can be classified as either rigid or nonrigid based on whether it maintains its shape when subjected to external forces. Nonrigid objects can exhibit deformation properties that depend on their dimensions. One-dimensional (1D) objects such as a string or a needle can be bendable. Two-dimensional (2D) objects like a sheet of paper or a dish plate can be foldable. 3D objects such as sponges can be compressible. If a deformed (i.e. bent, folded, or pushed) object cannot recover its original state, it is plastic deformable regardless of its dimension.
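Read as a decision procedure, this classification can be sketched as follows; ask_llm is a hypothetical yes/no visual question-answering helper standing in for the actual prompts, and the questions themselves are illustrative.

```python
# Hedged sketch of the dimension-based property classification described above.
# `ask_llm` is a hypothetical yes/no visual question-answering helper.

def classify_properties(ask_llm, probe_images, dimension):
    """probe_images: object before/during/after the probing action.
    dimension: '1D', '2D', or '3D', obtained from the object naming step."""
    if ask_llm("Does the object keep its shape under the applied force?", probe_images):
        return {"rigid"}

    properties = set()
    question, prop = {
        "1D": ("Was the object bent?", "bendable"),
        "2D": ("Was the object folded?", "foldable"),
        "3D": ("Was the object compressed?", "compressible"),
    }[dimension]
    if ask_llm(question, probe_images):
        properties.add(prop)

    # Irreversible (plastic) deformation applies regardless of dimension.
    if not ask_llm("Did the object recover its original shape afterwards?", probe_images):
        properties.add("plastic deformable")
    return properties
```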
The properties impose constraints or objectives for safe and space-efficient bin-packing: (i) Bendable (1D) objects can be bent to fit the size of the bin. (ii) Foldable (2D) objects can be folded to utilize the bin space as much as possible. (iii) Compressible (3D) objects can be compressed to secure more space for packing. Also, they can be used to protect objects that can have plastic deformation. (iv) Plastic deformable objects can be easily damaged. Protecting them using compressible objects can be beneficial for packing tasks. Note that rigid objects can be placed in a bin without constraints.
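As a rough illustration of how these constraints could enter a planning domain, the snippet below maps each probed property to a PDDL-style predicate and a packing rule; all predicate names and rules are assumptions for exposition, not the generated domain.

```python
# Illustrative mapping from probed properties to PDDL-style predicates and
# bin-packing rules; the names are assumptions, not the generated domain.
PROPERTY_PREDICATES = {
    "rigid":              "(rigid ?o)",
    "bendable":           "(bendable ?o)",            # 1D: may be bent to fit the bin
    "foldable":           "(foldable ?o)",            # 2D: may be folded to save space
    "compressible":       "(compressible ?o)",        # 3D: may be compressed or used as padding
    "plastic deformable": "(plastic-deformable ?o)",  # must be protected from damage
}

def packing_hint(properties):
    """Return the packing rule implied by an object's probed properties."""
    if "plastic deformable" in properties:
        return "cushion with a compressible object before packing"
    if "compressible" in properties:
        return "compress to free space, or use as padding for fragile items"
    if "foldable" in properties:
        return "fold before placing to use the bin space efficiently"
    if "bendable" in properties:
        return "bend to fit the bin footprint"
    return "place directly; no deformation constraint"
```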
To probe these properties, we define a set of probing actions

The decision tree approach for determining object properties. (a) By applying a series of actions, one of the properties at the leaf nodes is determined. (b) Depending on the dimension of objects, a dual-arm robot performs appropriate probing actions. For example, the property of
We develop two approaches to utilize images for reasoning with the LLM. In the first approach, the LLM is directly queried about the properties using three images, with each image labeled to indicate the action that produced it. The second approach employs a decision tree (Figure 3(a)) based on property definitions provided by
Our overall procedure for the property reasoning method is described in Algorithm 1 and prompts are provided in online Appendix A.2. In line 3, the VLM draws bounding boxes of objects in
Task planning
Domain generator
Task planning requires a domain description (
A domain description
The domain generator receives the knowledge obtained by the property reasoner. It uses the predicate generator and the action generator (which belong to the domain generator in Figure 2) to generate
The action generator produces the action description
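For illustration, one action the action generator might produce for a foldable (2D) object could resemble the following hypothetical PDDL fragment, shown here as a Python string; its predicate and parameter names are assumptions for exposition.

```python
# Hypothetical PDDL action for a foldable (2D) object; names are illustrative only.
FOLD_ACTION = """
(:action fold
  :parameters (?o - object)
  :precondition (and (foldable ?o)
                     (not (folded ?o))
                     (not (in-bin ?o)))
  :effect (and (folded ?o)))
"""
```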
Instance descriptor
An instance description
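As a rough illustration, an instance description for a small bin-packing case might resemble the following PDDL problem, embedded here as a Python string; the object and predicate names are hypothetical and do not correspond to output generated by our system.

```python
# Hypothetical PDDL problem of the kind the instance descriptor could produce
# for a small bin-packing instance; object and predicate names are assumptions.
PACKING_PROBLEM = """
(define (problem pack-three-objects)
  (:domain bin-packing)
  (:objects sponge cable plate - object bin1 - bin)
  (:init (compressible sponge) (bendable cable)
         (foldable plate) (plastic-deformable plate)
         (on-table sponge) (on-table cable) (on-table plate))
  (:goal (and (in-bin sponge bin1) (in-bin cable bin1) (in-bin plate bin1))))
"""
```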
Task planner and plan validator
By the domain generator and instance generator, all necessary components for task planning are prepared. The task planner generates a grounded task plan, while the plan validator checks its validity. If the plan is found to be invalid, it is fed back to the task planner for replanning.
The LLM as the task planner produces a plan
The plan validator (also the LLM) examines the task plan by executing the Python script and looking at the execution result denoted by
Therefore, if a semantic error is detected, we regenerate
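The following Python sketch summarizes this plan-validate-replan loop under assumed helper interfaces (llm_plan, run_validation_script, llm_semantic_check); it is a minimal illustration, not the actual validator.

```python
# Hedged sketch of the plan-validate-replan loop; the three callables are
# hypothetical helpers passed in by the caller.

def plan_with_validation(llm_plan, run_validation_script, llm_semantic_check,
                         domain, problem, max_iters=5):
    """Plan, validate, and replan until the plan passes both checks."""
    feedback = None
    for _ in range(max_iters):
        plan = llm_plan(domain, problem, feedback)               # task planner (LLM)
        result = run_validation_script(domain, problem, plan)    # executes the checking script
        if not result.ok:                                        # syntax-level error
            feedback = result.error
            continue
        verdict = llm_semantic_check(domain, problem, plan, result.output)
        if verdict.valid:                                        # passes both checks
            return plan
        feedback = verdict.reason                                # semantic error: replan
    return None
```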
Experiments
We design a set of experiments to show the performance of the proposed method where metrics are (i) the accuracy of the property reasoner in probing the physical properties of objects and (ii) the success rate of the task planner given the physical properties of objects.
The set of 14 objects used in the experiments is shown in Figure 4, including rigid and nonrigid objects with different dimensions. The object set consists of objects whose physical properties are difficult to recognize from visual information alone. For example, predicting the properties of Objects 1 and 2 is difficult due to a lack of visual features. Using the selected objects, we generate 38 random bin-packing instances, each containing 3 to 7 objects. Since the LLM can produce different answers even for an identical prompt, we repeat each instance 10 times to obtain statistics.

The set of 14 objects used in the experiments. The dimensions and physical properties of them are listed in the second and third columns of Table 1(a).
The success rates of the property reasoner and its components, with numerical results corresponding to Figure 6.
Settings
We use two Agile-X 6-DOF manipulators in the experiments to perform property probing and task execution. The robot motions for both property probing and task execution are generated by human teaching, following the Programming by Demonstration (PbD) paradigm, where each action trajectory is recorded through human teleoperation and subsequently replayed by the robots. The LLM used in the experiments is GPT-4o, with parameters set to the default values except for temperature set to
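For concreteness, a minimal record-and-replay routine in the spirit of PbD could look like the sketch below, assuming a hypothetical robot interface with joint read and command methods; our actual teaching setup may differ.

```python
# Minimal record-and-replay (PbD) sketch, assuming a hypothetical `robot`
# interface that exposes joint readings and joint position commands.
import time

def record(robot, duration_s, hz=50):
    """Sample joint positions while a human teleoperates the robot."""
    trajectory = []
    for _ in range(int(duration_s * hz)):
        trajectory.append(robot.read_joint_positions())
        time.sleep(1.0 / hz)
    return trajectory

def replay(robot, trajectory, hz=50):
    """Replay a taught trajectory at the recording rate."""
    for q in trajectory:
        robot.command_joint_positions(q)
        time.sleep(1.0 / hz)
```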
Property reasoning
The performance of property reasoning heavily depends on the performance of object detection and naming. Thus, we first measure the success rates of object detection and naming. Then we measure the accuracy of property probing for those objects that are successfully detected and named.
For each of the 38 bin-packing instances, our method uses two images from different perspectives, a top view and a side view, as shown in Figure 5. The two images contain the same objects. Object detection is successful if all objects in the images are detected without any false positives; in other words, a successful detection has neither a missing object nor a hallucination. To test the effects of the bounding boxes and the graph, we compare four methods for object detection: None (raw images), BB (bounding boxes only), Graph (graph only), and BB+Graph (both). Each instance is tested 10 times with an identical prompt, for a total of 380 trials. Figure 5 shows the bounding boxes and graph. As summarized in Table 1b and Figure 6(b), Graph and BB+Graph achieve a 100.00% success rate. While None and Graph achieve high success rates, providing additional visual information helps improve the detection performance of the LLM. Specifically, providing spatial information (i.e. the graph) about the objects plays a crucial role in detecting missing objects and preventing hallucinations.
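For illustration, the connection graph can be thought of as an adjacency structure over detected bounding boxes; the sketch below is one plausible construction (boxes that overlap or nearly touch are connected) and may differ from the construction actually used by the property reasoner.

```python
# Hedged sketch of building a connection graph from bounding boxes: two
# detections are connected when their boxes overlap or nearly touch.

def boxes_touch(a, b, margin=10):
    """a, b: (x_min, y_min, x_max, y_max) in pixels."""
    return not (a[2] + margin < b[0] or b[2] + margin < a[0] or
                a[3] + margin < b[1] or b[3] + margin < a[1])

def connection_graph(detections, margin=10):
    """detections: dict mapping object label -> bounding box. Returns an edge list."""
    labels = list(detections)
    edges = []
    for i, u in enumerate(labels):
        for v in labels[i + 1:]:
            if boxes_touch(detections[u], detections[v], margin):
                edges.append((u, v))
    return edges

# The edge list (e.g. [("obj_1", "obj_2")]) is serialized into the prompt so
# the LLM can reason about which objects are adjacent in the scene.
```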

An example of a set of images processed by a VLM and a connection graph. Since the objects are unknown yet, their labels are determined arbitrarily. The graph helps the LLM understand the spatial relationship between objects which reduces the hallucination. VLM: visual-language model; LLM: large-language model.

The success rates of the property reasoner and its components. (a) The high success rate of object naming shows that the LLM can recognize the shape and dimension of objects consistently. (b) The effectiveness of the additional visual information (the bounding boxes and graphs) is shown. (c) The physical interactions with objects greatly help in understanding the properties of objects. (a) Object naming, (b) object detection, and (c) property probing. LLM: large-language model.
We measure the success rate of object naming for objects detected with BB+Graph. A successful naming must include both the correct dimension and the correct shape of the object in its name. Object color is not considered, as color is not used in the bin-packing task. The 38 instances contain 184 objects in total, including duplicates (i.e. the same object is counted twice if it appears in two instances). By repeating 10 trials for each instance, we obtain a total of 1840 names to measure the success rate. As shown in Table 1a and Figure 6(a), object dimensions and shapes are correctly identified in most cases. While most objects achieve success rates close to 100%, Object 5 is occasionally misclassified as a cuboid (28.12%) despite being cylindrical. We believe this is due to the ambiguity of its shape: it is not a perfect cylinder and features partial angulations. Its translucency and lack of texture add further complexity to the shape recognition process of the LLM.
Finally, we measure the performance of the property probing given the satisfactory performance of the object detection and naming. We have 10 trials for each of 14 objects, so 140 test cases. The deforming actions (e.g.
We compare four methods: Image, Image+Name, Interact, and Interact+Tree, described as follows:
- Image: Does not use the probing actions; the LLM reasons about the properties from a single image.
- Image+Name: Same as Image, but includes the object label identified by a pretrained Vision Transformer. 65
- Interact: Uses three images showing the state of the object before, during, and after interaction.
- Interact+Tree: Same as Interact, but additionally uses the decision tree described in Section “Property Probing.”
We note that the image used in Image and Image+Name is the same as the before probing image used in Interact and Interact+Tree.
The results in Table 1c and Figure 6(c) show that our interaction-based methods achieve accuracy of up to 78.57% across all objects. In contrast, the noninteraction methods, Image (34.29%) and Image+Name (22.86%), show a significant limitation in predicting physical properties as they do not involve physical interactions with the objects. The lower success rate of Image+Name indicates that providing object names to the LLM can bias its reasoning. The Interact (78.57%) method slightly outperforms Interact+Tree (73.57%). This result indicates that using the decision tree can limit the flexibility of the reasoning process of the LLM. While the experimental results show that robot interactions significantly help the LLM reason about object properties, imposing constraints (i.e. object names, decision trees) appears to restrict the reasoning capabilities of LLMs, leading to reduced performance. The reduced accuracy of the decision-tree approach results from its inability to support joint analysis of the probing images. Moreover, the hand-designed branching rules constrain the flexibility of the LLM to adapt its visual analysis to subtle differences in material behavior. In contrast, the free-reasoning variant allows the LLM to integrate information across all probing observations without predefined decision rules. As multimodal LLMs continue to advance, prompting strategies that restrict the flow of information are expected to offer diminishing benefits compared to open-ended reasoning.
To further validate the effectiveness of property probing through robot–object interactions, we conduct additional experiments using objects with similar visual appearances but differing physical properties. We select five objects for evaluation, Object 5 and Object 7, along with their visually similar counterparts, Objects A and B for Object 5, and Object C for Object 7, as described in Figure 7(a). Objects 5, A, and B are labeled

Property probing for similarly shaped objects (e.g.
Task planning and plan validation
Real-world bin-packing scenario
We examine the resulting task plans to see if the plans comply with
We first measure the success rate of Case I (i.e. violating

The task planning result. (a) The success rates of task planning where replanning has shown the effectiveness of correcting errors. (b) The real-world implementation demonstrates that the probed properties can be effectively utilized in task planning. (a) The success rates of task planning and replanning and (b) the real-world execution of the task planning.
There are eight instances which belong to Case II. The most common failing instances involve Object 10 (
To illustrate how the discovered object properties can influence the result of task planning, we execute task plan examples as shown in Figure 8(b). (Refer to online Appendix D.3 and Supplemental video for full task planning execution.) All grounded actions in the plan such as “
Additional scenarios
To examine whether the proposed framework generalizes beyond the bin-packing domain, we conduct additional simulation experiments across several task settings that require deformable object reasoning. The evaluation includes three representative tasks, namely RopeFlattening, ClothFolding, and a Blocksworld domain. Figure 9 illustrates simulation settings for property probing.

Simulation environments for deformable-object tasks. The figure illustrates the three simulated domains used to evaluate generalization: RopeFlattening (left), ClothFolding (center), and Blocksworld with deformable constraints (right).
In the RopeFlattening task, the agent straightens the rope when the rope is bendable. If the rope is not bendable, the agent arranges it without flattening. In the ClothFolding task, the agent folds every object that is identified as foldable and removes any object that is not foldable from the plan. In the Blocksworld task, the planning rule requires every deformable cube to appear at the top of the final stack.
We first evaluate the accuracy of property probing in each domain using 10 simulation trials. The probing success rates are 100%, 40%, and 100% for RopeFlattening, ClothFolding, and Blocksworld, respectively. We then measure the task planning success rates under the condition that all required properties are correctly inferred. As shown in Table 2, the initial planning success rates are 70%, 100%, and 40% for RopeFlattening, ClothFolding, and Blocksworld, respectively. After five replanning queries, the success rates increase to 100%, 100%, and 60% for the same domains. This result shows that our framework operates consistently across various planning tasks. Notably, for the ClothFolding domain, the accuracy of property probing is only 40%. This result suggests that the visual ambiguity of partially folded or wrinkled cloth states makes it difficult for LLMs to consistently distinguish foldable objects from plastic deformable ones. In the Blocksworld domain, the planning success rate remains limited even after multiple replanning steps because the task requires long-horizon decisions. In addition, the LLM often violates symbolic rules in stacking steps, such as the correct update of clear statuses (e.g. the lower block should become false and the upper block should become true). These observations indicate that the difficulty of the task and the clarity of the probing signal both influence downstream planning performance, showing where additional structure or feedback may be required.
Planning success rates (%) across simulation tasks.
Discussion
Planning behavior
While the success rate of task planning exceeds 95% after a few iterations, the first planning query does not yield sufficiently good performance. This limitation might come from ambiguity in the prompts, the limited domain knowledge of the LLM, or its incomplete physical and contextual understanding. For example, the term ‘plastic’ might provide an ambiguous signal to the LLM, which may interpret it as a material type rather than the intended mechanical concept of ‘plastic deformation’. In addition, our method consists of several sequential processes where the result of each preceding process significantly impacts the following ones.
In the experiments, we use the PbD method to generate motions for probing actions and task execution, without any learning or generalization. Although achieving full autonomy is not the focus of this work, we briefly attempted to apply the state-of-the-art imitation learning method ALOHA. 66 However, the method did not perform reliably for our tasks, which involve fine manipulation of deformable objects such as ‘bend’ or ‘fold’ actions. Given these considerations, record-and-replay (i.e. PbD) serves as a fast and efficient solution for the scenarios in this study. Looking forward, the success of our simple probing actions suggests that our approach could be extended by combining it with advanced models such as VLA or imitation learning to achieve more robust autonomy. The code is provided in online Appendix E.
Generalization scope
Compared with classical or learned pipelines, our LLM-based approach offers practical advantages across multiple dimensions that are relevant for reasoning about previously unseen objects. First, our pipeline does not require task-specific training datasets or annotated material labels, which are often necessary in supervised or feature-engineered systems. This property allows the framework to scale to new domains without requiring additional data collection. Second, the LLM can interpret ambiguous or underspecified descriptions through linguistic priors, enabling our pipeline to handle uncertainty more flexibly than rule-based models. Third, our method generalizes naturally to unfamiliar objects and varied task settings, whereas classical pipelines remain constrained by predefined categories and domain-specific features. Classical approaches still provide strengths such as low computational latency, strong safety guarantees, and high reproducibility through deterministic rules, but these benefits come with limited adaptability when environments contain diverse objects or shifting task conditions. In contrast, LLM-based reasoning draws on broad semantic knowledge and adapts more effectively to unfamiliar or uncertain situations, which becomes especially valuable in open-ended environments. Overall, our pipeline offers clear advantages in settings that are unstructured, include previously unseen objects, or require flexible reasoning beyond the capabilities of classical methods.
We conduct experiments with 14 objects that have different material properties under a fixed camera setup and a bin-packing task. Although the evaluation is limited to this setting, the same framework can be used in broader conditions. The possibility of generalization can be examined in several directions, including a wider range of object types, changes in camera positions or sensor viewpoints, variations in prompt wording, and the use of different task domains. Future evaluations may also use more complex everyday objects such as laptops or headsets that include multiple parts or mixed materials. This setting can show whether the framework can handle objects that do not have uniform structure. Additional sensing inputs such as touch, sound, or temperature may help in cases where visual information is not enough. It is also useful to examine how much the results change when the prompt wording changes. Finally, using the framework in other tasks can show whether the inferred physical properties support planning in situations that require different types of actions.
Future directions
The proposed approach presents several challenges that remain for future work. First, the validator component needs better scalability to handle a larger number of objects and more detailed physical attributes. Second, the system connects physical reasoning to task planning but does not yet link this reasoning to motion level execution in a direct way. Third, the current pipeline requires noticeable computation time, which limits its use in real-time settings. The reliability and reproducibility of LLM-based inference also change with the choice of model version, decoding settings, and prompt wording. These forms of variation are not present in classical deterministic pipelines and therefore introduce additional sources of uncertainty in the reasoning process.
As an initial attempt to understand the physical properties of previously unseen objects, our work still leaves room for improvement. In particular, generalization can be extended to cover a broader range of object properties (such as fragility or elasticity) and different types of manipulation tasks (such as cooking or household activities). Another direction is to detect irreversible properties without prior knowledge or specialized sensors without causing damage to the object. Finally, although we include a decision tree variant as an ablation, we intentionally restrict our analysis to a single hand-designed structure. Exploring richer or learned tree designs remains an interesting direction for future work but lies beyond the scope of this study. In the future, we will develop more general probing skills applicable to a wider range of physical properties.
Conclusion and future work
In this article, we propose an LLM-based reasoning method to understand the physical properties of unseen objects through the interactions between the objects and a robot. Experimental results demonstrate the effectiveness of the proposed method in task planning with iterative improvements.
In future work, we plan to extend the framework to a broader range of objects with varied material behaviors and to evaluate it in more complex manipulation tasks, such as household tidying where robots must handle fragile and deformable items. We also intend to explore additional sensing modalities such as tactile or force sensing to enrich the inferred physical representations and further enhance the robustness of the planning process. We also plan to bridge task-level plans with the execution layer so that the robot can carry out the actions with more consistent motion performance.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (No. RS-2024-00411007 and No. RS-2024-00461583).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
