Abstract
In this paper, we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion through the utilization of a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, as well as arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g., counting), others are trainable neural functions (e.g., visual grounding), therefore marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user’s vocabulary, while being transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, achieving an average success rate of 80.2%, both in simulation and with a real robot.
Introduction
As modern developments in robotics are beginning to move robots from purely industrial to human-centric environments, it becomes essential for them to be able to interact naturally with humans. This necessity poses two additional challenges to traditional autonomy, as the agent is expected to be interactive, that is, able to receive task-specific instructions from its human cohabitants, as well as interpretable, that is, complete the task in a manner that is fully explainable to non-expert users. The second feature is of particular interest, as it enables humans to diagnose and correct erroneous robot behaviors via online interaction, for example, through free-form natural language. Grounding perception and action in natural language has been a central theme in recent computer vision and robotics literature, from language-grounded 3D vision (Achlioptas et al., 2020; Azuma et al., 2021; Chen et al., 2020), to language-conditioned manipulation (Jang et al., 2022; Lynch and Sermanet, 2020; Stepputtis et al., 2020), to integrated language-based systems (Ahn et al., 2022; Huang et al., 2022b; Zeng et al., 2022a) for high-level reasoning and task planning. Across domains, language has been shown to be a great inductive bias for effective robot learning; however, methods still struggle with grounding fine-grained concepts beyond object category (i.e., visual attributes and spatial relations) (Shridhar et al., 2021), as well as reasoning about them in an algorithmic fashion (e.g., counting). The end-to-end nature of most approaches leads to additional limitations, namely: (a) lack of interpretability, as the underlying reasoning process required to solve the task is captured implicitly in the network’s representations and thus cannot be retrieved from the output, (b) data-hungriness, that is, the need for large vision-language datasets that sufficiently sample the space of all possible concept combinations, and (c) closed-endedness, as the end-to-end policy is trained for a fixed agent/environment and catalog of concepts and tasks.
We believe that these limitations stem from the holistic fashion in which most methods couple language with perception. In particular, they either rely on visual-text feature fusion in a joint space (Chen et al., 2021; Hatori et al., 2017; Shridhar et al., 2021; Shridhar and Hsu, 2018; Stepputtis et al., 2020), or FiLM-condition (Perez et al., 2017) the visual network on a sentence-wide embedding of the language input (Ahn et al., 2022; Jang et al., 2022). We argue that this methodology fails to exploit the compositional nature of language, instead relying on variance to learn one-to-one correspondences between task descriptions and robot behavior. For instance, consider a scenario like the one shown in Figure 1, where a human asks a question about the scene, for example, “How many sodas are in front of the white book?”. The task requires grounding multiple different concepts (visual—“book,” “white,” spatial—“front,” and symbolic—“How many”) and reasoning about the intermediate results to reach a final answer. Our intuition is that, for a human, the logic behind solving this task is compositional (a hierarchy of primitive steps) and disentangled from perception, meaning that the reasoning steps illustrated in Figure 1 can be generalized to all similar questions regardless of the actual scene content.
Figure 1. Example scenarios where a human user interacts with the robot in natural language. Understanding the input question/instruction often requires reasoning about properties or relations of appearing objects in a compositional manner. Neurosymbolic approaches parse the input question into the underlying reasoning program and execute it step-by-step in order to reach the final answer (top). Similarly, we propose a neurosymbolic model that represents grasp policies as programs in an interpretable formal language. End-to-end vision-language-grasping methods learn a policy directly from raw inputs and thus actions are generated regardless of the scene content. In the second example (bottom), there is no red soda for the robot to grasp, but only our approach is able to capture this and communicate it to the user.
Such intuition is encapsulated within neurosymbolic frameworks (Johnson et al., 2016; Liu et al., 2019; Mao et al., 2019; Yi et al., 2018), which propose to further inject prior knowledge about language in the form of symbolic programs (Yi et al., 2018) that explicitly describe the underlying reasoning process. The overall task is decomposed into independent sub-tasks (primitives), and each one is implemented as a symbolic module in a Domain-Specific Language (DSL). The idea is to use deep neural nets as parsing tools—from images to structured object-based representations and from text queries to programs—and pair them with a symbolic engine that executes the parsed program on the scene representation to reach an answer. By disentangling perception and language understanding (neural) from reasoning (symbolic), neurosymbolic systems address several of the highlighted limitations; that is, other than a final answer, they output a formal, interpretable representation of the underlying reasoning process (see Figure 1). Furthermore, utilizing programs as a prior for learning renders the system highly sample-efficient and aids generalization to unseen concept-task combinations (Mao et al., 2019; Yi et al., 2018). However, prior works are limited to REF/VQA tasks, and the associated datasets (Johnson et al., 2016; Liu et al., 2019) model abstract synthetic domains with a limited variety of object and relation semantics. Proposed methods also fix their DSL to be aware of the domain vocabulary (i.e., primitives are coupled with concept arguments), limiting them to the concepts encountered at training time.
In this work, we wish to bring neurosymbolic reasoning to the robotics field and utilize it as an auxiliary process for interpretable robot manipulation. To that end, we generate a synthetic 3D vision-and-language dataset with a broad collection of object category, attribute, and relation concepts. We design a corresponding DSL and re-formulate components of previous neurosymbolic recipes to handle the open-vocabulary requirement (see the schematic in Figure 2). In particular, we decompose the language-to-program module into two steps, first identifying concepts in the sentence to create an abstracted version of the query and then feeding it to a seq2seq network that generates the program, thus relieving the latter component from having to deal with the specific concept vocabulary of the training set. To ground (potentially unseen) concepts in the image, we use concept grounding networks that operate on latent object-relation features, serving as an alternative to classification. We compare our method with other holistic/neurosymbolic baselines in terms of accuracy and sample-efficiency and show that it can be transferred to real images via few-shot fine-tuning of the visual grounder network. We further integrate our model with a robot framework and test its performance in an interactive object grasping task, where we show that its highly interpretable nature allows us to study the distribution of failure modes across the different system components. We close our evaluation by showing that the method can be efficiently extended to more manipulation tasks at the cost of a few hundred relevant instruction-program annotations. In summary, the key contributions of this work are threefold:
• We generate a synthetic dataset of household objects in tabletop scenes for REF/VQA/grasping tasks, equipped with program annotations for reasoning, and collect a small-scale real-scene counterpart for evaluation. We make both datasets publicly available.
• We propose a neurosymbolic framework that integrates instance segmentation, visual/spatial grounding, semantic parsing, and grasp synthesis in a vocabulary-agnostic formulation that supports application to unseen vocabulary, making it transferable to novel concepts/tasks with minimal adaptation.
• We perform extensive experiments to show the merits of our approach in terms of (i) interpretable, highly accurate, and sample-efficient reasoning, evaluated through a VQA task, (ii) robustness to the user’s vocabulary, (iii) efficient adaptation to natural scenes and more manipulation tasks, and (iv) applicability to interpretable interactive object grasping, tested both in simulation and with a real robot.
Figure 2. A schematic of the proposed framework. First, objects are segmented and localized in 3D space (top left) and the scene is represented as a graph with extracted object-based features (visual, grasp pose) as nodes and spatial relation features as edges (top middle). A human user provides an instruction and a language parser generates an executable program (bottom left), built out of a primitives library (bottom middle). A program executor utilizes a set of concept grounding modules to ground words to different objects (center) and executes the predicted program step-by-step (top right), in order to identify the queried object and instruct the robot to grasp it (bottom right).

Related works
Grounding referring expressions
Grounding visual and spatial concepts expressed through language is a central challenge for an interactive robot. Deep learning literature poses this through the task of grounding referring expressions (REF) (Plummer et al., 2015; Yu et al., 2016a), that is, localizing an object in a scene from a natural language description. Methods usually employ a two-stage detect-then-rank approach, leveraging off-the-shelf detectors to first propose objects and then rank their object-query matching scores through CNN-LSTM feature fusion (Mao et al., 2015; Rohrbach et al., 2015; Yu et al., 2016b) or attention mechanisms (Luo and Shakhnarovich, 2017). Alternatively, richer cross-modal contextualization between images and words is pursued through external syntactic parsers (Andreas et al., 2016; Cirik et al., 2018), graph attention networks (Wang et al., 2018; Yang et al., 2019a, 2019b), or Transformers (Chen et al., 2019; Li et al., 2019; Lu et al., 2019; Yu et al., 2020). Single-stage methods (Du et al., 2021; Sadhu et al., 2019; Yang et al., 2019c) attempt to alleviate the object proposal bottleneck by densely fusing textual with scene-level visual features to create joint multimodal representations. Transferring from large-scale vision-language pretraining (Li et al., 2021; Radford et al., 2021) aids in out-of-distribution generalization and can be used in zero-shot setups (Subramanian et al., 2022) or for open-vocabulary object detection (Gu et al., 2021). REF has also been extended to the 3D domain (Achlioptas et al., 2020; Chen et al., 2020), where, similar to 2D, most methods employ detect-then-rank pipelines, fusing textual features with segmented point-clouds (Achlioptas et al., 2020; Zhao et al., 2021) or RGB-D views (Huang et al., 2022a; Liu et al., 2021). All the above approaches follow the holistic methodology and hence, as argued in the previous section, suffer from data-hungriness and lack the desired interpretability property.
Closer to our work, modular approaches (Hu et al., 2016; Liu et al., 2018; Yu et al., 2018) decompose the grounding task into independent modules (e.g., entities, attributes, relations) and predict their composition based on the query’s structure with a language parser. Such methods use soft attention-based parsers that are trained end-to-end with the rest of the modules using weak supervision. In Tziafas and Kasaei (2022), the modules are trained separately using dense attribute- and relation-level supervision from synthetic data and are linked to words using a tagger network. However, module composition is handled by a linguistics-inspired heuristic and is hence limited to referring expressions that follow a standard subject-relation-object syntax. Similarly, we use a tagger and dense synthetic supervision to train our modules but replace the heuristic with a seq2seq network that can map arbitrary syntactic structure into a formal representation (program), expressed via a DSL. With this, we can extend the scope of the parser from grounding referring expressions to VQA and eventually robot action, by adding the associated modules to our DSL.
Neurosymbolic reasoning
Early works in modular networks for VQA (Andreas et al., 2015; Hu et al., 2017a, 2017b; Hudson and Manning, 2018; Johnson et al., 2017a, 2017b) demonstrate the capacity for compositional vision-language reasoning by composing independent modules instead of end-to-end learners. More recently, a neurosymbolic model for VQA (NS-VQA) (Yi et al., 2018) in CLEVR (Johnson et al., 2016) and its extensions to natural images (Hu et al., 2019; Hudson and Manning, 2019; Wang et al., 2021) utilize a formal DSL and a symbolic program executor to run programs on parsed scene representations. Program generation and scene parsing (i.e., localization and attribute recognition) are trained separately and interface with the executor only at test-time. In such works, however, the scene is represented as a table of attribute labels (Yi et al., 2018) or features (Mao et al., 2019), without any relation information. Resolving spatial relations is then achieved by using concept-specific heuristics as primitives (e.g., relate left). Visual attribute concepts are either classified (Yi et al., 2018) and coupled with primitives or matched with concept representations learned jointly from a closed set (Mao et al., 2019). This formulation ties the system to the concept vocabulary encountered during training. In our work, we integrate relation concepts with object-based features in a latent scene graph representation and make our primitives vocabulary-agnostic, allowing extension to novel concepts without touching the DSL, via concept grounding networks. Like NS-CL (Mao et al., 2019), we enable open-vocabulary parsing by replacing lexical items in the input query with their corresponding concepts. Unlike NS-CL, which assumes access to ground truth tags, we learn the word-to-concept mapping through a tagging sub-module.
A few works, similar to ours, apply neurosymbolic reasoning in the robotics domain. ProgramPort (Wang et al., 2023) uses a CCG parser to construct programs and CLIP (Radford et al., 2021) to ground attributes, and learns a specialized pick-and-place module end-to-end for selecting affordances on a top-down 2D image. In Kalithasan et al. (2022), the authors use a neural scene encoder and semantic parser similar to our framework, but focus on learning transition models that can predict future states of objects for planning. Similarly, PDSketch (Mao et al., 2023) defines a DSL that allows humans to specify program sketches for specific tasks, and learns elaborate transition models that include continuous parameters for actions. Our work differentiates itself by introducing the latent scene graph representation, which already contains action-related parameters (i.e., grasp poses), and by focusing on generalizing semantic parsing and reasoning.
Language-guided manipulation
In the robotics field, language-conditioning has been an emergent theme in RL-based (Jiang et al., 2019; Luketina et al., 2019) and IL-based (Jang et al., 2022; Lynch and Sermanet, 2020; Stepputtis et al., 2020) manipulation. Such methods require prohibitive training resources or several hours of human teleoperation data, dedicated to fixed task settings. CLIPort (Shridhar et al., 2021) proposed to combine the pretrained vision-language alignment capabilities of CLIP (Radford et al., 2021) with the spatial precision of TransporterNets (Zeng et al., 2020) to solve a range of language-conditioned manipulation tasks with efficient imitation learning. However, CLIPort struggles to ground expressions that require reasoning about arbitrary visual concepts and complex relationships between objects. Several other works propose disentangled pipelines for vision and action, with language primarily used to guide vision (Blukis et al., 2020; Chen et al., 2021; Hatori et al., 2017; Misra et al., 2014; Shridhar and Hsu, 2018). The guiding process is implemented via relevancy clustering of LSTM-generated image-text features (Shridhar and Hsu, 2018) or element-wise fusion of images with sentence-wide text embeddings (Chen et al., 2021; Hatori et al., 2017). Such holistic feature fusion approaches fail to exploit richer object-word alignment, as motivated in the previous section. Instead, in our work, we employ a neurosymbolic framework that utilizes explicit semantics about words and phrases and their correspondence to referring expressions in language commands. In Misra et al. (2014), a parser is used to translate language instructions to formal programs operating on scene graphs, similar to our approach. However, programs and scene representations are built with a constituency parser and heuristics, respectively, and are thus limited to the modeled vocabulary of concepts. In our work, we use deep neural nets for parsing and scene representation, as well as object-concept grounding, thereby enjoying the benefits of both explicit semantics and the representational strength of deep networks.
A plethora of recent works use large language models (LLMs) as semantic parsers to map natural language into Python-based programs composed of primitives (S Huang et al., 2023; W Huang et al., 2023; Jin et al., 2023; Liang et al., 2022b; Zeng et al., 2022b), gaining open-vocabulary generalizability thanks to the Internet-scale pretraining of the language model. However, such works rely heavily on prompt engineering and in-context examples to steer the LLM generation, making the system brittle and unreliable. Further, they require closed APIs or intense computational resources as part of the overall architecture, thus hindering real-time applicability, which is essential in robotics. Our work uses semantic parsing that is trained bottom-up from data, while maintaining open-vocabulary generalization by decoupling the domain vocabulary from the DSL primitives.
Methodology
Our architecture comprises four components: (a) a scene encoder (hybrid), (b) a language parser (neural), (c) a dedicated language that implements a library of reasoning/action primitives, paired with a program executor (symbolic), and (d) a set of concept grounding modules (neural). Given a visual world state, the scene encoder constructs a scene graph representation that embeds object features as nodes and their spatial relations as edges. The language parser translates the input natural language query into the underlying program, expressed in our language, and the program executor executes it as a sequence of message passing steps over the extracted scene graph. The concept grounders are used to interface words from the query that represent concepts with their matching objects in the scene representation. The overall framework, with a running example, is illustrated in Figure 2.
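To make the representation concrete, below is an illustrative CLEVR-style program for the running example “How many sodas are in front of the white book?”. Primitive names follow the spirit of Table 1, but the exact names and signatures shown here are assumptions for illustration only.

```python
# Illustrative program for "How many sodas are in front of the white book?".
# Primitive names follow the spirit of Table 1; exact names/signatures are assumptions.
program = [
    ("scene",           None),     # start a trace with all objects in the scene
    ("filter_category", "book"),   # keep only books
    ("filter_color",    "white"),  # keep only white ones
    ("unique",          None),     # assert a single matching object remains
    ("relate",          "front"),  # objects in front of it
    ("filter_category", "soda"),   # keep only sodas
    ("count",           None),     # final answer: an integer
]
```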
Since our focus in this work is the application of the system in an open-vocabulary fashion, we make two important modifications to previous works. First, we decompose the language parser into two sub-modules: a tagger network that replaces words in the query with their corresponding concept tags, and a seq2seq network that translates the abstracted sequence to the final program. This setup enables us to parse potentially new vocabulary, as long as the tagger recognizes the corresponding concept correctly. Second, we replace hand-crafted relation primitives and attribute classification with object-concept grounding networks, opting to generalize to unseen concepts by leveraging the similarity semantics of the pretrained word embeddings used to represent the concepts.
Scene encoder
Given an input RGB-D pair of images, we first apply an off-the-shelf object detector (He et al., 2017) on the RGB image for instance segmentation and crop the N detected object instances.
Visual encoder
We pass the cropped RGB images I_n to a pretrained network H that extracts a visual feature vector for each object node of the scene graph.
Grasp synthesis
We utilize a pretrained vision-based grasp synthesis network G to predict a grasp pose for each detected object, which is stored together with its visual features in the corresponding node of the scene graph.
Relation encoder
We encode each pair-wise spatial relation between two objects (n, m) in the scene as an edge feature of the scene graph, combining geometric features of the object pair with a set of extra binary features. We find that the extra binary features are essential for successfully grounding concepts such as “behind,” as they encode more fine-grained relations about the object pair (e.g., overlap between the objects in the x-dimension). See Appendix B for more details.
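As a rough illustration of such binary pairwise features, the sketch below computes simple overlap and ordering indicators from axis-aligned 3D bounding boxes; the exact feature set used by the relation encoder is described in Appendix B, and the code is an assumption-laden example only.

```python
# Hedged sketch of binary pairwise spatial features from 3D bounding boxes.
# The exact feature set is illustrative, not the paper's implementation.
import numpy as np

def binary_spatial_features(box_n, box_m):
    """box_*: dicts with 'min' and 'max' 3D corners of an axis-aligned bounding box."""
    feats = []
    for axis in range(3):  # x, y, z
        overlap = min(box_n['max'][axis], box_m['max'][axis]) - \
                  max(box_n['min'][axis], box_m['min'][axis])
        feats.append(float(overlap > 0))                              # boxes overlap on this axis?
        feats.append(float(box_n['max'][axis] < box_m['min'][axis]))  # n entirely before m
        feats.append(float(box_m['max'][axis] < box_n['min'][axis]))  # m entirely before n
    return np.array(feats, dtype=np.float32)
```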
Language parser
The language parser consists of two sub-modules, a tagger network that identifies concepts in the input query and a seq2seq network for generating the program. To deal with potentially unseen vocabulary, the seq2seq network generates only the primitive functions of the overall program, whose arguments are restored from the query via a tag-conditioned attention linear sum assignment (LSA) module.
Concept tagger
We treat concept tagging similarly to the named entity recognition task in NLP (Kim Sang and De Meulder, 2003): each word in the input query is mapped either to a concept tag (e.g., Color, Material, Category, Relation) or to a null tag if it does not express a concept.
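As a purely illustrative example (the tag names and formatting are hypothetical), the tagging step would abstract the running query as follows before the seq2seq stage:

```python
# Hypothetical illustration of the tagging/abstraction step.
query      = "How many sodas are in front of the white book?"
tags       = {"sodas": "CATEGORY", "front": "RELATION",
              "white": "COLOR", "book": "CATEGORY"}
abstracted = "How many <CATEGORY> are in <RELATION> of the <COLOR> <CATEGORY>?"
```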
Seq2Seq encoder-decoder
We replace words that are mapped to concepts with the corresponding tag and feed the abstracted sequence as input to an RNN-based seq2seq network, enhanced with an attention layer between the encoder and decoder (Bahdanau et al., 2014). A two-layer Bi-GRU (Chung et al., 2014) of hidden size D encodes the input sequence into hidden states, which the attention-based decoder consumes to generate the program one primitive function at a time.
Tag-conditioned attention LSA
For each generated primitive function $\pi_\tau$ that receives concept arguments, only words tagged with the corresponding concept $C_\tau$ should be selected (e.g., $C_\tau = \text{Color}$ for $\pi_\tau = $ filter_color). We filter word tokens that satisfy this constraint and consider their normalized attention scores $\hat{a}_{t\tau} = \{ a_{t\tau} / \sum_t a_{t\tau} \mid c_t = C_\tau \}$. Intuitively, the word $t$ whose hidden state was attended the most when generating the function $\pi_\tau$ corresponds to the argument of that function. However, we experimentally find that when multiple instances of the same primitive appear in the program, the matching argument does not always correspond to the maximum attention score. We therefore select the configuration of unique function-argument pairs $(\tau, t)$ that maximizes the sum of attention scores across functions $\sum_\tau \hat{a}_{t\tau}$, which is equivalent to the linear sum assignment problem and is solved efficiently by the Hungarian matching algorithm (Kuhn, 1955). The cached embedding $e_t$ of the selected word is used as the argument of primitive $\pi_\tau$ for each selected pair.
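A minimal sketch of this assignment step is given below, assuming we already have the decoder attention scores over input words and the concept tag of each word; function and variable names are illustrative.

```python
# Hedged sketch of the tag-conditioned attention LSA step.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_arguments(functions, word_tags, attention, word_embeddings):
    """functions: list of (name, expected_tag) for primitives needing a concept argument
    word_tags: concept tag per input word (None for non-concept words)
    attention: array [num_functions, num_words] of decoder attention scores
    word_embeddings: cached embedding e_t per word
    Returns: dict mapping function index -> argument embedding."""
    args = {}
    # Solve the assignment separately per concept tag, so only compatible words
    # (e.g., Color words for filter_color) compete for each function.
    for tag in set(t for _, t in functions):
        f_idx = [i for i, (_, t) in enumerate(functions) if t == tag]
        w_idx = [j for j, wt in enumerate(word_tags) if wt == tag]
        scores = attention[np.ix_(f_idx, w_idx)]
        scores = scores / scores.sum(axis=1, keepdims=True)            # normalize per function
        rows, cols = linear_sum_assignment(scores, maximize=True)      # Hungarian matching
        for r, c in zip(rows, cols):
            args[f_idx[r]] = word_embeddings[w_idx[c]]
    return args
```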
Concept grounding
The purpose of the concept grounders is dual: (a) to match scene objects to a queried concept (e.g., when executing filtering primitives) and (b) to match a given object to a concept value (e.g., when executing querying primitives).
Figure 3. From left to right: (a) A visual grounder (VG) network is used to ground attribute concepts to object instances and vice versa. The program executor invokes the VG to perform (b) filtering and (c) querying primitives by computing similarity scores for object-concept pairs. A concept memory (MC) provides concept values and their embeddings, enabling the VG to query over all encountered concept values. (d) A spatial grounder (SG) network is used to ground relation concepts to object pairs. The program executor invokes the SG to resolve (e) relations, locations, and hyper-relations. The relation and location primitives can be implemented via the relation grounder, while hyper-relations are resolved via a dedicated hyper-relation grounder network.
Visual grounders
We implement a module F_α per attribute concept α ∈ {Color, Material, Category} that estimates a similarity score between an object’s visual feature and the word embedding of a (potentially unseen) concept value.
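A minimal sketch of such a grounder is given below, assuming object visual features from the scene encoder and pretrained word embeddings of concept values; the architecture, dimensions, and temperature are illustrative assumptions rather than the exact implementation.

```python
# Hedged sketch of an attribute grounder F_alpha.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGrounder(nn.Module):
    """Scores how well each object's visual feature matches a concept word embedding."""
    def __init__(self, visual_dim=2048, word_dim=300, hidden_dim=256):
        super().__init__()
        self.visual_proj = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
                                         nn.Linear(hidden_dim, hidden_dim))
        self.word_proj = nn.Linear(word_dim, hidden_dim)

    def forward(self, obj_feats, concept_emb):
        # obj_feats: [N, visual_dim] for the N scene objects
        # concept_emb: [word_dim] embedding of a (possibly unseen) concept word
        v = F.normalize(self.visual_proj(obj_feats), dim=-1)
        w = F.normalize(self.word_proj(concept_emb), dim=-1)
        return torch.sigmoid((v @ w) / 0.1)  # per-object matching scores in [0, 1]
```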
Table 1. The library of reasoning primitives included in our language. For brevity, we do not enumerate all combinations of primitives and concept arguments, but illustrate the latter as a separate column. Visual modules interface with visual grounders and the scene’s visual features to reason about visual attributes. Spatial primitives interface with spatial grounders to resolve spatial relations, absolute relations (locations), and hyper-relations. Symbolic modules implement basic logic operations to incorporate integer and set semantics.
Spatial grounders
Resolving spatial relations comes in three flavors in our domain, namely: (a) binary relations (e.g., “left of”), which operate on pair-wise relation features of two objects, (b) locations, which are absolute relations resolved for a single object with respect to the whole scene, and (c) hyper-relations. As illustrated in Figure 3, the relation and location flavors are handled by the relation grounder, while hyper-relations are resolved via a dedicated hyper-relation grounder network.
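As an assumption-laden sketch of how the executor might invoke the spatial grounder for a binary relation (the grounder interface, thresholding, and data layout are illustrative):

```python
# Hedged sketch of resolving a binary relation over scene-graph edges.
def relate(anchor_ids, relation_emb, edge_feats, spatial_grounder, threshold=0.5):
    """anchor_ids: indices of previously grounded object(s) acting as the anchor
    relation_emb: word embedding of the relation concept (e.g., 'front')
    edge_feats: dict mapping (n, m) object-index pairs to edge feature tensors
    Returns: set of candidate object indices standing in the relation with any anchor."""
    matched = set()
    for (n, m), feat in edge_feats.items():
        if m in anchor_ids:  # edge from candidate n to anchor m
            score = spatial_grounder(feat.unsqueeze(0), relation_emb)  # assumed interface
            if score.item() > threshold:
                matched.add(n)
    return matched
```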
Primitives and program execution
Primitives library
We define our library of reasoning primitives Π similarly to the CLEVR domain (Johnson et al., 2016) and formally present it in Table 1. The library includes two extra operational primitives, namely: (a) scene, which initializes an execution trace by returning all objects in the scene graph, and (b) unique, which asserts that its input set contains exactly one object and returns it.
Program executor
Primitives are developed as functions in a Python API. Our type system supports basic variable types, as well as two special types for representing an object and an object set through their unique indices in the scene graph nodes V. All functions share the same type system and input/output interface and thus can be arbitrarily composed in any order and length. As in Yi et al. (2018), branching structures due to double argument primitives (e.g., and) are handled via the usage of a stack, allowing program execution as a chain of module calls, each receiving as input the output of the previous step and accessing the stack in case of double arguments. Whenever there is a type mismatch between expected and retrieved inputs/outputs, a suitable response is returned, enforcing interpretability by explaining to the user which reasoning step failed. To speed up computation, we first group all program steps that require concept grounding to do a single batched forward pass per grounder, and mask the network predictions during execution according to the previous steps.
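A minimal, assumption-laden sketch of such a stack-based execution loop is given below; the program format, primitive registry, and arity attribute are illustrative, not the exact implementation.

```python
# Hedged sketch of chain-style program execution with a stack for branching.
class ExecutionError(Exception):
    """Raised with a human-readable message explaining which reasoning step failed."""

def execute(program, scene_graph, primitives):
    stack, current = [], None
    for name, concept_arg in program:
        fn = primitives[name]
        if name == "scene":                   # start a new trace with all objects
            if current is not None:
                stack.append(current)         # keep the previous trace for double-argument primitives
            current = fn(scene_graph)
        elif getattr(fn, "arity", 1) == 2:    # e.g., logical 'and' consumes two traces
            current = fn(current, stack.pop(), scene_graph)
        else:
            current = fn(current, concept_arg, scene_graph)
    return current
```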
Training paradigm
The training process entails two optimization objectives: (a) the correctness of the parsed program and (b) the object-concept matching of the concept grounders. Following insights from prior works (Mao et al., 2019), we train using a curriculum learning approach. In particular, we first train the grounder modules to ground attribute concepts to objects (VG) and spatial concepts to object pairs (SG). To that end, we isolate input/output pairs of filtering, querying, and relation-based operations from the execution traces of our dataset’s program annotations and express them as binary masks over the graph nodes (VG)/edges (SG). We train the grounders on these extracted examples and freeze their weights for the following steps. For language parsing, we first train the concept tagger on a small split of tagged queries and then train the entire language parser following Yi et al. (2018): we select a small, diverse split of the training data, sampling uniformly from all different templates, and train using the ground truth programs with a cross-entropy loss. Finally, we combine the language parser with the grounders and the program executor and train the system end-to-end on the remaining scenes with REINFORCE (Williams, 1992), using only the correctness of the executed program as the reward signal.
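The REINFORCE stage can be sketched roughly as follows, assuming the parser can sample a candidate program together with its log-probability; the method names and the moving-average baseline (one possible reading of the decay weight used in our experiments) are illustrative.

```python
# Hedged sketch of one REINFORCE update for the program parser.
def reinforce_step(parser, executor, optimizer, question, scene_graph, gt_answer,
                   baseline, decay=0.9):
    program, log_prob = parser.sample(question)        # assumed sampling interface
    try:
        answer = executor(program, scene_graph)
        reward = 1.0 if answer == gt_answer else 0.0   # reward = execution correctness only
    except Exception:
        reward = 0.0                                    # failed executions get zero reward
    baseline = decay * baseline + (1 - decay) * reward  # moving-average baseline
    loss = -(reward - baseline) * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return baseline
```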
Experiments
We structure our experimental evaluation as follows: First, we present the details of the synthetic dataset generation and the collected real-world dataset. In the two subsequent sections, we evaluate the visual reasoning capabilities of the proposed model through VQA, where we compare our approach with previous baselines in terms of accuracy, sample-efficiency and generalization to unseen vocabulary. In the next section, we study the transfer performance of our method in real scenes via few-shot fine-tuning of our visual grounder network. We further integrate our method with a robot framework and perform end-to-end experiments for an interactive object grasping scenario, where we examine the distribution of failure modes across system components in scene-instruction pairs with increasing complexity. Finally, we show that our method can be extended to more manipulation tasks via few-shot fine-tuning of the language parser.
Datasets
We present the synthetic and real versions of the dataset we release, termed Household Objects placed in Tabletop Scenarios (HOTS). We refer the reader to Appendix A for more details on both versions.
SynHOTS
We collect from available resources a catalogue of 58 3D object models of five types (fruits (6), electronics (4), kitchenware (18), stationery (17), and edible products (13)), organized into 25 object categories, 10 color, and 8 material concepts. As we strive for natural interaction, we also include instance-level object annotations according to their brand, variety, or flavor (e.g., “Coca-Cola” vs “Pepsi,” “strawberry juice” vs “mango juice,” etc.). We render synthetic scenes in the Gazebo environment (Koenig and Howard, 2004) and generate around 8k training and 1.6k validation RGB-D pairs, additionally equipped with parsed semantic scene graphs containing all location, grasp, attribute, and relation information for each object. For annotating our scene graphs with language data, we build on top of the CLEVR generation engine (Johnson et al., 2016) and produce language-program-answer triplets from synthetic task templates by sampling concepts from the scene graphs. We extend the standard VQA templates of CLEVR to incorporate our designed DSL, as well as extra REF and grasping tasks, ending up with 11 distinct task families spanning a total of 295 task templates with rich variation in phrasing/syntax. For the VQA task (SynHOTS-VQA), we instantiate 66 templates for each scene (6 per task family) and generate around 500k training and 100k validation question-program-answer samples.
HOTS
In order to evaluate the performance of our model in natural scenes, we record a dataset of real RGB-D images captured from a robot’s camera. The real household objects used in this dataset, together with our dual-arm robot setup, are shown in Figure 4. The object catalogue is a subset of the synthetic one but includes a few novel attributes, for a total of 48 object instances with 25 category, 10 color, and 7 material concepts across 108 unique scene configurations. Twenty-two scenes that provide a fair representation of all concepts are held out for potential fine-tuning experiments, and the remaining 86 scenes are used for testing. We extract scene graphs and repeat the language-program-answer data generation step as in simulation, ending up with 5676 scene-question pairs.
Figure 4. A subset of the object catalogue included in the HOTS dataset (left) and an image of our real robot setup from the opposite perspective (right).
VQA evaluation in simulation
Setup
We compare our method with three holistic baselines (Perez et al., 2017; Santoro et al., 2017; Yang et al., 2015) and the original NS-VQA (Yi et al., 2018). The holistic models are trained using the implementation and hyper-parameters from Perez et al. (2017), and NS-VQA is a replica of the original work, with the executor component adapted to incorporate our primitives library. We use a ResNet50 (He et al., 2015) backbone for visual feature extraction and sample 4000 images from our dataset to train the NS-VQA attribute classifiers and our grounders. NS-VQA and our method are pretrained with 300 programs sampled uniformly from all question families and fine-tuned with REINFORCE on the rest of the dataset. We note that our method additionally pretrains the tagger component of our parser with 500 question-tag pairs. We use the Adam optimizer with a batch size of 64 and train for 2k iterations in the pretraining stage and 2M iterations in the REINFORCE stage, using learning rates of 3·10^−4 and 10^−5, respectively. The reward is maximized over a baseline with a decay weight of 0.9.
Accuracy
VQA accuracy (%) per question type and overall for the validation split of our synthetic dataset. The REF column denotes referring expression questions, which do not apply to baselines that are trained for closed-VQA.
Sample-efficiency
We further analyze the sample-efficiency of our method compared to the baselines in Figure 5, both in terms of pretraining and REINFORCE fine-tuning. Regarding tagger pretraining, we see that with a powerful pretrained model such as distilBERT (Sanh et al., 2019) we achieve a 99.8% F1-score on the validation tags with only 500 samples. A GRU baseline with pretrained GloVe embeddings (Pennington et al., 2014) needs 2k samples to achieve the same performance. Regarding supervised pretraining, we see similar performance between NS-VQA and our method, with the latter being more efficient under weaker REINFORCE supervision (2k and 10k question-answer pairs). We attribute this result to our two-step parser: with as few as 180 programs, the training examples most likely do not sufficiently cover the concept vocabulary of the domain for the NS-VQA parser, whereas our method replaces concept words with tags, which are few enough to be sufficiently covered. Finally, our method is the most sample-efficient in terms of required question-answer pairs, with a significant gap compared to holistic approaches, which comes at the cost of just a few hundred question-program annotations for supervised pretraining.
Figure 5. Sample-efficiency experiments on SynHOTS-VQA. From left to right: (left) F1-score of concept taggers versus number of tagged annotations used during pretraining, (middle) VQA accuracy versus number of pretraining programs; different curves indicate different amounts of data used at the REINFORCE stage, (right) VQA accuracy versus number of training question-answer pairs; NS-VQA and our method are pretrained with 500 programs.
Generalization to unseen vocabulary
VQA accuracy (%) in generalization-test splits that contain questions with unseen vocabulary describing Category, Color and Material concepts. Open denotes the use of an unseen word to describe an object at instance-level. We note that a question might contain unseen words from multiple categories, so the Overall column does not correspond to the average.
Adapting to real scenes
Top-1 accuracy (%) for classifying attributes—category (Cat), color (Col), and material (Mat)—as well as execution accuracy for end-to-end REF and VQA tasks in annotated scenes of our HOTS dataset. GT denotes using ground truth attribute labels from scene graphs. The #Data column denotes the number of fine-tuning examples per object instance for the VG.
Interpretable interactive object grasping
In this subsection, we integrate our method with the grasping pipeline of Oliveira et al. (2016) and evaluate its end-to-end behavior in an interactive object grasping task. An illustration of the setup and experiments is given in Figure 6. We conduct several trials in which we randomly place objects on a table and instruct the robot to grasp an object in real time. The scenes always include distractor objects sharing an attribute with the target, requiring the user to use other attributes and/or spatial relations to uniquely refer to the goal object. We note that the instructor is not limited to the concept vocabulary of our domain and can use arbitrary phrasing, potentially outside the syntax of our scripted templates. The interpretable nature of our system allows us to examine the parsed program execution traces and diagnose the source of failures, including: (a) perception, where there is either a localization error or a grounder has produced an incorrect match, (b) reasoning, where the parsed program is incorrect, or (c) grasping, where the grasp execution fails (e.g., due to collisions).
Figure 6. A sequence of snapshots capturing the setup of our robot framework in Gazebo (top) and in a real-world environment (bottom). We generate a random scene and command the robot to grasp a specific item with a text instruction, referring to attributes/relations between objects (in pink). In the snapshots, we demonstrate the robot during the picking action (each-left) and the localization results in RViz (each-right), as well as the parsed program corresponding to the query (each-bottom).
We report results in synthetic scenes separated into four splits of different scene and query complexities (see Figure 7). We generate 10 scenes per split and conduct five trials for each, for a total of 200 scene-instruction pairs. For the real experiments, we conduct a total of 12 trials using objects from the HOTS dataset and the adapted visual pipeline of the previous section. Results are summarized in Table 5. We observe that in both setups the average error rate is similar (20%−25%), with the reasoning module being the most robust across all trials; exceptions are a few queries in the complex-question splits. Such failures are mostly due to unique phrasing of the instruction by the human instructor, with one case of referring to an unknown spatial concept (e.g., “between”). Perception errors occur more frequently in the crowded scene setup, due to partial object views caused by occlusion. We include a video with robot demonstrations as supplemental material. The overall results showcase that the system can indeed serve as an accurate and interpretable interactive robotic grasper, while being relatively robust to free-form instructions.
Figure 7. Example trials from the four splits used for simulated grasping experiments, namely: (a) scattered scenes—simple queries, (b) crowded scenes—simple queries, (c) scattered scenes—complicated queries, and (d) crowded scenes—complicated queries. The green box denotes the target item, red denotes a distractor item of the same attribute, and the dark box denotes all items involved in the reasoning process.
Table 5. Evaluating the system for an interactive object grasping task in synthetic (top) and real (bottom) scenes of incremental query and scene complexity. The interpretable nature of our approach allows us to decompose the failure modes across the different modules.
Extending to more manipulation tasks
In this subsection, we explore how efficiently our model can adapt to more complex manipulation tasks beyond grasping. To that end, we implement two extra control primitives which, like grasp, act as terminal nodes in the parsed program, receiving the unique indices of the objects to manipulate and controlling the arm based on the grasp poses of those objects with an IK-solver. In particular, we implement: (a) pick and place, which receives two object inputs and a relation concept argument that map to what to pick, where to place, and how to place it, respectively, and (b) sort, which receives a set of objects to sort into a fixed container item (see Figure 8; a code sketch is given below). We structure new templates for these tasks and generate 10 instruction-program pairs for 50 novel synthetic scenes with the same constraints as the grasping task, for a total of 500 pairs. We fine-tune our language parser on the new instructions (while keeping the rest of the system fixed) and report results in Table 6, using the same setup as the previous section for 100 trials per task in simulation. As with grasping, we observe that the reasoning module is robust to query complexity, and task success is limited only by the perception and grasping modules in cases of crowded scenes. We further integrate policies obtained through behavioral cloning as control primitives and demonstrate more complex, long-horizon manipulation tasks in our supplemental material.
Figure 8. Extending to more manipulation tasks: (top) pick an object and place it relative to another object, and (bottom) sort all objects into a pre-defined container according to a reference object.
Table 6. Evaluating the system for interactive manipulation tasks beyond grasping in simulation.
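The following is a rough sketch of how such control primitives could be registered as terminal nodes of the program. The robot-interface calls, the offset_pose helper, and the relation handling are hypothetical and only illustrate the structure described above.

```python
# Hedged sketch of extra control primitives as terminal program nodes.
# robot.execute_pick / robot.execute_place and offset_pose are hypothetical interfaces.

def offset_pose(anchor_pose, relation):
    """Hypothetical helper: compute a placement pose relative to anchor_pose
    according to the relation concept (e.g., 'left', 'on top of')."""
    return anchor_pose  # placeholder for illustration

def pick_and_place(pick_id, place_id, relation, scene_graph, robot):
    """Pick object pick_id and place it relative to object place_id."""
    grasp_pose = scene_graph.nodes[pick_id]["grasp_pose"]   # stored by the scene encoder
    target_pose = offset_pose(scene_graph.nodes[place_id]["pose"], relation)
    robot.execute_pick(grasp_pose)    # IK-based pick using the stored grasp pose
    robot.execute_place(target_pose)  # place according to the relation concept

def sort_objects(object_ids, container_id, scene_graph, robot):
    """Sort a set of objects into a fixed container item."""
    for obj_id in object_ids:
        pick_and_place(obj_id, container_id, "inside", scene_graph, robot)
```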
Comparisons with foundation models
In this section, we explore the comparative performance, as well as the integration potential, of our neurosymbolic framework with approaches relying on modern foundation models (Tziafas et al., 2023; Tziafas and Kasaei, 2024), such as LLMs (OpenAI, 2023) and VLMs (Radford et al., 2021), for zero-shot semantic parsing and grounding. To that end, we perform three experiments studying relative performance in different parts of the pipeline, including parsing, grounding, and end-to-end grasping from a language instruction. To strengthen our evaluation, we also utilize the OCID-VLG dataset (Tziafas et al., 2023), which provides referring expression queries accompanied by parsed program, ground-truth mask, and grasp annotations for 1763 unique scenes from the OCID dataset (Suchi et al., 2019). To examine out-of-distribution generalization, we use the novel-classes split provided by the authors, which includes object categories unseen during training. We note that NS-MAN is not equipped to deal with novel concept types, since it is trained only on those existing in the training data. However, we conduct evaluations in this split to explore the trade-offs between foundation model-based approaches and our neurosymbolic approach, as well as the potential for integrating the two.
Semantic parsing
Parsing accuracy (%), measured as the percentage of correctly generated programs for the input query with ground-truth perception. The LLM transfers few-shot, given 8 query-program examples from each dataset, while the remaining approaches are trained on the full provided data. The val set contains seen concept types and vocabulary, while the test set contains novel vocabulary for SynHOTS and novel vocabulary and concept types for OCID-VLG.
Visual grounding
Grounding accuracy (%), measured as the percentage of correctly grounded referring expressions, that is, final masks with IoU > 0.5 against the ground-truth mask. Combining VLM-based grounders with a semantic parser improves the reasoning capability of the VLMs in cases of spatial reasoning queries.
Interactive object grasping
Jacquard metric J@1 (%), measured as the percentage of grasp predictions that have an IoU > 0.25 and a relative angle within 30° from ground-truth grasps. The val set contains seen query vocabulary, while the test set contains novel object queries.
Overall, we interpret the above results as a trade-off. LLM/VLM-based methods offer significant benefits when considering completely unseen concepts. NS-MAN achieves better or on-par results when in-distribution, or when considering unseen vocabulary of the same concept types, while being trained completely bottom-up from synthetic data. On the one hand, if generalization is required, foundation model approaches are a favorable choice, at the additional expense of latency, monetary cost, privacy, and other factors.
On the other hand, if efficiency and low cost are the priority, and we can assume that synthetic training data has been generated for all concept types that will ever be observed by the robot, our proposed framework serves as an effective lightweight alternative. Further, as the results of this section suggest, LLMs for semantic parsing and/or VLMs for grounding can be drop-in replacements for the corresponding modules of NS-MAN, thus boosting generalization to unseen concept types if in-the-wild scenarios are to be considered.
Discussion
In this section, we reflect on our results with regard to specific topics and discuss limitations and future work.
Adapting to novel content
One important benefit of the modular versus holistic design is the ability to adapt to novel content by only adapting the related module, instead of the entire pipeline (Tziafas and Kasaei, 2022). We believe that this translates to important benefits in terms of development cycles, as it alleviates the need for collecting large-scale multimodal data for training an end-to-end model. In summary, the steps required for the proposed method to extend to novel concepts/tasks are: (a) for novel visual concepts or a new visual domain, few-shot fine-tune the relevant concept grounder (e.g., the VG, as in our real-scene experiments), and (b) for novel manipulation tasks, implement the corresponding control primitive in the DSL and fine-tune the language parser with a few hundred instruction-program pairs, while keeping the rest of the system fixed.
Handling failure via interactivity
With this work, we wish to highlight the practical benefits of interpretability in the context of human–robot interaction applications. Beyond ease of debugging and transparency of the model, this feature can augment models functionally by bringing humans in-the-loop. For example, by adding suitable responses when a module fails at execution time, we can employ the system in an online dialogue setup, enabling the user to give feedback on failures caused by either ill-formed queries or other ambiguities in the scene (see Figure 9). The unique primitive requires its input set to be unitary, and therefore execution fails in the presence of multiple matched objects; the system raises a relevant template response back to the user and integrates their feedback to correct the generated program. To achieve this, we process the feedback query to identify newly present concepts from our concept memory. When new concepts are identified, the parsed program is re-structured appropriately and the system re-runs execution. By type-checking and adding failure wrappers in all of our primitive implementations, the system is able to identify sources of failure and return a suitable response to the user. Such failure handling behaviors allow our model to interact naturally with human users in a dialogue setting.
Figure 9. In this example, the original query (first) is ill-posed, as it refers to a soda object while two sodas are present. This results in failure of the unique primitive, which is prompted back to the user (second). The human responds with additional feedback (third), which results in a correct final grasping behavior (fourth).
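A rough sketch of this failure-driven dialogue behavior is given below, under the assumption that primitives raise typed exceptions carrying template responses and that the parser exposes a hypothetical refine step for integrating feedback.

```python
# Hedged sketch of failure-aware execution with user feedback; the response
# templates and the parser.refine interface are illustrative assumptions.
class AmbiguityError(Exception):
    """Raised by a primitive with a human-readable template response."""

def unique(object_ids, scene_graph):
    if len(object_ids) == 0:
        raise AmbiguityError("I could not find such an object in the scene.")
    if len(object_ids) > 1:
        names = [scene_graph.nodes[i]["category"] for i in object_ids]
        raise AmbiguityError(f"There are {len(object_ids)} matching objects "
                             f"({', '.join(names)}). Which one do you mean?")
    return object_ids[0]

def run_with_feedback(query, scene_graph, parser, executor, ask_user):
    program = parser(query)
    while True:
        try:
            return executor(program, scene_graph)
        except AmbiguityError as err:
            feedback = ask_user(str(err))               # e.g., "the red one"
            program = parser.refine(program, feedback)  # re-structure with the new concepts
```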
Training time, real-time performance, and dynamic environments
Regarding training time, the entire curriculum training process discussed in Section 3.5 takes around 10 hours on a consumer GPU for the 500k samples of SynHOTS-VQA. For inference, our end-to-end system (including the pretrained networks) can produce a program at 4 fps on our hardware setup, with the main bottleneck being Mask R-CNN for localization. In the future, we plan to integrate high-efficiency detectors to increase our throughput. Similarly to the previous subsection, failure handling in the implemented control primitives can be used to simulate closed-loop control, as in dynamic environments the world state might change during execution. A failure wrapper around the grasp primitive verifies that the target object state is the same as when the execution trace started (i.e., at the scene primitive) and otherwise re-runs the program with the updated state.
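A hedged sketch of such a wrapper is given below, assuming the scene encoder can be re-run to obtain a fresh scene graph and that a simple position check suffices to detect a changed world state; the threshold and interfaces are illustrative.

```python
# Hedged sketch of a state-checking failure wrapper around the grasp primitive.
import numpy as np

def grasp_with_state_check(target_id, program, scene_graph, perceive, execute_program,
                           robot, tol=0.03):
    old_pos = np.asarray(scene_graph.nodes[target_id]["position"])
    fresh_graph = perceive()                           # re-run the scene encoder
    new_pos = np.asarray(fresh_graph.nodes[target_id]["position"])
    if np.linalg.norm(new_pos - old_pos) > tol:        # world state changed during execution
        return execute_program(program, fresh_graph)   # re-run reasoning on the updated state
    robot.grasp(scene_graph.nodes[target_id]["grasp_pose"])
```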
Portability
Besides sample-inefficiency, holistic approaches are limited to the agents/environments that were used to generate the training data. In contrast, our approach disentangles the actual policy (represented as a program) from the perceptual and motor components (represented as functions in the program), and hence can be transferred to new agents/environments with minimal effort. Similar to our experiments in Section 4.4, where we only adapt one module (VG) to transfer the overall system to a new visual domain, one could further replace the grasping module to use different arms or grippers and transfer to completely new robots and environments.
Conclusion
In this work, we bring together deep learning techniques for perception, grasp synthesis, and NLP with symbolic program synthesis and execution in an end-to-end hybrid system aimed at interactive robot manipulation applications. We design a dedicated language that implements visuospatial reasoning as primitive operations. We exploit linguistic cues in the input instruction to synthesize a program composed of such primitives. Programs interface with visual/spatial grounding and grasping modules to ground concepts and control the robot, respectively. We generate a synthetic tabletop dataset with rich scene graph and language-program annotations, paired with a real RGB-D scenes dataset, both of which we make publicly available. Extensive evaluation through a VQA task showcases that our method achieves near-perfect accuracy in-domain, while being fully interpretable and sample-efficient compared to baselines. Generalization experiments show that the vocabulary-agnostic formulation of our language and model enables better generalization to unseen concept words compared to previous works. We also show that with our modular design, the system can transfer to natural scenes with few-shot adaptation of the visual grounder, as well as to more manipulation tasks with few-shot adaptation of the language parser module. We integrate our model with a robot framework and perform experiments for an interactive object picking task, both in simulation and with a real robot. Robot experiments demonstrate a high success rate and robustness to user instructions, with interpretability leveraged to actively detect reasoning failures and inform the user.
Acknowledgments
We gratefully acknowledge partial support from Google DeepMind through the Research Scholar Program for the project “Continual Robot Learning in Human-Centered Environments”.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
