Sage Journals: Discover world-class research

Abstract

Photorealistic synthetic data and novel rendering techniques significantly advanced computer vision research. However, datasets focused on computer vision applications cannot be easily applied to robotics because they typically lack physics-related information. This, combined with the difficulties of realistically simulating dynamic worlds and the insufficient photorealism, flexibility, and control options of common robotics simulation frameworks, hinders progress in (visual-)perception research for autonomous robotics. For instance, most Visual Simultaneous Localization and Mapping methods are passive, developed under a (semi-)static environment assumption, and evaluated on just a limited number of pre-recorded datasets. To address these challenges, we present a highly customizable framework built upon NVIDIA Isaac Sim for Generating Realistic and Dynamic Environments—GRADE. GRADE leverages Isaac’s rendering capabilities, physics engine, and low-level APIs to populate and manage realistic simulations, generate synthetic data, and evaluate online and offline robotics approaches, including Active SLAM and heterogeneous multi-robot scenarios. Within GRADE, we introduce a novel experiment repetition approach that allows environmental and scenario variations of previous simulations within physics-enabled environments, enabling flexible and continuous testing, development, and data generation. We then use GRADE to collect a high-fidelity and richly annotated synthetic video dataset of indoor dynamic environments. With that, we train detection and segmentation models for humans and successfully address the syn-to-real gap. We then benchmark state-of-the-art dynamic V-SLAM algorithms, revealing their limitations in tracking times and generalization capabilities, and evidencing that top-performing deep learning models do not necessarily lead to the best SLAM performance. Code and data are provided as open-source at https://grade.is.tue.mpg.de.

Keywords

Robot simulation active robotics synthetic data syn-to-real learning scene understanding dynamic SLAM

1. Introduction

Directly conducting robotic experiments in the real world to test and validate new approaches can pose safety risks. Unforeseen failures of methods and sensors, corner cases, or loss of control of the autonomous robot platform may easily lead to damage or injuries. This problem is further exacerbated when the robot relies on exteroceptive sensors: noise, domain shifts, and the lack of formal performance guarantees or uncertainty quantification in most deep learning (DL) models can make behaviors unpredictable. Surely, pre-recorded datasets have been widely used to develop and evaluate new approaches. Those designed for computer vision research, such as Lin et al. (2014); Saini et al. (2022); Varol et al. (2017), are visually appealing due to the use of real-world images or advanced rendering engines like Blender¹ and Unreal Engine (UE²). However, the absence of basic sensor readings (e.g., IMU, LiDAR), sensor states (e.g., position, orientation, velocity), and (in general) temporal information restricts their applicability in robotics contexts, where physics- and time-related information is necessary.

At the same time, gathering ground-truth data for robotics poses significant challenges. In addition to the safety risks, accurately measuring physical quantities can be intricate, time-intensive, and impractical. Even when feasible, it demands costly specialized sensors requiring rigorous calibration and synchronization, both among themselves and with other hardware like cameras, making the process particularly difficult. For example, despite the centrality of the problem to higher-level tasks, there exist only a handful of real-world SLAM benchmark datasets with ground-truth information, especially for dynamic environments (Bujanca et al., 2021; Burri et al., 2016; Geiger et al., 2013; Sturm et al., 2012). Furthermore, relying solely on already available datasets for real-world robotic applications is not straightforward due to differences in robot form factors (e.g., sensor placement), sensor configurations (e.g., camera focal length, sensor publishing frequency), and different noise models. This requires researchers to rely on their own data or a limited selection of datasets, which can overfit specific scenarios and hinder reproducibility, robustness, and broader deployment. Finally, datasets are “fixed” in time, as one cannot introduce new sensors or modify recorded environmental conditions (e.g., removing dynamic elements or changing lighting conditions) after data collection. This limitation further restricts their usage, making them inadequate for evaluating methods that require real-time dynamic decision-making, such as obstacle avoidance, environment interaction, or Active SLAM.

Therefore, to overcome the static nature of pre-recorded datasets and facilitate the safe development and evaluation of robotics methods, simulation engines such as Gazebo (Koenig and Howard, 2004) and WeBots (Michel, 2004) are widely used. However, with those, it is often challenging to (i) obtain and finely control realistic animated rigid and non-rigid assets, (ii) simulate dynamic environments, (iii) customize and control the simulation engines, and thus (iv) bridge the gap between simulations and the real world. Moreover, the low visual fidelity of many robotics simulators exacerbates the sim-to-real transfer gap. As a result, most robotics research is conducted in highly controlled scenarios with several simplifying assumptions. Indeed, although (non-)rigid moving objects are common in real life and significantly affect vision-based localization or navigation methods, many current approaches still assume a (semi-)static world (Abaspur Kazerouni et al., 2022; Bujanca et al., 2021; Saputra et al., 2018) or use simplified dynamic environments composed of basic 3D shapes (Shao et al., 2024; Wu et al., 2024). Overall, this is detrimental to the development and evaluation of robotic systems which depend heavily on visual perception to operate reliably in dynamic environments. Thus, it is imperative to have a simulation framework that incorporates at least the following key characteristics: (i) physical realism—to correctly simulate dynamics, (ii) photorealism—to reduce the perception gap, (iii) low-level access—to allow full control, and (iv) the capability to simulate dynamic entities—to enable widespread deployment. Integration with ROS, although optional, would further encourage wider adoption of such a framework, given its common use in developing higher-level software that functions simultaneously in both simulation and real robots. In short, an easily controllable simulation that closely resembles the real world with a minimal sim-to-real gap is essential to enable quick and reliable real-world deployment of robotic methods.

To address these issues, we present a solution for Generating Realistic And Dynamic Environments—GRADE. GRADE is a flexible, controllable, customizable, photorealistic, ROS-integrated pipeline that can produce visually realistic data in physically enabled environments. GRADE is built directly upon NVIDIA Isaac Sim,³ leveraging its rendering and PhysX engines. In contrast to existing methods, our work is not merely a new benchmarking approach or an application-specific platform; instead, it provides an open system that can be easily expanded towards different research goals. We make available a set of functions, tools, and case studies that serve as an entry point with low-level access to Isaac Sim’s capabilities. This enables researchers to easily customize simulations to meet their needs and further bridge the gap between simulation and real-world scenarios. A sample image generated with GRADE, displaying diverse subjects, environments, and overlaid robots, is shown in Figure 1.

Figure 1.

Example scenes generated by GRADE with overlaid drones. This figure shows two different scenarios simulated with GRADE. [Left] An outdoor savanna environment with manually placed animated zebras (the same scene is also present in Figure 4e). The savanna world and the animated zebras are freely available assets we obtained from the Unreal Engine and SketchFab marketplaces. This scene is used only to evaluate the generalization of the method to different environments and assets. [Right] An indoor environment with animated humans (also present in Figures 4b-4d Figures 6 Figures 8). The UAVs and the UGV (in the right image only) are captured in the scene itself from an external point of view and then manually overlaid to highlight their movement in the environment (similar to Figure 9a and 9b). The 3D-Front indoor environment (on the right), populated with animated SMPL humans from the Cloth3D dataset, resembles a scene that we used to generate the data used for training our syn-to-real detection and segmentation models in Section 6.2 and to evaluate Dynamic V-SLAM approaches in Section 6.3. This indoor scene has been automatically created using our data generation procedure, including the placement of the dynamic assets.

This work presents different case studies highlighting GRADE’s versatility, including visual data collection (Section 4.1), Active V-SLAM, and heterogeneous multi-robot simulations (Section 4.2). Among those, we also introduce a novel experiment repetition procedure (Section 4.3) that enables the exact reproduction of simulation trials under varying environmental conditions and adjustments to the robot’s settings and equipment, all within a physically controlled environment. We then use GRADE to automatically generate a novel extensive dataset, which we release publicly, collected in indoor dynamic environments (Section 5). We employ this dataset, consisting of more than 615K frames, to assess the visual realism of the simulation through extensive experiments on human detection and segmentation with YOLOv5 (Jocher et al., 2022) and Mask R-CNN (He et al., 2017), demonstrating strong sim-to-real performance (Section 6.2). Indeed, our results highlight that pre-training with GRADE-generated data enables models to outperform the baseline on the COCO (Lin et al., 2014) dataset. Moreover, training with synthetic images alone achieves results comparable to the baseline, even without any fine-tuning, on the TUM RGB-D (Sturm et al., 2012) dynamic sequences. Using GRADE, we also extensively benchmark state-of-the-art indoor Dynamic V-SLAM algorithms (Section 6.3). Our evaluations provide evidence of their limited generalization capabilities, as they fail to either accurately or completely track trajectories. Additionally, contrary to common belief, we show that using the best-performing deep learning model does not always yield the best results in Dynamic V-SLAM scenarios. All our source code and generated data are made freely available to the community, thanks to our choice of using only freely available assets as the foundational components of our work.

The rest of this paper is organized as follows: in Section 2, we review the related work about (i) robotics simulators, (ii) indoor environments datasets, and (iii) simulated animated humans. In Section 3, we introduce and detail the main components of the proposed framework, GRADE, in four main aspects: (i) asset preparation, (ii) robot creation and control, (iii) simulation management, and (iv) post-processing tools. Section 4 is then dedicated to exemplifying four different case studies implemented with GRADE. In there, we also introduce our novel experiment repetition approach. Following that, in Section 5 we provide details of the data generation procedure and the datasets released with this work. The description and analysis of our experiments and results are reported in Section 6. These include syn-to-real learning performance using the generated data and our evaluations of state-of-the-art Dynamic V-SLAM methods using both synthetic and real data. Finally, we report our conclusions and final remarks in Section 7.

2. Related works

Here, we present the state-of-the-art of robotics simulators (Section 2.1) and of the main components of our data generation procedure, namely, indoor environments datasets (Section 2.2) and simulated animated humans (Section 2.3).

2.1. Robotics simulators

Gazebo (Farley et al., 2022; Santos Pessoa De Melo et al., 2019) is one of the go-to choices thanks to its simplicity, reliable physics engine, and ROS (Quigley et al., 2009) integration. However, it lacks photorealism and full simulation control, supports only a limited range of assets and worlds, and struggles to deliver real-time performance, even for single robots in simple worlds with minimal rendering necessities (Abbyasov et al., 2020; Noori et al., 2017; Platt and Ricks, 2022). For example, Gazebo has been used in the context of bridging the perception gap between real and simulated environments by Bayraktar et al. (2018). They introduced ADORESet, a hybrid image dataset. The combination of real and synthetic images in ADORESet aims to improve the robustness of computer vision systems by leveraging the strengths of both data types. However, contrary to what we will show in our results (Section 6.2), their data does not generalize to the real world as shown in Table 6 of their paper. This is likely due to the low realism of Gazebo. Alternatives such as BenchBot (Talbot et al., 2020), AirSim (Shah et al., 2017), Ai2Thor (Kolve et al., 2017), iGibson (Shen et al., 2021), AI-Habitat (Savva et al., 2019), and Sim4CV (Müller et al., 2018) all lack essential features such as low-level simulation controllability, ROS integration, or realistic physics and visual fidelity. Additionally, some simulators model only rigid objects (Koenig and Howard, 2004; Manolis Savva* et al., 2019) or do not include dynamic assets, as this would introduce challenges for their correct placement, management, or generation. Finally, computer-vision-focused simulators like Sim4CV or Kubric (Greff et al., 2022) are difficult to adapt for robot simulations, as they lack many robotics-specific sensors, accurate physics simulations, and support for robotics platforms and ROS.

Among robotic simulators, AirSim seeks to bridge the visual realism gap by building on top of Unreal Engine. However, it provides limited APIs, lacks support for custom or multiple heterogeneous robots, and does not enable direct joint control. Its native integration with ROS is also loose and incomplete.⁴^–⁶ Ai2Thor, designed primarily for AI and visual tasks, is not customizable for general robotics purposes, as it lacks essential sensor interfaces, such as IMUs and LiDARs,⁷^,⁸ and does not offer native ROS support.⁹ Similarly, AI-Habitat is mainly focused on navigation tasks. Although a community plug-in exists for ROS integration (Chen et al., 2022), it is an external package with limited support rather than a core feature, for example, currently it does not support ROS2 or Habitat2.0.¹⁰

Notably, Habitat3.0 (Puig et al., 2023), which was released recently and concurrently with this work, allows the integration of only human animations based on the SMPL-X model (Pavlakos et al., 2019), resulting in a much narrower scope compared to our approach. Moreover, both Habitat2.0 (Szot et al., 2021) and 3.0 support only rigid objects, limiting the realism of the simulation, a problem not present in Isaac Sim. iGibson (Shen et al., 2021) focuses instead on interactive environments with enhanced characteristics like temperature or wetness. However, its initial version was not customizable or expandable to different tasks, such as navigation and localization, and offered limited visual realism. To solve the realism issue, it was recently ported to Isaac Sim as OmniGibson (Li et al., 2022). Their motivations for using Isaac Sim align closely with ours, that is, recognizing the limitations of other simulation frameworks, leveraging Isaac’s physical and visual realism, and allowing for higher flexibility. However, different from GRADE, OmniGibson primarily functions as a plug-in to add characteristics such as temperature and dirtiness control focusing on indoor activities, without any simulated dynamic humans,¹¹ and it was published at the same time our work was first released.

Recent simulation platforms have also introduced novel approaches to robotic learning and environment modeling. Genesis,¹² a newly proposed physics engine, aims to support general-purpose robotics, embodied AI, and physical AI applications. While details remain unpublished, it appears to offer diverse material simulation, a lightweight robotics environment, and high-fidelity rendering. If integrated with ROS, it could rival Isaac Sim and GRADE. Similarly, RoboGen (Wang et al., 2024b) employs foundation and generative models in a propose-generate-learn cycle for autonomous skill acquisition but lacks ROS support and focuses on RL and manipulation tasks. Therefore, a more direct comparison for RoboGen would be Isaac Lab.¹³ (formerly Isaac Gym¹⁴) which is specifically designed for training RL and robotic tasks.

Finally, BenchBot (Talbot et al., 2020) and its extension BEAR (Hall et al., 2022) are two solutions aimed at introducing a procedural way to benchmark (active) SLAM methods using Isaac Sim. However, they do not include any dynamic assets natively and, as a benchmark suite, are a closed system by nature. They also employ fixed control policies that could be unrealistic for most robots (e.g., 1 cm and 1°goal position accuracies). Then, due to their limited scope and the additional APIs layers between the user and the simulator itself, they lack the desirable customization possibilities, while being quite limited in their scope. For example, it is hard to integrate already-developed methods to control robots, or adapt the system to different platforms or tasks, as they only provide a limited set of predefined actions (i.e., move_[next,angle,distance]).

In contrast to previous approaches, GRADE supports multiple robots, Software-In-the-Loop (SIL), and ROS packages, and offers a customizable system where tools, settings, and simulation runs can be personalized to meet the specific needs of the researcher at the same time. By leveraging Isaac Sim, GRADE provides a highly flexible simulation system where various components can be adjusted, modified, or redefined. Its core strength lies in a modular architecture allowing researchers to customize the simulation pipeline to fit their specific needs. Unlike simulators focused on benchmarking particular approaches like Active SLAM, GRADE supports diverse experiments, including those that bypass the physics engine for efficient photorealistic data generation. It also introduces an independent automatic procedural asset placement system that can be replaced as needed and is not restricted to specific robots or perception systems. By exposing low-level functionalities, GRADE enables a degree of customization that is difficult to achieve in many existing frameworks. This ensures that researchers can modify and personalize their pipeline—from scene generation to control execution—making it a powerful tool for robotics and computer vision research.

2.2. Indoor environments datasets

There are two main ways in which we can represent indoor scenes within a simulation environment (Roldão et al., 2022): scans of real-world environments or posed meshed objects. Using scans of the real-world, for example, HM3D (Ramakrishnan et al., 2021), Matterport3D (Chang et al., 2017), Gibson Env (Xia et al., 2018), SceneNN (Hua et al., 2016), Replica (Straub et al., 2019), or Structured3D (Zheng et al., 2020), poses several issues. First, those are non-interactive environments in which all the objects are non-movable. Second, any new object or asset placed within the environment will not be lit correctly and will not realistically affect the scene with shadows or reflections. Finally, many of these present various artifacts, for example, unrealistic-looking objects, holes in the reconstruction due to reflective surfaces or unmapped areas, and uneven surfaces (Roldão et al., 2022).

Using worlds based on meshed 3D assets addresses these problems while also allowing randomization (e.g., on textures, and object placement) and, eventually, interaction. However, datasets based on those, like ML-Hypersim (Roberts et al., 2021) and InteriorNet (Li et al., 2018), usually rely on non-freely available elements and only release rendered images, making them unusable for our purposes. These factors limit their adoption, reproduction, and expansion. Furthermore, the InteriorNet simulator has not been made available, while HyperSim’s engine is not physics-based and its sequences relate only to very short trajectories (just 100 frames). ProcThor (Deitke et al., 2022) is a recently developed framework to procedurally generate environments. However, it is limited in the quality of the assets and usable only within the Ai2THOR suite, which is focused on visual AI rather than robotics and offers no ROS support. OpenRooms (Li et al., 2021b) has not yet released any assets or CAD models publicly. 3D-Front (Fu et al., 2021a, 2021b) is a large publicly available dataset with meshed, professionally designed, and semantically annotated room layouts. This is, by far, the largest dataset available nowadays based on meshes that can be adopted. However, the annotations are not perfect and objects sometimes co-penetrate each other (Khanna et al., 2024). Finally, HSSD (Khanna et al., 2024) is a synthetic Matterport-like dataset of indoor scenes. While this is a viable alternative to 3D-Front, it is still much smaller and does not provide light sources. The five environments released with BEAR (Hall et al., 2022) in five variations each are only slight modifications of worlds commercially available from Evermotion. Finally, we must mention that recently a mesh-based generation strategy for indoor environments, Infinigen Indoor (Raistrick et al., 2024), has been released. As it is already usable with Omniverse, it can be easily integrated into GRADE, allowing it to scale the automatic testing and data generation considerably by removing the necessity of limited mesh-based datasets.

Commercial solutions such as ArchVizPRO¹⁵ and Evermotion¹⁶ offer high-quality assets but are not freely available. In GRADE, we adopt 3D-Front for our simulations due to its accessibility, large variability, and mesh-based nature, which eliminates lighting inconsistencies. As discussed in Section 3.1.1, beyond seamlessly integrating them with Isaac Sim in our data generation procedure and in our custom automatic 3D-based asset placement strategy (Section 3.1.4), we further enhance these environments with randomized textures and lighting conditions and partially refine the semantic mapping during conversion.

2.3. Simulated animated humans

Most dynamic content in indoor scenes comes from human movement. In V-SLAM and autonomous robotics, handling dynamic elements is crucial, as they disrupt key processes (Bescos et al., 2018; Liu et al., 2022; Xu et al., 2025) (e.g., loop closures, visual odometry) or necessitate the implementation of additional techniques (Wang et al., 2023) (e.g., dynamic obstacle avoidance). These challenges are often addressed using DL methods for detection, segmentation, and motion prediction (Bescos et al., 2018; Wang et al., 2020, 2023), which require large-scale ground-truth datasets.

Collecting real-world GT human motion data is limited to controlled setups like Vicon Halls or motion capture (MoCap) systems (Mahmood et al., 2019), which use multi-camera marker-based tracking for high-precision joint estimation. The humans can then be represented virtually as parametric 3D human models that can be used for different tasks like 3D human reconstruction (Saito et al., 2019; Xiu et al., 2023) and pose estimation (Li et al., 2021a; Saini et al., 2022), even from single images. Those models can also provide fine control over pose, shape, and motion. Commonly used human body models include SCAPE (Anguelov et al., 2005), GHUM (Xu et al., 2020), and the SMPL series (Loper et al., 2015; Pavlakos et al., 2019; Romero et al., 2017). SMPL is arguably the most widely adopted thanks to its flexibility, capability to model hands and facial expressions, and compatibility with various simulation engines.

Unfortunately, MoCap and Vicon systems support only a narrow range of subjects, clothing, and scenarios. To overcome these constraints, synthetic data has become increasingly popular, leveraging parametric 3D human models rendered in engines like Blender and Unreal Engine (Black et al., 2023; Saini et al., 2022). However, most of them have significant limitations. Many are generated by compositing human figures onto static backgrounds (Black et al., 2023; Ebadi et al., 2022) or capturing single-frame images rather than full video sequences (Saini et al., 2022; Yang et al., 2023), often leading to artifacts such as floating humans and incoherent placements (Ebadi et al., 2022; Pumarola et al., 2019). Clothing representation is another challenge, as most SMPL-based datasets lack 3D clothing (Varol et al., 2017) or rely on explicit point-cloud-based models (Heming et al., 2020; Su et al., 2023; Wang et al., 2024a), which are difficult to edit and integrate into simulations. This is partly due to their use in rendering engines without physics simulation and the predominant research focus on human shape over clothing dynamics. Commercial solutions like RenderPeople²³ or CLO²⁴ offer realistic clothed human models, but freely available datasets with animated clothed SMPL humans remain rare. Notable exceptions include Cloth3D (Bertiche et al., 2020) and BEDLAM (Black et al., 2023).

As discussed in Section 1, existing datasets are often inadequate for robotics. They lack key information needed for autonomous systems, such as camera states, IMU readings, scene depth, and point-cloud data. Additionally, they are non-interactive, limiting their applicability in developing autonomous systems that must respond to human movement in real-time. Advancing robotics research in dynamic environments requires integrating animated human assets into realistic, robot-focused simulations, particularly for tasks like obstacle avoidance and Active V-SLAM.

In GRADE, we use Cloth3D and AMASS as primary sources of animated human models for indoor dynamic environments. Cloth3D provides diverse, physically plausible clothing deformations, while AMASS offers a large corpus of high-quality human motion sequences. To integrate them into Isaac Sim, we developed a custom SMPL converter, detailed in Section 3.1.3. Note that Isaac Sim also supports realistic physics-based clothing simulation, but we do not use this capability in this work.

3. Materials and methods

Following the logical structure depicted in Figure 2, we outline here the key components of GRADE: asset preparation and placement (Section 3.1), robot creation and control (Section 3.2), simulation management (Section 3.3), and post-processing tools (Section 3.4).

Figure 2.

Recap of the main components of the GRADE framework. With a blue background, we highlight the software developed within this work’s scope and reference the specific repository in the footnotes.

3.1. (Non-)Rigid assets preparation and placement

Isaac Sim relies on the Universal Scene Description (USD) format, thus requiring the conversion of assets into USD files before their integration into the simulations. This process poses several challenges, as many objects or complex animated models cannot be easily or directly converted. Nonetheless, adopting the USD format significantly enhances the overall flexibility of the system, enabling the inclusion of assets from various sources, including UE, Blender, AutoCAD, Maya, and more, through Omniverse Connectors²⁵ and custom software—a common limitation for other simulation tools, such as Gazebo or Habitat.

3.1.1. Environments

First, we prepare the assets that will be the robot’s working environments within our simulation. Although Isaac Sim provides a converter for common OBJ and FBX files to the USD format, this process often fails with complex models and hierarchies. Moreover, additional processing, like incorporating semantic classes, exporting supplementary data, or adjusting the scale of the environments, is generally desirable. Therefore, we customize BlenderProc (Denninger et al., 2023) to enable a reliable conversion and preparation of our environments to the USD format. Specifically, we modify its 3D-Front processor by fixing the texture generation, scaling of the assets, merging geometries, and refining the semantic mapping procedure adopted specifically for the 3D-Front dataset (Fu et al., 2021a) (our main source of environments) to partially correct its wrong mappings.²⁶ We then export the environment in USD, STL, and x3d formats. The USD file is loaded into Isaac Sim during our simulations, while the x3d file can be easily converted into an octomap (Hornung et al., 2013) for subsequent evaluations. Additionally, we also compute the enclosing rectangle and an approximated non-convex polygon that, along with the STL file, are used during the assets placement procedure (see Section 3.1.4).

3.1.2. Objects

In addition to the environment itself, we aim to have the ability to include additional objects within the simulation to increase its diversity and complexity. To achieve this, we adapt the standard converter to be able to dynamically download and load various objects at runtime from datasets such as Google Scanned Objects (Downs et al., 2022) (GSO) and ShapeNet (Chang et al., 2015). These objects can then be placed randomly or at predefined locations, or animated as flying entities through random, non-physics-enabled transformations.

3.1.3. Pre-animated assets

The Omniverse connector that converts assets from Blender to the USD format works well for simple animated assets. However, similar to the challenges faced with environment conversion, it fails when handling complex objects like SMPL-based animations. As mentioned in Section 2.3, SMPL fittings are one of the most widespread models used to represent and control human pose and shape in simulations. Therefore, to incorporate animated humans into our experiments with GRADE, we introduce a new software tool based on Blender to automatically convert (clothed) SMPL animated sequences to USD format, for example, from the Cloth3D dataset (Bertiche et al., 2020). This tool allows us to correctly process and load various pre-animated human assets into GRADE, performing different pre-recorded motions as non-rigid entities, either through deformable meshes or subsequent skeletal transformations. Additionally, we generate the STL of the 3D trace of the animation. The STL files store the evolution of the 3D surface geometry occupied by the animated humans and their clothing throughout their entire animated sequences. By combining these with the STL representation of the environment (exported in the previous step), we construct a unified geometric representation that enables precise collision detection. This is thus fundamental to our automatic placement procedure, described in the next section, as it allows us to efficiently evaluate candidate positions and orientations of the animated assets.

3.1.4. Asset placement

Given an environment and a set of (animated) assets, we should avoid physical overlap among them. To this end, different strategies can be employed, such as hardcoding or manually setting their locations in advance (see Section 4.1). Still, to achieve randomized data generation and seamless testing across diversified scenarios, the placement procedure must be automated. However, simple 2D occupancy projections are not a good approach in our case. As we use clothing animations and diversified actions (e.g., jumping, dancing) the projected footprints, usually approximated with rectangles, can be significantly larger than the human model itself. However, their overlap with the footprints of the objects in the environment would not necessarily indicate a real collision. For instance, an animation where the arm passes over a table would indicate a collision when using a 2D occupancy map of the world, even though they are not actually colliding with each other.

Therefore, with GRADE, we introduce a custom placement strategy specifically tailored for (clothed) animated humans, which we use in our data generation procedure (see Section 5). The pseudocode of our approach is provided in Algorithm 1.

First, we load the STL files containing the 3D trace information of human and clothing animations along with the environment. The placement procedure is attempted up to 10 times per asset; if a valid position is not found, the asset is discarded (Algorithm 1, line 8). Each trial begins by selecting a candidate position and orientation for the asset origin. To ensure broad coverage and variation across experiments, the position is chosen randomly, uniformly distributed over the floorplan or enclosing rectangle (Algorithm 1, lines 9−13). The yaw orientation is instead uniformly chosen between 0 and 360°. We then check for intersections between the candidate asset, the environment, and any previously placed assets using the information we saved in the STL files. To balance realism and feasibility, we use an empirically determined collision threshold of 200 intersection points between the asset we are trying to place and any other mesh in the environment (Algorithm 1, line 15). Assets with fewer intersections remain in the scene to prevent unnecessary rejection of minor overlaps, such as slight penetrations with plant leaves or clothing. If the number of intersections exceeds this threshold, the placement attempt fails, and the loop repeats. Otherwise, after a successful placement, the environment updates dynamically to account for newly added assets. We implement this through a custom MoveIt interface, leveraging its integration with the Flexible Collision Library (FCL) for checking collisions between meshes. Notably, GRADE’s modular and flexible design enables seamless integration of alternative placement strategies when desired.

Algorithm 1: Pseudocode of our custom placement procedure.

3.2. Robot creation and control

3.2.1. Creation

Theoretically, custom robots can be loaded into Isaac Sim through the integrated URDF format converter. However, this does not work correctly for our robots—a three-wheeled omnidirectional robot and a flying drone—due to incorrect scaling factors, missing parts and joints, and improperly placed or absent sensors. To address these limitations, we construct our robot platforms directly within Isaac Sim by adding revolute and translational joints to mesh objects and saving them as single USD files. Joint configurations (e.g., limits, maximum speed) and sensor specifications (e.g., type, settings) can be predefined in the USD model, similarly to URDF files, or loaded and modified dynamically during the simulation. In GRADE, we implement the latter approach, enabling greater flexibility compared to URDF definitions or USD pre-configurations. We load and configure our robots entirely at runtime through simple Python code, including attributes such as number, type, and position of sensors, joint stiffness and physical responses, robot’s weight, and desired ROS topic names. These configurations can be set at the start of a simulation or adjusted dynamically during the run itself. This workflow simplifies the setup of custom robots and experiments, increasing control and flexibility of the simulations compared to previous approaches. For instance, it enables the seamless simulation of multiple heterogeneous robotic platforms, as shown in Section 4.2.

3.2.2. Control

Isaac Sim provides only a few default approaches to control assets and robot movements. However, using teleportation or rigid and non-physics-based transformations is inadequate for simulating realistic robot motions and collecting useful sensor data. Nonetheless, these methods can be viable when physics information is unnecessary, such as when collecting only visual data (see Section 4.1). Additionally, Isaac Sim allows direct joint control through low-level APIs via position or velocity setpoints, as applied in our experiment repetition procedure (see Section 4.3). However, this requires pre-configured waypoints and does not support pre-developed Software-In-the-Loop (SIL) control frameworks commonly integrated through ROS and Gazebo. Alternatively, Isaac Sim includes built-in motion and control models for a few specific platforms—an ineffective approach when working with custom robots that are not natively supported. For example, BenchBot (Talbot et al., 2020) relies on this feature to simulate ground robots, while also offering only a narrow set of predefined commands, further constraining its flexibility and generality. Moreover, Isaac Sim lacks support for fluid-dynamic physics, which is necessary for simulating UAVs and, similarly to Gazebo, it does not model frictionless perpendicular translation movements required for omnidirectional wheels (Bonetto et al., 2022). The recent PegasusSimulator (Jacinto et al., 2023) addresses PX4 UAV control by directly applying force to the drone mesh, thus still without simulating actual fluid dynamics.

Our approach differs as we seek to allow for custom robot simulation and control by employing a PID-based joint-level controller to manage robot movements. We leverage the ROS communication system and joint definitions, receiving position or velocity setpoints from other software, such as (N)MPC or Active SLAM frameworks, and convert them into low-level commands. The simulation software then processes those and translates them into robot movements. The ROS communication system is crucial for seamless integration with the Isaac Sim framework, allowing us to assign velocity and position setpoints to each joint independently while remaining agnostic to the underlying robot architecture. As a result, GRADE can support multiple platforms, from UAVs to robotic arms, and is not limited to individual robots or simple camera setups, unlike other simulators or frameworks.

3.3. Simulation management

The main simulation cycle, which uses Isaac Sim APIs along with custom utilities, primarily manages: (i) starting and configuring the simulation environment, (ii) loading, placing, and configuring assets and robots, (iii) executing several randomization procedures, (iv) launching complementary ROS nodes when necessary, and (v) managing simulation steps (both physics and rendering) and data saving. Through this framework, we can control various options, such as the number of dynamic assets, the initial location of the robot, and the size of the physics and rendering steps. It also allows programmatic and dynamic modification of environmental conditions, physics and rendering settings, light colors and intensity, material reflection parameters, asset textures, and the time of day, among others, thereby increasing the variability of simulations. This variability, along with the ability to independently enable or disable physics and ROS, allows us to support a wide range of simulation scenarios, thus enabling broader applicability.

In our work, we have explicitly implemented several illustrative simulations demonstrating different desirable applications, including Active-SLAM-based exploration, ROS-free data collection in a savanna environment with animated zebras, or experiment repetition, as described in Section 4. While Isaac Sim provides randomization and ground-truth data-saving methodologies, we found these functionalities limited. With GRADE, we expose and integrate the underlying methods directly into our simulation management approach for a customized experience. In particular, the default data-saving tool is restricted to camera-related information and executed each time a rendering call is made. However, as multiple rendering calls are required to generate an accurate image of complex environments due to path-tracing computations, this results in neither accurate nor comprehensive data. To address these limitations, we modify the saving process to gain finer control, fix various issues (e.g., segmentation ID overflow), and collect additional information, such as the camera’s vertical field of view and IMU measurements at each timestep. Moreover, our finer control allows us to customize sensor and ROS message rates independently of rendering and to access and modify the information before publishing, for instance, to add noise or implement custom drop rates.

3.4. Post-processing tools

Real-world sensor data is inherently noisy, exhibiting issues like measurement drift in IMUs, motion blur in images, and depth inaccuracies from sensors that follow exponential noise models. Therefore, ground-truth data generated by the engine must be processed to closely replicate real-world conditions, which is essential for training robust DL systems and evaluating methods such as Dynamic V-SLAM. Noise could also be introduced at runtime by modifying the data from the simulator before it is saved or published by ROS nodes. However, post-processing the saved ground-truth data offers greater flexibility by allowing, for example, multiple experiments with different noise levels.

Therefore, within GRADE, we develop a tool to introduce noise into the saved ground-truth data, building upon and extending methods from RotorS (Furrer et al., 2016) and Zhang et al. (Zhang et al., 2020). Specifically, our approach incorporates: (i) IMU noise and bias, (ii) RGB motion blur and rolling shutter noise, and (iii) depth filtering and noise. The noise augmentation tool is structured to allow seamless extension to additional sensor modalities. The framework processes raw data sequences or rosbags by applying user-defined perturbations to different data streams. Its structured pipeline separates data modalities (e.g., RGB, IMU), allowing different integrations with minimal modifications. We use this tool in experiments to prepare data for Dynamic V-SLAM and network training, adding noise to depth and RGB for one dataset while also augmenting segmentation masks for motion blur in another(see Sections 6.2 and 6.3). Beyond the noise augmentation, we also automate Dynamic V-SLAM evaluation and address errors in the generated ground-truth data caused by known issues in Isaac Sim, such as inaccurate 3D bounding boxes,²⁷ incorrect poses of some animated assets,²⁸^,²⁹ and timing discrepancies in rosbags.³⁰

4. GRADE case studies

We use GRADE to address three sample case studies that cover the different ways we can apply the framework. These are (i) a ROS-free simulation and curated data collection in a savanna environment (Section 4.1), (ii) online testing of Active SLAM approaches in a (multi-)robot scenario (Section 4.2), and (iii) our newly introduced experiment repetition setup (Section 4.3). Throughout this work, we use (i) a UAV Firefly model from RotorS (Furrer et al., 2016), and (ii) a three-wheeled omnidirectional robot from iRotate (Bonetto et al., 2022). As mentioned in Section 3.2, each robot is equipped with a single joint for every degree of freedom (six for the UAV and three for the omnidirectional ground robot), and their sensors (camera, IMU) are loaded dynamically at runtime.

4.1. ROS-free simulation in a savanna environment

In this example, we deploy a UAV in a savanna environment³¹ with several simulated animated zebras. The focus is on creating a custom simulation without using either ROS or any additional SIL method to control the robot. A schematic of this setup is provided in Figure 3. While the primary focus of this paper is on indoor dynamic environments, we include this experiment to demonstrate the adaptability of GRADE beyond human-centric indoor scenes. By evaluating GRADE in a completely different setting—an outdoor savanna with dynamic, non-human agents—we validate its ability to incorporate diverse environment sources, such as those available in Unreal Engine, and diverse animation sources, beyond SMPL-based humans, and extend its applicability to other domains where dynamic elements play a key role. The zebra and their animations are sourced from a free SketchFab³² asset. Using Blender, we export four animation sequences, namely, walking, eating, trotting, and jumping, create three different transition sets between these animations, and manually place the zebras within the main environment (Figure 4e, left side of Figure 1). Waypoints are provided to the Isaac engine directly from the main simulation loop as predefined position/orientation goals. We use either (i) a scripted sequence of waypoints for each of the six joints using the physics engine, or (ii) a physics-less mode where the drone acts as a floating object that “slides” smoothly between waypoints. The drone dynamics are governed by its mass and joint characteristics (e.g., damping coefficient) in the physics-enabled mode. Otherwise, the UAV follows an interpolated trajectory between goal locations.

Figure 3.

Flow diagram of the ROS-free system used in the savanna simulation presented in Section 4.1. In blue, we highlight our customizations. Here, we take a savanna environment from UE and an animated zebra model from Sketchfab. We combine them manually in a single USD environment. This USD is then loaded with the UAV robot model in Isaac Sim. The robot is controlled by the main simulation script directly, whether with or without the physics engine, which also manages the rendering and the data-saving steps.

Figure 4.

Few examples of environments that can be simulated with GRADE. The RGB images are shown in the top row, with the associated instance segmentations (randomly colored) below. For the multi-robot UAV images (Section 4.2), we highlight the other robots in the field of view with a red box. An external view of the UAV observing the city and the apartment environments can be observed in Figure 9. The images are best viewed in color.

4.2. (Multi-)Robots and active SLAM

This scenario uses GRADE in conjunction with previously developed ROS approaches. Specifically, we aim to use Active SLAM methods to explore indoor environments (from 3D-Front) with both UAVs and UGVs. The scheme of this workflow is depicted in Figure 5. We adapt and interface the FUEL (Zhou et al., 2021) Active SLAM framework with Isaac Sim to compute exploration goals for the UAV. FUEL uses online RGB-D and odometry data from the simulation to actively compute exploration goals. However, the original FUEL implementation leverages an integrated custom and simplified simulation to control the drone movements. Therefore, to bridge this gap and interface ourselves with Isaac Sim and GRADE, we supply the exploration goals to an additional NMPC (Kamel et al., 2017) to predict a realistic trajectory for the UAV. The final predicted state of this trajectory is then sent to our custom controller which then provides commands to the simulation itself. The omnidirectional robot is instead managed by the iRotate (Bonetto et al., 2022) framework in conjunction with our custom controller without any additional layer, as it natively provides NMPC velocity setpoints for the base and the independently rotating camera (Bonetto et al., 2021). In this scenario, we also create simulations that simultaneously manage multiple and heterogeneous robots. The challenge here is that we have to load specific sensor suites and set different ROS topics for each one of the robots created dynamically. We do so by allowing dynamic reconfiguration and loading of parameters within the same simulation script. An example of the multi-robot simulation is shown in Figure 4b-d. In those, two robots are UAVs, each running a different instance of FUEL, and one is the three-wheeled ground omnidirectional robot.

Figure 5.

A flow diagram of the main dataset generation pipeline presented in Sections 4.2 and 5.1. In blue, we highlight our customizations. First, we take environments from 3D-Front, flying objects (when desired) from ShapeNet and GSO, and animated humans from Cloth3D. We process both the environment and the animated humans using our custom converters, preparing the data for the simulation and the asset placement procedure. A single procedure then takes care of running the simulation, from the asset loading and placement to the data publishing and saving. The main simulation is in the loop with an Active SLAM method and a customized 6DOF controller that communicate with each other using ROS. The ground-truth data is then processed by our tools that apply fixes when needed (as described in the main paper), noise, and use it to train detection and segmentation approaches and evaluate Dynamic V-SLAM methods.

4.3. Experiment repetition and enhancement

GRADE provides a way to replay any previously recorded experiment precisely. This can be done either by keeping them as-is or by modifying them by selectively altering any number of conditions. These include, for example, attaching new sensors (e.g., cameras, LiDARs) to the original robot itself, changes in light conditions, or the inclusion of new robots, humans, animals, or other objects in the surrounding environment. This is while keeping the physics simulation enabled. Therefore, with GRADE, we introduce a new method that allows studying the robustness of different approaches by changing the robot’s surrounding conditions and/or expanding previously collected datasets under the exact same settings, for example, by collecting new data. To this end, we provide two possible solutions under the requirements that both (i) all of the robot poses throughout the experiment and (ii) the initial simulation conditions were logged. We either teleport the robot to the exact logged location and re-render the scene and the new sensors as-is, or use the previously logged joint velocities and target positions at every time step as targets of the Isaac Sim internal joint controller. While more flexible, this second approach could generate some minor deviation due to the unknown acceleration between two timesteps. However, combining the two strategies can easily mitigate this effect. Moreover, we can further interpolate the missing poses when necessary, for example, if a newly added sensor has a different rate than the one with which the pose was saved. Notably, repeating experiments with this approach removes any variability that the developed method might have—for example, when testing Active SLAM approaches. At the same time, it ensures that the remainder of the simulation is conducted under controlled and physically realistic conditions. Note that, differently from replaying rosbags, using fixed seed numbers when the simulation is prepared, or deterministic Gazebo runs, our approach allows changing the underlying state of the simulation. This would happen, for example, following modifications like adding new sensors to the robot (with mass), different scene content that is impacted by physics, additional robots in the scene, or fully disabling the physics to re-render a scene without dynamic elements, as we do in our experiments (Section 6.3.1). All of this can be controlled programmatically through the means of a simple Python script with which we can selectively choose both what to alter and then how to control the simulation itself.

Examples of the re-rendered images and depth maps during an experiment repetition run are provided in Figure 6, while the workflow can be found in Figure 7. In Figure 6 we also show an example of changes in the original simulation run by changing the scene contents and lighting conditions and observing the scene from a new perspective. We will evaluate the deviation w.r.t. the logged poses and the rendering differences in Section 6.1.

Figure 6.

Examples of RGB and depth maps generated using the experiment repetition functionality of GRADE. In the first column, we show the original RGB and depth frames. The second column depicts the re-generated sensor readings captured at the same location. In column c we show the difference between the two corresponding frames in the RGB and depth domains, that is, the difference between columns a and b. Columns d and e show the same scene with a different lighting condition, where column d uses the same viewpoint but hiding all the dynamic objects and column e changes the camera viewpoint to a different orientation. In column e the depth map is inverted for clarity. The images are best viewed in color.

Figure 7.

Flow diagram of the experiment repetition pipeline presented in Section 4.3. In blue, we highlight our customizations. Given the previously recorded experiment data, consisting of the simulation configuration, logged joint positions and velocities, and assets locations and animation sequences, we can exactly repeat the experiment while allowing a wide range of modifications. The main simulation tool loads the assets, optionally modifies the simulation content (e.g., removes objects, changes lighting conditions, adds sensors to the robot), repeats the experiment, and saves the data. Additional robots, assets, or SIL can be configured to interact with the repeated experiment by integrating them into the script that manages the simulation. The main robot can be controlled either by using teleporting or the joint control system. Both receive the logged information from the main simulation tool.

5. Data generation

We use GRADE to generate the data for our experiments in indoor dynamic environments through the aforementioned Active V-SLAM modality. Following what we described in the previous sections we use (i) the custom environment converter, to prepare the world and extract its STL and boundaries; (ii) the SMPL animation converter, to prepare the animated assets and extract the STL representing their trajectories; (iii) the custom placement procedure; (iv) GSO and ShapeNet as flying objects; and (v) our custom controller to manage the robot(s). An example of the richly annotated data generated can be seen in Figure 8 and the summary of the data we release with this work is presented in Section 5.1.

Figure 8.

An example of the data generated using our simulation framework GRADE. Top row, left to right: rendered RGB image, corresponding depth map, optical flow, and surface normals. Bottom row, left to right: 2D bounding boxes, semantic instances, semantic segmentation, and SMPL shapes. The images are best viewed in color.

Our main source of environments is the 3D-Front (Fu et al., 2021a) dataset, that is, one of the largest collections of mesh-based indoor environments. We enhance them with random textures from ambientCG³³ and varying lighting conditions. We also collect data in one outdoor city environment from the Unreal Engine marketplace³⁴ (Figures 4a Figures 9a), and once in an indoor³⁵ world from SketchFab (Figures 4b-d Figures 9b), both used also to further validate GRADE and Isaac Sim’s flexibility across different sources. The dynamic components in the scenes are animated humans and, in some experiments, random flying objects. The humans are taken from the Cloth3D (Bertiche et al., 2020) and AMASS (Mahmood et al., 2019) datasets. While Cloth3D provides clothed assets, AMASS CMU sequences consist of only unclothed SMPL fittings. While doing so, we randomize the appearance of the assets by using Surreal’s SMPL textures (Varol et al., 2017), that is, freely available low-resolution textures, as shown, for example, in Figure 8. The flying objects belong to various categories (e.g., toys, balls, tables, etc.) and serve multiple purposes: they generate occlusions between the camera and the other elements in the environment, increase the variability of the scene, and introduce dynamic elements that can challenge Dynamic V-SLAM methods. For example, they negatively impact feature rejection methods such as segmentation and detection models or optical flow approaches since they create occlusions, do not belong to common dynamic classes, and have unpredictable motions. Additionally, their presence reduces the likelihood of loop closures, as they can randomly cover the scene, further testing the robustness of the evaluated V-SLAM systems. Moreover, introducing them will allow us to automatically create images with partially covered humans, thus increasing the variability of their appearance on the images we will use to train detection and segmentation models. These objects, belonging to the GSO and ShapeNet datasets, are loaded dynamically at runtime. We rigidly “animate” them through random time-keyed transformations in scale, orientation, and position using multiple goals set within the environment limits. By design, this is done without considering any possible collision with the environment or other assets. While precise collision-avoidance strategies can be implemented, like pre-computing safe trajectories, this allows a higher variability of their motion patterns. Overall, this diverse data generation approach, enabled by GRADE, allows us to systematically analyze the impact of dynamic elements on perception and localization tasks.

Figure 9.

We show here an external view of the UAV (similar to Figure 1) in a city environment and an apartment, which are the same used in Figure 4a and 4b–9d, respectively. In (b) it is also possible to observe the UGV.

The main process is as follows. First, we load the environment, center it w.r.t. the origin, and generate the 2D occupancy map. In this, we also define the size of the physics step, that is, how much the clock will advance during every loop of the simulation, and rendering parameters, for example, auto-exposure or path/ray-tracing settings. In our experiments, the physics step is set to 1/240-th of a second. Second, we randomize lights’ colors, intensities, and surface roughness (i.e., reflection capabilities). Third, we load the robot, move it to the starting location, and attach its specific set of sensors (including the cameras), ROS publishers, and link it to the motion controller through the correct ROS topics. Based on the chosen configuration, the robot is controlled with either full 6 DOF capabilities or with a stabilized flight (without roll and pitch). The UAV’s joints are subject to position and speed limitations as shown in Table 1. Recall that the FUEL Active SLAM framework will control the UAV online during the simulation, allowing us to explore the environment and collect data autonomously. Then, we import a random number of people and randomly place them within the environment. The number of flying objects loaded varies based on the experiment settings. The initial location of the robot is randomized using the same placement procedure as the animated assets to prevent collisions. Additionally, because our automatic data generation relies on a Visual SLAM approach, we pre-optimize the robot’s initial orientation to avoid featureless areas, such as windows. In our environments, we do not include artificial panoramic backgrounds or detailed outdoor scenes typically found around homes, meaning these regions lack the necessary visual cues for robust V-SLAM initialization. By ensuring the robot does not start facing such areas, we improve both the stability of SLAM and the reliability of the automatic data generation process.

Table 1.

We Report the Joint Limits Used for the UAV in Our Data Generation Procedure, as Described in Section 5.1.

	x, y, z	Roll, Pitch	Yaw
Position	Env. limits	[−25, 25] or [0,0] deg	[0.360] deg
Speed	0.5 m/s	40 deg/s	30 deg/s

Overall, considering that all simulations have animated humans and accounting for the degrees of freedom of the robot and the presence of flying objects, we obtained six scenarios (excluding the outdoor and multi-robot setting) summarized in Table 2.

Table 2.

Summary of our generated data, including the number of sequences released for each configuration. nThe number of humans is randomly selected between 7 and 40 before placement. However, the final number of humans in the scene may be lower due to space constraints that prevented their successful placement, as explained in Section 3.1.4. The number of flying objects taken from the GSO and ShapeNet datasets is fixed. A tick in the horizontal column indicates whether the uav is free to move or constrained to a horizontal orientation (i.e., can only rotate in yaw).

Humans	GSO	ShapeNet	Horizontal	Sequences
7-40	0	0		63
	0	0	✓	77
	5	5		44
	5	5	✓	63
	10	10		33
	10	10	✓	62

After loading and setting up the simulation experiment, we bootstrap the first 1 second for every experiment to randomize the initial conditions of the robot. When the bootstrap sequence ends, we publish a single message to signal the start of the experiment and record data for 60 seconds. In the main simulation loop, we (i) advance the physics one step at a time, (ii) automatically control the animation timeline and the rendering steps, (iii) publish the ROS information at the desired rates (using the physics step as a reference), and (iv) write data to the disk. The control of the animation timeline and the rendering steps is necessary because, as mentioned previously, multiple calls to the path-tracing function are necessary to render complex scenes correctly. Each one of these calls, however, will “advance” the animations in the scene (by advancing the time on the timeline). Therefore, in GRADE, we implement a procedure to ensure a correct alignment of the physics, rendering, and animations towards a precise data generation procedure.

The scene is rendered with path-tracing and auto-exposure. The rendering speed greatly depends on the number of lights, reflections, assets in the scene, and cameras. We use two cameras with the same horizontal and vertical FOV, one low resolution (640 × 480) and one high resolution (1920 × 1080). Across different architectures, rendering each couple of views took an average of 12 seconds, including the time necessary to get the remaining ground-truth information, such as instance segmentation. However, by tuning the simulation parameters, one can improve that to multiple images per second, up to processing times faster than 15 FPS. In this, the physics simulation step causes a noticeable delay whenever high-frequency messages related to physics need to be published (e.g., the IMU) due to its tiding with the USD files and the necessity to constantly read and write information to them.³⁶ The low-resolution RGB and depth are published with ROS and saved with the rosbag tool. The high-resolution camera is used to save ground-truth data with numpy arrays such as 2D and 3D bounding boxes, instance segmentation masks, camera pose, as well as RGB and depth. A full description of the data we save with the corresponding rates is presented in Table 3. The animations are reversed based on the average movement duration to increase variability and avoid too many static figures. Note that, due to flying objects and the unpredictability of moving subjects, sometimes the drone goes “through” dynamic assets. Although a way to avoid this would be to pre-sample valid trajectories, we see this as an opportunity to develop more reliable systems. Indeed, when these events occur, they cause situations that are usually untested with currently available real-world datasets, like losing track of the features due to completely black (e.g., corrupted) images or sudden changes between frames (e.g., due to a new flying object). However, those are possible scenarios that can occur if one considers real hardware or communication links that might fail or degrade or real-world environments in which animals or anything else could cause occlusions of the camera lens even for a brief moment. Nonetheless, we are not aware of any data currently available that poses such a challenge. Importantly, our experiment repetition tool (Section 4.3) provides a means to systematically address occlusions without compromising experimental consistency. The tool enables the regeneration of frames affected by occlusions while maintaining the original experiment setup, including camera poses, scene dynamics, and lighting conditions. This is without introducing errors, as shown in Section 6.1, preserving both realism and repeatability. Additionally, this capability will allow us to evaluate how different algorithms respond to different scenarios. We bring this concept to an extreme during our evaluations of Dynamic V-SLAM methods by re-rendering the tested dynamic sequences without any moving object (Section 6.3). We provide an example of a re-rendered scene without the dynamic elements in Figure 6d.

Table 3.

For each sensor (first row), we report the frequency in Hz used during our data generation procedure (Section 5.1).

Sensor	Clock	IMU	TF	Joint state	Camera pose	Odometry	RGB, depth semantics	Starting experiment
Hz	240	240	120	120	60	60	30	once

5.1. Summary of released data and code

We release 342 sequences of 1800 frames each that, at 30 fps, correspond to 342 minutes of video, that is, 615K frames. Those are summarized in Table 2. For each of these 342 experiments, we release depth data, instance segmentation (including clothing segmentation), 2D tight and loose bounding boxes,³⁷ 3D bounding boxes, and the corresponding camera information and poses. Additionally, we release the processed animated human data with 3D per-vertex locations and skeletal information. All of the aforementioned data is saved via numpy arrays. For each sequence, we also release the recorded rosbags with IMU readings, TF tree, joint states, low-resolution RGBD images, and the robot’s state. For convenience, IMU readings, camera pose, and robot pose, which we originally stored in the rosbags, are also provided independently from them as numpy arrays. For each experiment, we provide the initial configuration of each asset, the state of the random number generator used, the USD file of the simulation, and other accompanying information necessary to replay the experiment. Other data, such as normal vectors and optical flow, can be generated using the experiment repetition tool.

All the USD files of the animated sequences and environments can be freely downloaded upon acceptance of the necessary licenses, or generated from scratch when necessary (ShapeNet objects). The source code is fully open-source. For convenience, we also release the pre-processed data to test Dynamic V-SLAM methods and the data relative to each experimented approach reported in this work in Tables 9 –12. Finally, we release the checkpoints of the trained networks, the data used to train them, as well as the labels for the TUM RGB-D dataset. A summary of the datasets used for the deep learning experiments, their training and validation splits, and the notation used in this work for these can be seen in Table 4. Some examples of the generated data, as well as external views of the drone in the environment, can be observed in Figure 1[right], Figures 4a-d, Figures 8, Figures 9a, and 9b.

Table 4.

Legend of the nomenclature used in the training and evaluation of the syn-to-real testing of the humans’ detection and segmentation models introduced in Section 6.2. In the first column, we indicate the abbreviation used throughout this work. In the second column, we report its brief description. When needed, the last two columns contain the number of samples in the training and validation sets, respectively.

	Description	Train imgs.	Val imgs.
CH	COCO’s images containing humans	64,115	2693
S-CH	COCO random subset containing humans	1256	120
TH	TUM RGB-D fr3 walking sequences	—	3580
S-GRADE	Subset of our synthetic data	16 K	2 K
A-GRADE	Our full synthetic data	473 K	118 K
* -EX	Model trained on * and saved at epoch X
BASELINE	Official model weights trained on COCO

6. Results

In this section, we first analyze the experiment repetition module in Section 6.1 to verify that the re-generated poses, images, and depth maps correspond to the ones of the original experiment. We then evaluate how well the generated data can address the syn-to-real gap in Section 6.2. Finally, we benchmark Dynamic Visual SLAM methods in Section 6.3 to verify the usability of the simulated data to evaluate such approaches. There, we will study their limitations and the relation between their performance and the underlying deep detection and segmentation models.

6.1. Experiment repetition evaluation

To test the exactness of the experiment repetition procedure we employ a 60-s experiment and compare the precision of: (i) position and orientation values over 3601 poses, and of the re-rendered (ii) RGB, and (iii) depth maps over 1800 samples. Table 5 reports the mean and standard deviation of the difference between the originally recorded and re-executed robot poses for each component. The results show that the average deviation is on the order of 1e − 7 m, demonstrating the high precision of our approach. To verify the rendering consistency, we then compute the average of the RGB image structural similarity (SSIM) (Wang et al., 2004) between the re-rendered and the original frames. The resulting average SSIM is 99.6% with a 0.15% standard deviation, which indicates near-perfect similarity. Finally, we evaluate the precision of the re-computed depth maps using the mean average difference and SSIM index w.r.t. the original ones. The mean of the average difference between the original and repeated depth maps is 0.0015 m with a standard deviation of 0.0019 m. Note that this is mostly due to aliasing effects and pixel values that lie alongside the borders of the objects. Indeed, the corresponding depth images have a structural similarity attested at 99.8% with a 0.23% standard deviation.

Table 5.

Evaluation of the precision of the robot poses obtained using the experiment repetition tool. For each component, we report the mean and the standard deviation of the difference between the repeated and the original values computed over 3601 instances. Position errors are expressed in meters while angle errors in radians.

	x	y	z	Roll	Pitch	Yaw
Mean	−3.50e − 6	−9.00e − 7	−1.94e − 7	−4.61e − 7	−8.70e − 7	−4.48e − 7
Std. Dev.	2.75e − 5	3.06e − 5	2.71e − 5	2.93e − 5	2.90e − 5	2.90e − 5

6.2. Syn-to-real transfer learning

Our objective is to demonstrate that synthetic data generated with GRADE successfully captures real-world features and enables the training of models that generalize well over real images. To this end, we evaluate GRADE’s syn-to-real transfer capabilities using two popular neural networks: YOLOv5 (Jocher et al., 2022) and Mask R-CNN (He et al., 2017). Our objective is the detection and segmentation of humans. We train the networks in three modalities: (1) from scratch with both synthetic and real data, (2) fine-tuning with real-world images the networks pre-trained on synthetic data, and (3) using datasets of mixed synthetic and real-world data, indicated with a “+” sign between the datasets’ acronyms.

To train YOLO and Mask R-CNN, we use both (i) a subset of the generated dataset, which we will refer to as S-GRADE, and (ii) the complete dataset, that is, A-GRADE. Images with a high probability of being occluded by flying objects due to peculiar depth and/or color information are automatically discarded. S-GRADE consists of 18 K frames, with dynamic humans and without flying objects. Of those, 16.2 K have humans in them and 1.8 K are only background. A-GRADE contains all available data, that is, 591K images, of which 362K have humans. We augment the images of S-GRADE using a random rolling shutter noise model (μ = 0.015, σ = 0.006) and a fixed exposure time of 0.01 following (Zhang et al., 2020). To augment A-GRADE, we use a random exposure time between 0 and 0.1 seconds for each sequence and update the segmentation masks and bounding boxes to account for the additional motion blur. This is unnecessary for S-GRADE noisy images as the noise is much lower due to the shorter (and fixed) exposure time.

The real data is obtained from the COCO dataset and the fr3/walking sequences of TUM RGB-D (Sturm et al., 2012). From COCO, we utilize only the subset of data containing humans in the frame and will call this dataset CH (COCO-Humans) from now on. CH contains 64,115 training and 2693 validation images. From those, we randomly sample 1256 training and 120 validation images, totaling $\sim 2$ % and $\sim 4$ % of CH training and validation sets. In this work, we call this subset S-CH and use it to understand how the networks perform with limited real data. The fr3/walking sequences consist of 3579 images with people, 5362 instances, and 130 background samples. We manually label those with precise bounding boxes and segmentation masks using the free version of Roboflow³⁸ and release this data publicly. We will use TH to indicate this data in our tests. CH exhibits high variability in human representation, including outdoor scenes, crowds, and diverse clothing. In contrast, the TH dataset aligns with our synthetic data, focusing on indoor dynamic sequences. We evaluate the performance with the COCO standard metric (mAP@[.5, 0.95], mAP in this work) and the PASCAL VOC’s metric (mAP@.5, mAP50 in this work). Note that, differently from PeopleSansPeople (Ebadi et al., 2022), we save the best checkpoint based on the training dataset’s validation set, that is, not using CH or its subset. We do not perform any hyperparameter tuning and use the default network settings. Our baseline models (called BASELINE henceforth) are the networks trained on the full COCO dataset. A recap of these datasets and of the notation used in the next sections is in Table 4.

6.2.1. Human detection with YOLOv5

We trained YOLOv5s using its default parameters and for the standard 300 epochs. The results are reported in Table 6.

Table 6.

YOLOv5s bounding box evaluation results. We report the mAP50 and mAP over the specified validation set. We put in bold the best result and in italics the second best. The baseline is obtained using the officially released model of YOLOv5s trained on the full COCO dataset.

		CH		TH
(Pre-)Training set	Fine-tuning set	mAP50	mAP	mAP50	mAP
BASELINE	—	0.753	0.492	0.916	0.722
S-CH	—	0.492	0.242	0.661	0.365
S-GRADE	—	0.206	0.109	0.616	0.425
S-GRADE-E50	—	0.234	0.116	0.683	0.431
A-GRADE	—	0.176	0.093	0.637	0.459
A-GRADE-E50	—	0.282	0.154	0.798	0.613
S-GRADE	S-CH	0.561	0.302	0.744	0.488
A-GRADE	S-CH	0.540	0.299	0.762	0.514
S-GRADE	CH	0.801	0.544	0.931	0.778
A-GRADE	CH	0.797	0.542	0.932	0.786
S-GRADE + S-CH	—	0.590	0.334	0.855	0.648
A-GRADE + S-CH	—	0.527	0.289	0.801	0.597
S-GRADE + CH	—	0.801	0.547	0.938	0.786
A-GRADE + CH	—	0.764	0.503	0.936	0.778

We first analyze the models trained only on single datasets. When evaluated on CH validation data, the model trained from scratch with the S-GRADE dataset exhibits lower precision than the one trained solely with S-CH. However, on TH data, these models show comparable performance, with the network trained solely on S-GRADE achieving approximately $\sim 5$ % lower mAP50 but 6% higher mAP. The network pre-trained with S-GRADE and then fine-tuned on S-CH shows a significant performance improvement w.r.t. the model trained using only S-CH. Specifically, this increase is around 6% on the CH validation (both metrics), and of +8% and +12% on the TH dataset mAP and mAP50, respectively. Similarly, fine-tuning S-GRADE with the full CH dataset surpass the baseline results by $\sim 5$ % in both metrics on CH, and ∼ + 2% mAP50 and ∼ + 5% mAP on the TH dataset. Models that use the A-GRADE dataset during training exhibit comparable or slightly worse performance on the CH validation set than the ones that use S-GRADE data (less than 3%). At the same time, using A-GRADE in place of S-GRADE data yields better performance on the TH dataset $(\sim 3 %)$ . We can therefore say that the models that use A-GRADE or S-GRADE during training tend to perform well on indoor human detection and do not generalize well to CH. This is likely due to the lack of examples in the synthetic dataset that can correctly represent the data distribution of CH. Interestingly, the models evaluated using the checkpoint saved at the 50th training epoch, identified as E50 in Table 6, consistently outperform the corresponding models trained only with A-GRADE or S-GRADE in all metrics and datasets. Moreover, A-GRADE-E50, when tested on TH data, outperforms models trained from scratch on both synthetic and S-CH datasets or fine-tuned on S-CH, achieving a remarkable 79.8% mAP50. This indicates that there would be an advantage of using the real-world data during validation, as done in PeopleSansPeople (Ebadi et al., 2022). However, this would prevent us from correctly evaluating the performance of using solely the synthetically generated images by introducing a bias when using only synthetic data.

Training using mixed synthetic and real datasets generally outperforms the corresponding pre-training and fine-tuning strategy using the same datasets. For example, the results on the TH dataset increased by up to $\sim 11$ % mAP50 and $\sim 16$ % mAP with the model trained on S-GRADE+S-CH. The only significant improvement on the CH validation set is observed with the S-GRADE+S-CH data, likely due to the difference in cardinality of the combined datasets and the data distribution of the CH data.

Overall, the best-performing model we obtained is S-GRADE+CH with an improvement over the baseline of $\sim 5$ % on both metrics on the CH dataset, and 2% and 6% on the TH dataset. These results indicate how the data generated with GRADE can be used both to address the syn-to-real gap and improve the results of the models trained only on real-world data.

6.2.2. Human detection and segmentation with mask R-CNN

We use the detectron2 (Wu et al., 2019) implementation of Mask R-CNN, using a 3x training schedule³⁹ and a ResNet50 backbone. We use the default steps (210 K and 250 K) and maximum iterations (270 K) parameters with four images per batch when training A-GRADE and CH. We reduced those to 60 K, 80 K, and 90 K when training S-GRADE and to 80 K, 108 K, and 120 K and two images per batch for S-CH due to their relatively small size. We evaluate the models every 2 K iterations and save the best one by comparing the mAP50 metric on each of the two tasks, detection and segmentation. Due to the size of the A-GRADE dataset, we evaluate the model trained from scratch on this data every 3 K iterations. We save the best model separately for each task and evaluate its accuracy using 0.05 and 0.70 confidence thresholds. Since the training and evaluation schedules greatly impact the performance of this network⁴⁰ we also train from scratch the same network with the CH data using our configuration.

The results presented in Table 7 are similar to those obtained with YOLO in the previous section. However, unlike the previous case, fine-tuning with the entire CH dataset does not improve upon the official baseline. In contrast, combining A-GRADE or S-GRADE with CH generally results in higher mAP and mAP50 than the model trained solely on CH data, emphasizing both the impact of training procedures and our data.

Table 7.

Mask R-CNN detection and segmentation results using both thresholds, i.e., 0.7 and 0.05. For both tasks, we report the mAP50 and mAP over the specified validation set. We put in bold the best results and in italics the second best. The baseline is obtained using the officially released model of Mask R-CNN trained on the full COCO dataset. We also report the model trained on CH using our training and validation schedules for a fairer comparison.

		Threshold 0.7								Threshold 0.05
		Detection				Segmentation				Detection				Segmentation
		CH		TH		CH		TH		CH		TH		CH		TH
(Pre-)Training set	Fine-tuning set	mAP50	mAP	mAP50	mAP	mAP50	mAP	mAP50	mAP	mAP50	mAP	mAP50	mAP	mAP50	mAP	mAP50	mAP
BASELINE	—	0.727	0.504	0.860	0.709	0.723	0.445	0.870	0.652	0.852	0.567	0.910	0.738	0.828	0.493	0.913	0.675
CH	—	0.693	0.471	0.829	0.653	0.681	0.410	0.838	0.584	0.829	0.537	0.898	0.692	0.801	0.461	0.890	0.611
S-CH	—	0.340	0.161	0.526	0.250	0.351	0.155	0.543	0.231	0.439	0.195	0.610	0.282	0.392	0.168	0.568	0.241
S-GRADE	—	0.128	0.064	0.563	0.312	0.100	0.043	0.509	0.264	0.167	0.077	0.637	0.343	0.117	0.048	0.561	0.283
A-GRADE	—	0.202	0.115	0.727	0.502	0.178	0.088	0.709	0.408	0.269	0.140	0.784	0.531	0.214	0.100	0.749	0.425
S-GRADE	S-CH	0.428	0.232	0.708	0.412	0.401	0.195	0.665	0.374	0.518	0.265	0.748	0.432	0.465	0.216	0.694	0.387
A-GRADE	S-CH	0.450	0.262	0.736	0.489	0.460	0.231	0.758	0.449	0.560	0.303	0.788	0.515	0.515	0.247	0.780	0.458
S-GRADE	CH	0.693	0.474	0.858	0.679	0.682	0.415	0.858	0.611	0.833	0.539	0.916	0.713	0.805	0.467	0.905	0.633
A-GRADE	CH	0.714	0.489	0.869	0.696	0.710	0.430	0.869	0.638	0.843	0.550	0.916	0.728	0.813	0.476	0.908	0.660
S-GRADE + S-CH	—	0.297	0.154	0.650	0.381	0.268	0.126	0.608	0.321	0.408	0.194	0.724	0.417	0.344	0.149	0.661	0.346
A-GRADE + S-CH	—	0.300	0.168	0.791	0.561	0.283	0.138	0.746	0.467	0.384	0.200	0.842	0.588	0.335	0.155	0.779	0.483
S-GRADE + CH	—	0.683	0.463	0.859	0.676	0.671	0.401	0.849	0.603	0.821	0.528	0.917	0.709	0.790	0.452	0.896	0.626
A-GRADE + CH	—	0.563	0.356	0.846	0.659	0.540	0.306	0.846	0.587	0.713	0.422	0.903	0.689	0.669	0.355	0.888	0.608

These tests also show that, when using the A-GRADE dataset, we can consistently outperform the corresponding model that uses S-GRADE in both datasets and tasks. At the same time, the model trained only with A-GRADE synthetic data performs worse than the one trained only with S-CH real data when evaluated on CH. Still, the result is reversed on the models validated on the TH dataset, with one trained only with A-GRADE outperforming the one trained only with S-CH data of up to $\sim 25$ %. This is true considering both tasks and both confidence thresholds.

Overall, the best model is obtained by pre-training on A-GRADE and fine-tuning on CH. This training strategy yields an improvement of $\sim 2$ % and 3 − 5% on CH and TUM datasets, respectively. The second-best model is the one trained on S-GRADE and fine-tuned on CH. The models trained on mixed data, in general, perform worse than the corresponding fine-tuned counterparts on the CH dataset while achieving similar results on the TUM dataset. These differences, also w.r.t. YOLO findings, are likely due to imbalanced synthetic and real data, and the scheduling, which greatly affects the number of epochs. Indeed, the model trained on the mixed S-GRADE+CH data performs similarly to the model pre-trained on S-GRADE and fine-tuned on CH.

Notably, unlike prior work such as Bayraktar et al. (2018), our approach achieves strong (although not perfect) generalization to real-world images using only synthetic data, even without incorporating real data in the validation set. Similar to their findings, mixing datasets leads to improved performance over the baseline. However, an equally crucial factor is the visual realism of the simulation itself, as one aims to avoid retraining perception models for every new task or system verification. Moreover, for better generalization, a system trained solely on synthetic data can be applied to out-of-distribution tasks where annotated datasets or large volumes of real images are unavailable (Black et al., 2023; Bonetto and Ahmad, 2023, 2024; Saini et al., 2022; Zuffi et al., 2017). Our results highlight the effectiveness of our approach in leveraging synthetic data for robust generalization. This reinforces the potential of our method for applications where real annotated data is scarce or unavailable.

6.3. Dynamic Visual SLAM

With these evaluations, we pursue the following objectives. First, we must verify that the synthetic data generated with GRADE can be successfully used to evaluate V-SLAM approaches in static environments. Then, we benchmark current state-of-the-art methods using simulated runs in dynamic indoor environments. We do so in Section 6.3.1. Finally, we will study in Section 6.3.2 the impact that the performance of the underlying detection and segmentation models have on two different Dynamic V-SLAM methods using both synthetic and real-world sequences.

To perform these evaluations we select eight different runs among all the sequences generated in Section 5. Each is 60 seconds long and comprises RGB-D (30 fps), IMU (240 Hz), and ground-truth pose (60 Hz) data. The eight runs are divided and labeled as follows: two are recorded in static environments (S), two contain dynamic people and no flying objects (D), two have both people and flying objects (F), and two present an occlusion of the camera (WO). The occlusion creates some challenging completely black frames during the experiments. For each one of these kinds (S, D, F, WO), the UAV is either: kept horizontal, in which case it cannot perform roll and pitch movements, or is free to move, to increase variability in our evaluations. If the robot is not free to move we post-fix the label of the run with the letter H, that is, SH, DH, FH, and WOH. To complete our evaluations, we also use our experiment repetition procedure to re-render all the dynamic sequences into their static counterparts by disabling all the dynamic assets and re-rendering RGB and depth data. We indicate those by post-fixing -static. This allows us to study the effects of the dynamic entities in our experiments. Statistics of the average camera (absolute) speed, average (absolute) acceleration, number of dynamic frames, and average portion of the dynamic frame belonging to dynamic content of the S, SH, D, DH, WO, WOH, F, and FH sequences are reported in Table 8. The average speed and acceleration are obtained from the ground-truth values of the odometry recorded at 60 Hz. We note that Table 8 shows small velocity and acceleration components in the roll and pitch axes for the DH experiment. However, in DH, the robot is constrained to remain horizontal, meaning that roll and pitch velocities should theoretically be zero. Upon inspection of the data, we confirmed that these deviations resulted from a minor collision between the robot and the environment, which introduced unintended but minimal rotational effects.

Table 8.

Motion and dynamic frames analysis for each one of the dynamic sequences used in our Dynamic V-SLAM benchmarking (Section 6.3).

	Average speed [x, y, z], m/s	Average acceleration [x, y, z], m/s²	Dynamic frames	Dynamic frames coverage (%)
	[roll, pitch, yaw], rad/s	[roll, pitch, yaw], rad/s²	Dynamic frames	Dynamic frames coverage (%)
FH	[0.084, 0.091, 0.014]	[0.425, 0.551, 0.123]	1800	34.06
FH	[0.000, 0.000, 0.444]	[0.000, 0.000, 0.698]	1800	34.06
F	[0.244, 0.239, 0.139]	[0.891, 0.882, 0.740]	1247	11.55
F	[0.189, 0.184, 0.436]	[1.625, 1.564, 1.242]	1247	11.55
DH	[0.255, 0.185, 0.046]	[1.011, 0.938, 0.416]	959	8.06
DH	[0.029, 0.014, 0.476]	[0.291, 0.158, 1.659]	959	8.06
D	[0.293, 0.264, 0.127]	[1.206, 1.248, 0.895]	1196	10.01
D	[0.288, 0.269, 0.422]	[1.895, 1.797, 1.070]	1196	10.01
WOH	[0.275, 0.255, 0.086]	[0.949, 1.025, 0.553]	1181	11.33
WOH	[0.000, 0.000, 0.433]	[0.000, 0.000, 0.725]	1181	11.33
WO	[0.224, 0.249, 0.103]	[1.011, 1.059, 0.634]	1763	17.94
WO	[0.214, 0.220, 0.416]	[1.892, 1.564, 1.278]	1763	17.94
SH	[0.304, 0.254, 0.047]	[0.965, 0.908, 0.273]	0	—
SH	[0.000, 0.000, 0.426]	[0.000, 0.000, 1.018]	0	—
S	[0.259, 0.236, 0.114]	[0.915, 0.822, 0.538]	0	—
S	[0.213, 0.228, 0.435]	[1.592, 1.571, 1.013]	0	—

The ground-truth data generated by the simulator is processed before the evaluations to make it closer to real-world conditions. Depth data is first limited to 3.5 m—a reasonable value when using, for example, a RealSense D435i. Then we limit it to 5 m, to study the effect of the depth range on the SLAM methods. We evaluate the SLAM frameworks using the same data but enhanced with additional noise. The noise applied to the depth values is based on the RealSense noise model.⁴¹ We apply random rolling shutter noise (μ = 0.015, σ = 0.006) and blur (following Zhang et al. (2020)) to the RGB data. The IMU drift and noise parameters are taken from Furrer et al. (2016).

We will evaluate two static V-SLAM methods and four Dynamic V-SLAM approaches. Those are: (i) RTABMap (Labbé and Michaud, 2019) and (ii) ORB-SLAM2 (Mur-Artal and Tardós, 2017), which do not explicitly address dynamic entities; (iii) DynaSLAM (Bescos et al., 2018), which uses Mask R-CNN to segment dynamic content; iii-iv) DynamicVINS (Liu et al., 2022) (in both its VO and VIO variations, abbreviated to DynaVINS here), which uses YOLO to detect it; (v) StaticFusion (Scona et al., 2018), a non-learning method that performs RGB-D based clustering; and (vi) TartanVO (Wang et al., 2020), that is, a learned visual odometry system developed specifically for challenging scenarios. For fairness, we choose to not modify the parameters of any of the SLAM approaches we benchmark. Despite that, we had to increase the number of extracted features to 3000 in both DynaSLAM and ORB-SLAM2 to allow for an easier and repeatable initialization in some sequences, and edit DynaVINS, taking inspiration from VINSFusion (Qin et al., 2019), to keep the system running in case of a tracking failure.

We report the absolute trajectory error (ATE) (Bujanca et al., 2021) and the total time a considered V-SLAM framework can successfully track the trajectory. The latter, expressed as tracking rate (TR), is a critical quantity to be considered. It helps the reader put ATE values in perspective whenever the tested method fails due to some featureless frames or occlusions, as the ATE alone cannot completely quantify the robustness (Bujanca et al., 2021). The ATE has been computed using the standard TUM RGB-D evaluation tool. We perform 10 different trials and report the mean and standard deviation of both metrics for each test.

6.3.1. Visual SLAM performance

The results are presented in the Tables 9 –11. The first and most noticeable result we can observe is that TartanVO and StaticFusion consistently exhibit high ATE and perfect TR. TartanVO’s performance can be attributed to poor generalization capabilities due to the domain gap between its training data and GRADE scenes along with its reliance on RGB information alone. In contrast, StaticFusion is sensitive to parameter tuning, as observed also in other studies (e.g., Runz et al., 2018).

Table 9.

ATE RMSE [M] and Tracking Rate (TR) obtained by evaluating different state-of-the-art models on the selected GRADE sequences (rows) re-rendered in their static versions using our experiment repetition tool. We report the mean and standard deviation over ten trials. The columns indicate the evaluated method and the two metrics considered. The depth data for these experiments is limited to 3.5 m, without additional noise.

		DynaVINS—VO		DynaVINS—VIO		StaticFusion		TartanVO		DynaSLAM		ORB-SLAM2		RTABMap
		ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR
D-static	mean	0.366	0.992	0.158	0.989	2.878	0.999	1.732	1.000	0.047	0.985	0.020	0.984	0.090	0.198
D-static	std	0.330	0.002	0.012	0.000	0.000	0.000	0.000	0.000	0.051	0.021	0.003	0.020	0.000	0.000
DH-static	mean	1.548	0.737	6.768	0.993	2.745	0.999	1.173	1.000	0.008	0.107	0.005	0.189	0.086	0.655
DH-static	std	0.192	0.220	0.202	0.000	0.000	0.000	0.000	0.000	0.006	0.038	0.000	0.007	0.000	0.000
F-static	mean	1.176	0.825	1.344	0.982	2.823	0.999	3.765	1.000	0.621	0.841	0.327	0.904	0.049	0.317
F-static	std	0.404	0.262	0.549	0.000	0.000	0.000	0.000	0.000	0.498	0.035	0.481	0.006	0.000	0.000
FH-static	mean	0.020	0.993	0.038	0.990	0.100	0.999	0.531	1.000	0.011	1.000	0.010	1.000	0.093	1.000
FH-static	std	0.002	0.000	0.002	0.001	0.000	0.000	0.000	0.000	0.002	0.000	0.004	0.000	0.000	0.000
WO-static	mean	0.806	0.990	0.201	0.976	0.062	0.999	2.437	1.000	0.036	0.793	0.040	0.752	0.120	1.000
WO-static	std	0.442	0.003	0.015	0.007	0.000	0.000	0.000	0.000	0.029	0.288	0.025	0.342	0.000	0.000
WOH-static	mean	1.473	0.984	0.139	0.985	1.469	0.999	2.610	1.000	0.012	0.538	0.015	0.538	0.208	1.000
WOH-static	std	0.264	0.008	0.014	0.000	0.000	0.000	0.000	0.000	0.004	0.000	0.009	0.000	0.000	0.000

Table 10.

ATE RMSE [M] and Tracking Rate (TR) obtained by evaluating different state-of-the-art models on the selected GRADE sequences (rows) grouped in their ground-truth and noisy versions. The ground-truth sequences are reported in the upper half of the table, while the ones with added noise are in the bottom half. We report the mean and standard deviation over ten trials. The columns indicate the evaluated method and the two metrics considered. The depth data for these experiments is limited to 3.5 m.

			DynaVINS—VO		DynaVINS—VIO		StaticFusion		TartanVO		DynaSLAM		ORB-SLAMv2		RTABMap
			ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR
Ground-truth data	D	mean	1.450	0.888	0.192	0.989	1.212	0.999	1.264	1.000	0.042	0.830	0.283	0.981	0.417	0.891
	D	std	0.441	0.096	0.010	0.000	0.000	0.000	0.000	0.000	0.009	0.111	0.036	0.021	0.000	0.000
	DH	mean	1.582	0.644	8.020	0.993	1.664	0.999	1.259	1.000	0.011	0.097	0.005	0.179	0.091	0.654
	DH	std	0.468	0.301	0.282	0.000	0.000	0.000	0.000	0.000	0.007	0.028	0.001	0.007	0.000	0.000
	F	mean	1.532	0.841	2.057	0.980	2.866	0.999	4.132	1.000	0.858	0.440	0.294	0.565	0.086	0.219
	F	std	0.504	0.230	0.478	0.001	0.000	0.000	0.000	0.000	0.184	0.115	0.151	0.228	0.000	0.000
	FH	mean	0.220	0.993	0.075	0.989	0.085	0.999	0.551	1.000	0.258	1.000	0.295	1.000	0.324	1.000
	FH	std	0.058	0.000	0.006	0.000	0.000	0.000	0.000	0.000	0.054	0.000	0.067	0.000	0.000	0.000
	WO	mean	1.219	0.910	0.582	0.957	2.807	0.999	2.473	1.000	0.090	0.079	0.157	0.197	0.275	0.197
	WO	std	0.291	0.092	0.072	0.003	0.000	0.000	0.000	0.000	0.011	0.000	0.007	0.000	0.000	0.000
	WOH	mean	1.474	0.808	0.223	0.981	1.980	0.999	2.361	1.000	0.013	0.538	0.011	0.538	0.088	0.569
	WOH	std	0.548	0.120	0.027	0.002	0.000	0.000	0.000	0.000	0.002	0.000	0.002	0.000	0.000	0.000
	S	mean	0.036	0.993	0.222	0.991	7.919	0.999	1.205	1.000	0.011	1.000	0.011	1.000	0.084	1.000
	S	std	0.003	0.000	0.010	0.000	0.000	0.000	0.000	0.000	0.001	0.000	0.001	0.000	0.000	0.000
	SH	mean	0.029	0.993	0.119	0.991	0.594	0.999	2.395	1.000	0.011	1.000	0.012	1.000	0.089	1.000
	SH	std	0.005	0.000	0.007	0.000	0.000	0.000	0.000	0.000	0.003	0.000	0.002	0.000	0.000	0.000
Noisy data	D	mean	1.362	0.899	0.693	0.989	2.278	0.999	1.356	1.000	0.061	0.554	0.725	0.877	0.401	0.606
	D	std	0.355	0.090	0.086	0.000	0.000	0.000	0.000	0.000	0.033	0.002	0.052	0.035	0.000	0.000
	DH	mean	1.628	0.609	1.982	0.993	1.091	0.999	1.234	1.000	0.003	0.051	0.004	0.054	0.052	0.175
	DH	std	0.642	0.338	0.268	0.000	0.000	0.000	0.000	0.000	0.001	0.000	0.000	0.000	0.000	0.000
	F	mean	2.039	0.855	2.431	0.975	3.992	0.999	4.223	1.000	0.212	0.255	0.142	0.258	0.047	0.191
	F	std	0.575	0.193	1.616	0.002	0.000	0.000	0.000	0.000	0.037	0.037	0.017	0.028	0.000	0.000
	FH	mean	0.455	0.991	0.207	0.988	0.854	0.999	0.582	1.000	0.244	0.986	0.240	1.000	0.184	1.000
	FH	std	0.199	0.003	0.104	0.001	0.000	0.000	0.000	0.000	0.062	0.042	0.081	0.000	0.000	0.000
	WO	mean	1.219	0.844	1.085	0.955	2.213	0.999	2.380	1.000	0.099	0.079	0.181	0.197	0.329	0.197
	WO	std	0.116	0.207	0.275	0.002	0.000	0.000	0.000	0.000	0.009	0.000	0.006	0.000	0.000	0.000
	WOH	mean	1.364	0.910	0.560	0.981	1.826	0.999	2.399	1.000	0.032	0.536	0.021	0.536	0.118	0.569
	WOH	std	0.541	0.115	0.086	0.001	0.000	0.000	0.000	0.000	0.011	0.000	0.002	0.000	0.000	0.000
	S	mean	0.067	0.993	0.200	0.991	3.538	0.999	1.306	1.000	0.022	1.000	0.024	1.000	0.212	1.000
	S	std	0.007	0.000	0.010	0.000	0.000	0.000	0.000	0.000	0.001	0.000	0.001	0.000	0.000	0.000
	SH	mean	0.073	0.993	0.693	0.991	4.184	0.999	2.517	1.000	0.017	1.000	0.018	1.000	0.092	1.000
	SH	std	0.008	0.000	0.408	0.000	0.000	0.000	0.000	0.000	0.001	0.000	0.002	0.000	0.000	0.000

Table 11.

			DynaVINS—VO		DynaVINS—VIO		StaticFusion		TartanVO		DynaSLAM		ORB-SLAMv2		RTABMap
			ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR
Ground-truth data	D	mean	1.429	0.713	0.195	0.989	22.374	0.999	1.264	1.000	0.046	0.805	0.251	0.995	0.547	0.995
	D	std	0.275	0.265	0.009	0.000	0.000	0.000	0.000	0.000	0.009	0.070	0.053	0.000	0.000	0.000
	DH	mean	1.692	0.655	7.926	0.993	14.938	0.999	1.259	1.000	0.014	0.087	0.006	0.176	0.097	0.646
	DH	std	0.568	0.307	0.290	0.000	0.000	0.000	0.000	0.000	0.010	0.026	0.002	0.005	0.000	0.000
	F	mean	1.733	0.789	2.262	0.981	2.781	0.999	4.132	1.000	0.708	0.420	0.408	0.487	0.091	0.219
	F	std	0.370	0.302	0.750	0.000	0.000	0.000	0.000	0.000	0.311	0.127	0.245	0.151	0.000	0.000
	FH	mean	0.424	0.991	0.073	0.989	0.059	0.999	0.551	1.000	0.274	0.987	0.223	1.000	0.200	1.000
	FH	std	0.290	0.003	0.003	0.000	0.000	0.000	0.000	0.000	0.072	0.038	0.083	0.000	0.000	0.000
	WO	mean	1.315	0.772	0.585	0.963	1.418	0.999	2.473	1.000	0.094	0.079	0.134	0.197	0.210	0.197
	WO	std	0.361	0.259	0.068	0.003	0.000	0.000	0.000	0.000	0.013	0.000	0.007	0.000	0.000	0.000
	WOH	mean	1.683	0.900	0.234	0.982	4.926	0.999	2.361	1.000	0.012	0.538	0.011	0.538	0.087	0.569
	WOH	std	0.452	0.092	0.027	0.001	0.000	0.000	0.000	0.000	0.001	0.000	0.001	0.000	0.000	0.000
	S	mean	0.044	0.993	0.226	0.991	22.282	0.999	1.205	1.000	0.015	1.000	0.013	1.000	0.081	1.000
	S	std	0.002	0.000	0.007	0.000	0.000	0.000	0.000	0.000	0.002	0.000	0.002	0.000	0.000	0.000
	SH	mean	0.031	0.993	0.116	0.991	2.721	0.999	2.395	1.000	0.013	1.000	0.012	1.000	0.087	1.000
	SH	std	0.006	0.000	0.005	0.000	0.000	0.000	0.000	0.000	0.003	0.000	0.002	0.000	0.000	0.000
Noisy data	D	mean	1.438	0.826	0.601	0.989	10.123	0.999	1.350	1.000	0.057	0.686	0.683	0.835	0.529	0.902
	D	std	0.298	0.225	0.059	0.000	0.000	0.000	0.000	0.000	0.009	0.160	0.067	0.075	0.000	0.000
	DH	mean	1.779	0.846	2.672	0.993	23.295	0.999	1.214	1.000	0.003	0.051	0.005	0.054	0.056	0.183
	DH	std	0.407	0.118	0.646	0.000	0.000	0.000	0.000	0.000	0.001	0.000	0.000	0.000	0.000	0.000
	F	mean	1.910	0.888	5.411	0.973	2.661	0.999	4.192	1.000	0.220	0.274	0.128	0.268	0.123	0.218
	F	std	0.405	0.199	3.811	0.002	0.000	0.000	0.000	0.000	0.034	0.025	0.023	0.030	0.000	0.000
	FH	mean	0.411	0.993	0.150	0.988	2.379	0.999	0.568	1.000	0.273	1.000	0.279	1.000	0.158	1.000
	FH	std	0.149	0.000	0.017	0.001	0.000	0.000	0.000	0.000	0.051	0.000	0.082	0.000	0.000	0.000
	WO	mean	1.180	0.890	1.117	0.960	1.724	0.999	2.399	1.000	0.113	0.080	0.135	0.197	0.265	0.197
	WO	std	0.409	0.202	0.120	0.003	0.000	0.000	0.000	0.000	0.014	0.002	0.007	0.000	0.000	0.000
	WOH	mean	1.351	0.869	0.534	0.980	2.691	0.999	2.389	1.000	0.037	0.536	0.019	0.536	0.081	0.539
	WOH	std	0.230	0.131	0.085	0.001	0.000	0.000	0.000	0.000	0.006	0.000	0.002	0.000	0.000	0.000
	S	mean	0.052	0.993	0.202	0.991	21.558	0.999	1.259	1.000	0.026	1.000	0.027	1.000	0.088	1.000
	S	std	0.003	0.000	0.011	0.000	0.000	0.000	0.000	0.000	0.002	0.000	0.002	0.000	0.000	0.000
	SH	mean	0.057	0.993	0.496	0.991	5.602	0.999	2.537	1.000	0.018	1.000	0.018	1.000	0.082	1.000
	SH	std	0.006	0.000	0.298	0.000	0.000	0.000	0.000	0.000	0.002	0.000	0.002	0.000	0.000	0.000

The results of testing the selected methods on the re-rendered static sequences are presented in Table 9. In Table 10 and Table 11 instead we report the experiments on S and SH. We can observe how, generally, all methods perform well in SH and S, with low ATE and high tracking rates. Meanwhile, in the other sequences, the results vary a lot, though at least one method consistently achieves good performance. These results show that the data generated by GRADE can be used effectively to perform visual odometry and demonstrate, at the same time, the low adaptation capabilities of some of these algorithms. The low tracking rates of RTABMap on D-static, DH-static, and F-static are to be associated with events in which the system loses track of the odometry and resets, without recovering. Notably, while both ATE and TR vary across methods, the standard deviations are generally low.

For dynamic sequences, the good ATE results of certain methods can be misleading. For example, in four out of eight sequences without added noise where the depth data is limited to 3.5 m (Table 10) DynaSLAM loses track of the trajectory for at least $\sim 27$ seconds, and up to 54 seconds. This highlights the importance of reporting both the tracking rate and the ATE errors when evaluating dynamic SLAM methods, as we will also see in the next section. Although all the evaluated methods show compelling results when tested with common datasets like TUM RGB-D or EuRoC, they exhibit several limitations when tested on our data—irrespective of noise presence or differences in the depth limits. We can see that most of the experiments performed on noisy data are, as expected, slightly worse than those performed using ground-truth images and depth values. ORB-SLAM and DynaSLAM yield, for the most part, comparable results on both metrics. DynaVINS VO generally performs worse than its VIO counterpart, showing its reliance on the IMU sensor and the low robustness of the VO approach. StaticFusion, counterintuitively, shows degrading performance when tested with the depth data limited to 5 m, with generally not compelling results. DynaVINS appears, on average, to be the most stable method across different testing methodologies, remaining unaffected by changes in depth ranges or data noise. However, its results are not easily repeatable, as indicated by the standard deviation values. DynaVINS VIO seems to be the best-performing method if we consider ATE and TR jointly. However, despite the use of the IMU, the ATE obtained with DynaVINS VIO on the DH non-noisy sequence is over 8 m for just a 60 s sequence, and just 1.582 m for the VO counterpart, as reported in Table 10. Similar results can be observed in the S and SH sequences, where DynaVINS VO shows 3 to 10 times lower ATE than the one obtained with DynaVINS VIO, indicating how the method does not always benefit from using the IMU sensor. With certain combinations, SLAM methods perform worse on statically rendered sequences than on dynamic ones, like RTABMap when tested on the D static and dynamic sequences. This may be because, in some cases, the methods use features from undetected dynamic content to track camera movement better when facing featureless areas like plain walls.

6.3.2. Dynamic V-SLAM and deep learning relation

Here, we evaluate DynaVINS (VO) and DynaSLAM jointly with several models we trained in the previous section using YOLOv5 and Mask R-CNN. We assess their performance on both synthetic sequences with the depth limited to 3.5 m of GRADE (without noise) and the TUM fr3/walking sequences (Sturm et al., 2012). The results are presented in Table 12. For Mask R-CNN we use the best-performing model on the segmentation task. The baseline results on the TUM RGB-D sequences are obtained using the baseline models. Note that we were unable to reproduce the published results for DynaVINS, especially the rpy and static sequences. This discrepancy can be attributed to several factors. First, there can be non-deterministic behavior due to variations in feature extraction, loop closure detection, and optimization. As a result, exact reproducibility across different runs is inherently challenging (Bescos et al., 2018). Additionally, differences in parameter tuning and the number of runs performed may have contributed to the observed discrepancies. Furthermore, variations in computing environments (e.g., CUDA versions, CPU, hardware configurations), compilation methods, and versions of installed libraries (e.g., OpenCV, YOLO) may influence the results. Note that for all sequences we can observe specific runs that have performance that are similar to the published results and within the computed standard deviation over 10 trials. However, the original paper (Liu et al., 2022), does not report either standard deviation or number of trials, making this comparison difficult. Despite these variations, all other experiments in our study were conducted within the same system, ensuring consistency in our evaluations.

Table 12.

ATE RMSE [M] and Tracking Rate (TR) obtained by evaluating DynaVINS and DynaSLAM while varying the model checkpoint used by the underlying YOLO and Mask R-CNN networks (columns). The evaluations are performed using both the TUM RGB-D (top half of the table) and selected GRADE sequences (bottom half of the table). We report the mean and standard deviation over ten trials. GRADE’s data has no additional noise. The depth data for these experiments is limited to 3.5 m.

			Training	S-GRADE + S-CH		A-GRADE		S-GRADE		S-GRADE + CH		S-GRADE		A-GRADE		A-GRADE + CH		A-GRADE		A-GRADE + S-CH		S-GRADE		S-CH		BASELINE
			Fine-tuning									CH		S-CH				CH				S-CH
				ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR	ATE	TR
TUM RGB-D	DynaVINS—VO	halfsphere	mean	0.074	0.968	0.062	0.968	0.051	0.968	0.063	0.968	0.064	0.968	0.100	0.967	0.060	0.968	0.069	0.968	0.064	0.968	0.072	0.967	0.148	0.966	0.097	0.967
		halfsphere	std	0.022	0.000	0.026	0.000	0.008	0.000	0.007	0.000	0.013	0.000	0.074	0.004	0.006	0.000	0.020	0.000	0.023	0.000	0.034	0.003	0.178	0.004	0.080	0.004
		rpy	mean	0.166	0.975	0.155	0.975	0.230	0.975	0.129	0.974	0.169	0.967	0.108	0.975	0.142	0.972	0.115	0.972	0.143	0.975	0.131	0.975	0.136	0.975	0.130	0.972
		rpy	std	0.028	0.000	0.018	0.000	0.023	0.000	0.018	0.004	0.061	0.008	0.015	0.000	0.048	0.006	0.016	0.005	0.020	0.000	0.012	0.000	0.030	0.000	0.027	0.005
		Static	mean	0.357	0.981	0.135	0.972	0.279	0.978	0.050	0.970	0.141	0.972	0.091	0.972	0.123	0.855	0.071	0.970	0.212	0.980	0.325	0.976	0.291	0.919	0.542	0.968
		Static	std	0.373	0.005	0.133	0.007	0.188	0.007	0.021	0.006	0.182	0.007	0.071	0.007	0.072	0.189	0.054	0.006	0.065	0.007	0.268	0.008	0.142	0.115	0.757	0.025
		xyz	mean	0.049	0.970	0.039	0.970	0.050	0.970	0.044	0.970	0.045	0.970	0.041	0.970	0.043	0.970	0.046	0.970	0.047	0.970	0.044	0.970	0.040	0.849	0.055	0.970
		xyz	std	0.013	0.000	0.006	0.000	0.010	0.000	0.010	0.000	0.010	0.000	0.005	0.000	0.005	0.000	0.011	0.000	0.008	0.000	0.008	0.000	0.007	0.241	0.014	0.000
	DynaSLAM	halfsphere	mean	0.031	1.000	0.029	1.000	0.029	0.999	0.029	0.999	0.030	0.999	0.029	1.000	0.029	0.999	0.028	1.000	0.029	1.000	0.031	0.971	0.031	0.999	0.028	1.000
		halfsphere	std	0.002	0.000	0.002	0.000	0.001	0.003	0.002	0.002	0.001	0.003	0.001	0.001	0.001	0.003	0.001	0.000	0.001	0.001	0.002	0.049	0.001	0.002	0.002	0.001
		rpy	mean	0.063	0.939	0.048	0.974	0.100	0.972	0.037	0.865	0.037	0.855	0.035	0.874	0.036	0.897	0.041	0.860	0.042	0.896	0.038	0.831	0.039	0.833	0.035	0.846
		rpy	std	0.047	0.029	0.015	0.023	0.021	0.003	0.004	0.039	0.004	0.021	0.003	0.019	0.006	0.034	0.011	0.019	0.005	0.042	0.003	0.025	0.005	0.043	0.006	0.022
		Static	mean	0.007	1.000	0.006	1.000	0.007	0.997	0.007	0.850	0.007	0.839	0.007	0.974	0.008	0.971	0.007	0.839	0.007	0.998	0.007	1.000	0.011	1.000	0.007	0.957
		Static	std	0.000	0.000	0.000	0.000	0.000	0.000	0.000	0.036	0.000	0.000	0.000	0.001	0.000	0.000	0.001	0.000	0.000	0.001	0.000	0.000	0.002	0.000	0.000	0.002
		xyz	mean	0.016	0.979	0.016	0.956	0.016	0.980	0.016	0.915	0.016	0.915	0.016	0.922	0.016	0.915	0.015	0.915	0.016	0.915	0.016	1.000	0.016	0.979	0.015	0.915
		xyz	std	0.001	0.000	0.001	0.000	0.000	0.015	0.000	0.000	0.001	0.000	0.000	0.000	0.001	0.000	0.001	0.000	0.001	0.000	0.001	0.000	0.000	0.024	0.000	0.001
GRADE	DynaVINS—VO	D	mean	1.363	0.657	1.497	0.917	1.410	0.808	1.186	0.893	1.382	0.926	1.504	0.818	1.488	0.927	1.488	0.803	1.345	0.865	1.403	0.945	1.348	0.793	1.450	0.888
		D	std	0.331	0.277	0.315	0.085	0.259	0.220	0.448	0.097	0.273	0.090	0.280	0.226	0.391	0.089	0.250	0.303	0.323	0.232	0.260	0.075	0.346	0.297	0.441	0.096
		DH	mean	1.621	0.712	1.368	0.691	1.501	0.770	2.330	0.416	1.348	0.590	1.412	0.685	1.380	0.659	2.109	0.577	2.296	0.676	1.949	0.598	1.345	0.640	1.582	0.644
		DH	std	0.228	0.247	0.271	0.174	0.501	0.201	2.139	0.287	0.267	0.265	0.343	0.280	0.405	0.256	2.523	0.208	2.143	0.252	1.690	0.281	0.222	0.218	0.468	0.301
		F	mean	1.490	0.746	1.701	0.820	1.628	0.755	1.743	0.741	1.688	0.721	1.571	0.946	1.723	0.819	1.511	0.743	1.794	0.807	1.747	0.825	1.604	0.880	1.532	0.841
		F	std	0.200	0.324	0.480	0.266	0.524	0.322	0.366	0.310	0.439	0.325	0.607	0.063	0.376	0.283	0.310	0.332	0.319	0.276	0.351	0.254	0.311	0.230	0.504	0.230
		FH	mean	0.382	0.991	0.355	0.991	0.541	0.991	0.236	0.993	0.287	0.992	0.233	0.992	0.404	0.991	0.414	0.963	0.221	0.948	0.263	0.966	0.258	0.992	0.220	0.993
		FH	std	0.259	0.003	0.209	0.003	0.374	0.003	0.073	0.000	0.218	0.002	0.154	0.002	0.295	0.003	0.214	0.081	0.057	0.091	0.107	0.079	0.219	0.002	0.058	0.000
		WO	mean	1.130	0.814	1.186	0.659	1.344	0.839	1.334	0.864	1.199	0.755	1.184	0.766	1.331	0.855	1.279	0.790	1.195	0.857	1.469	0.772	1.473	0.887	1.219	0.910
		WO	std	0.242	0.287	0.294	0.278	0.303	0.208	0.433	0.207	0.198	0.304	0.236	0.289	0.506	0.229	0.445	0.202	0.225	0.203	0.399	0.264	0.289	0.102	0.291	0.092
		WOH	mean	1.389	0.847	1.186	0.878	1.324	0.904	1.492	0.874	1.935	0.937	1.710	0.928	1.575	0.807	1.283	0.838	1.195	0.867	1.458	0.862	1.529	0.897	1.474	0.808
		WOH	std	0.522	0.146	0.506	0.104	0.340	0.143	0.572	0.139	1.117	0.075	0.509	0.071	0.725	0.158	0.479	0.156	0.415	0.177	0.429	0.162	0.418	0.137	0.548	0.120
	DynaSLAM	D	mean	0.070	0.837	0.040	0.884	0.044	0.816	0.042	0.870	0.056	0.672	0.037	0.833	0.074	0.867	0.046	0.738	0.066	0.856	0.026	0.462	0.114	0.434	0.042	0.830
		D	std	0.049	0.085	0.004	0.037	0.008	0.058	0.007	0.026	0.044	0.098	0.006	0.100	0.077	0.083	0.011	0.054	0.071	0.110	0.006	0.064	0.105	0.113	0.009	0.111
		DH	mean	0.011	0.091	0.013	0.097	0.012	0.094	0.007	0.086	0.012	0.096	0.008	0.090	0.016	0.099	0.015	0.090	0.015	0.099	0.020	0.079	0.014	0.079	0.011	0.097
		DH	std	0.011	0.025	0.008	0.026	0.009	0.023	0.005	0.030	0.008	0.032	0.006	0.023	0.012	0.026	0.014	0.033	0.009	0.020	0.020	0.023	0.010	0.023	0.007	0.028
		F	mean	0.789	0.523	0.676	0.408	0.592	0.426	0.587	0.371	0.843	0.593	0.581	0.450	0.692	0.447	0.851	0.590	0.803	0.400	0.858	0.484	0.486	0.260	0.858	0.440
		F	std	0.271	0.165	0.305	0.160	0.317	0.230	0.325	0.122	0.013	0.141	0.347	0.136	0.298	0.193	0.036	0.142	0.232	0.156	0.053	0.140	0.232	0.153	0.184	0.115
		FH	mean	0.193	0.987	0.250	0.999	0.177	1.000	0.274	0.995	0.211	1.000	0.219	1.000	0.304	1.000	0.209	0.999	0.242	1.000	0.054	0.455	0.215	0.946	0.258	1.000
		FH	std	0.063	0.025	0.061	0.001	0.039	0.000	0.066	0.008	0.034	0.000	0.029	0.000	0.228	0.000	0.030	0.002	0.067	0.001	0.023	0.266	0.071	0.085	0.054	0.000
		WO	mean	0.071	0.079	0.090	0.079	0.087	0.079	0.083	0.079	0.080	0.079	0.087	0.079	0.091	0.079	0.083	0.079	0.086	0.079	0.069	0.079	0.087	0.097	0.090	0.079
		WO	std	0.005	0.000	0.006	0.000	0.010	0.000	0.012	0.000	0.009	0.000	0.008	0.000	0.012	0.000	0.013	0.000	0.014	0.000	0.011	0.000	0.020	0.000	0.011	0.000
		WOH	mean	0.014	0.538	0.012	0.538	0.011	0.535	0.013	0.538	0.014	0.537	0.013	0.526	0.012	0.538	0.014	0.538	0.012	0.538	0.043	0.104	0.017	0.520	0.013	0.538
		WOH	std	0.001	0.000	0.002	0.000	0.002	0.000	0.001	0.000	0.003	0.000	0.003	0.003	0.002	0.000	0.002	0.000	0.001	0.000	0.015	0.086	0.007	0.003	0.002	0.000

Results on the TUM dataset indicate that changing the model for DynaVINS has no impact on TR. In contrast, DynaSLAM is highly influenced by the segmentation step, with many trials achieving TR on par with or better than the baseline. Surprisingly, this happens even when using lower-performing segmentation models. For example, with the model pre-trained on S-GRADE and fine-tuned on S-CH we can observe a tracking time 8.5% higher on the xyz sequence. For both SLAM frameworks, the ATE varies based on the model used. However, the best-performing networks are often not associated with the best ATE and TR couple. For example, using the model trained only with S-GRADE yields already compelling results. When using it with DynaVINS we can obtain the best performance on halfsphere sequence, with half of the ATE w.r.t. to the baseline. The results obtained using DynaSLAM with Mask R-CNN trained on S-GRADE show ATEs comparable to the baseline results but with higher tracking times, thus with better overall performance. At the same time, DynaSLAM on the rpy sequence with the model pre-trained on A-GRADE and fine-tuned on CH, that is, the best performing one according to the results shown in Table 7, significantly degrades the tracking rate of approximately 5%. Finally, when we evaluate the same SLAM methods using the different trained models on the different GRADE sequences, we find that it influences both the TR and ATE. However, as with the TUM RGB-D sequences, these results show no clear advantage in using the models with the highest performance as, unintuitively, even the ones performing poorly in the detection and segmentation tasks can attain higher TR and ATE w.r.t. the other ones. Notably, we can see from these results that even while using the baseline network both methods suffer from imperfect tracking rates which can go as low as 84.6% for DynaSLAM on the rpy sequence.

7. Conclusions

GRADE is a novel flexible solution for simulating robots in photorealistic dynamic environments enabling efficient research, development, and benchmarking of autonomous robotic systems. GRADE addresses the limitations of previous robotics and vision-focused simulation frameworks by providing a streamlined system for simulation setup and management, ground-truth data generation, offline and online robot testing, as well as benchmarking of robotics and visual-based (learning) methods in physical and photorealistic environments. This is achieved through the exploitation, integration, and expansion of Isaac Sim’s capabilities via customizable (animated) assets preparation and placement, data saving and processing procedures, and robot preparation, setup, and control.

We demonstrate GRADE flexibility by employing it in different case studies, ranging from simple visual data generation in physics-less simulations to heterogeneous multi-robot experiments managed by Active SLAM frameworks. Unlike previous systems that leveraged Isaac Sim, GRADE is not focused only on providing a closed framework for specific robots or applications, for example, benchmarking V-SLAM systems. Instead, it is built as close as possible to the low-level APIs of Isaac Sim, allowing for finer control and customization over the experiments. All the code and the data generated through GRADE for our experiments are provided as open-source for the benefit of the community.

With GRADE we provide the first method allowing for precise programmatic experiment repetition with adaptable surroundings in physics-enabled simulations. Data can now be modified or expanded in simple and effective ways after the simulation happens by, for example, changing surroundings conditions (e.g., removing or adding dynamic objects) or adding new sensors (e.g., a stereo camera). Unlike previous systems, our approach is not limited by fixing seed numbers or rigid simulation conditions. Instead, it extends beyond simple changes (e.g., lighting adjustments), enabling substantial modifications to the simulation environment. This is an important step towards flexible testing, higher robustness, and thus better generalization, helping reduce the sim-to-real gap.

The strong syn-to-real performance of the learned human detection and segmentation tasks demonstrates the effectiveness of our simulation and its visual realism. Notably, even without incorporating highly detailed human models with features like hair, shoes, or high-resolution textures, the generated data proves sufficiently realistic to significantly enhance network performance when combined with real-world images. Furthermore, training exclusively on GRADE synthetic data achieves results that closely match the baseline in indoor environments, greatly reducing the need for extensive data collection and manual annotation and clearly addressing the sim-to-real gap. While commercial synthetic clothed human models, such as RenderPeople or CLO, could further enhance realism and potentially improve training outcomes, we deliberately avoid their use to ensure the reproducibility and open redistribution of our generated data, which is essential for open research. Moreover, we want to emphasize the fact that we obtain these results without introducing or leveraging any domain-adaptation technique, as opposed to previous approaches. Using GRADE we can address the syn-to-real gap by generating a high amount of realistic and diverse data.

Then, our thorough testing on several state-of-the-art Dynamic V-SLAM methods using synthetic sequences obtained with GRADE shows how most methods fail to track sequences that are out of distribution compared to common datasets. They also highlight the necessity of reporting the average sequence tracking rate to correctly evaluate overall SLAM performance, as the ATE alone may mislead evaluations—especially in dynamic environments. Our results show how state-of-the-art methods fail either in correctly (i.e., high ATE) or completely (i.e., low ATE but low tracking rate) estimating the trajectories, despite their good performance when tested on common datasets. Moreover, our evaluations performed on TUM RGB-D and synthetic sequences using our trained detection and segmentation models exemplify the demand for thorough evaluations and studies in Dynamic V-SLAM. Indeed, using the best-performing trained networks does not always yield the best result, suggesting that more reliable feature rejection procedures and robust methods are needed. Notably, the enhanced realism and flexibility of GRADE have enabled rigorous and diverse testing of SLAM approaches in simulation. By closely mirroring real-world conditions, our framework allows for widespread and stress testing of state-of-the-art methods, exposing them to a broader range of scenarios and edge cases. As a result, even without explicitly introducing dedicated sim-to-real adaptation techniques, the improved evaluation process can inherently facilitate the smooth and effective transfer of these methods to real-world conditions by allowing researchers to evaluate adaptability and robustness beforehand.

Footnotes

Acknowledgments

The authors thank the International Max Planck Research School, Germany for Intelligent Systems (IMPRS-IS) for supporting Elia Bonetto.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the Cyber Valley Research Fund Project “WildCap” (CyVy-RF-2020-13).

ORCID iDs

Elia Bonetto

Chenghao Xu

Aamir Ahmad

Data Availability Statement

All code and data are provided as open-source at .

Notes

References

Abaspur Kazerouni

Fitzgerald

Dooly

, et al. (2022) A survey of state-of-the-art on visual slam. Expert Systems with Applications 205: 117734. DOI: 10.1016/j.eswa.2022.117734.

Abbyasov

Lavrenov

Zakiev

, et al. (2020) Automatic tool for gazebo world construction: from a grayscale image to a 3d solid model. In: 2020 IEEE International Conference on Robotics and Automation. IEEE, 7226–7232. DOI: 10.1109/ICRA40945.2020.9196621.

Anguelov

Srinivasan

Koller

, et al. (2005) Scape: shape completion and animation of people. ACM Transactions on Graphics 24(3): 408–416. DOI: 10.1145/1073204.1073207.

Bayraktar

Yigit

Boyraz

(2018) A hybrid image dataset toward bridging the gap between real and simulation environments for robotics: annotated desktop objects real and synthetic images dataset: adoreset. Machine Vision and Applications 30(1): 23–40. DOI: 10.1007/s00138-018-0966-3.

Bertiche

Madadi

Escalera

(2020) Cloth3d: clothed 3d humans. In: European Conference on Computer Vision. Springer, 344–359.

Bescos

Fácil

Civera

, et al. (2018) Dynaslam: tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters 3(4): 4076–4083.

Black

Patel

Tesch

, et al. (2023) BEDLAM: a synthetic dataset of bodies exhibiting detailed lifelike animated motion. In: Proceedings IEEE/CVF Conf. On Computer Vision and Pattern Recognition (CVPR), 8726–8737. IEEE.

Bonetto

Ahmad

(2023) Synthetic data-based detection of zebras in drone imagery. 2023 European Conference on Mobile Robots (ECMR). Coimbra, Portugal, pp.18,. doi: 10.1109/ECMR59166.2023.10256293.

Bonetto

Ahmad

(2024) Zebrapose: zebra detection and pose estimation using only synthetic data.

10.

Bonetto

Goldschmid

Black

, et al. (2021) Active visual SLAM with independently rotating camera. In: 2021 European Conference on Mobile Robots (ECMR), 1–8. IEEE.

11.

Bonetto

Goldschmid

Pabst

, et al. (2022) iRotate: active visual SLAM for omnidirectional robots. Robotics and Autonomous Systems 154: 104102. DOI: 10.1016/j.robot.2022.104102.

12.

Bujanca

Shi

Spear

, et al. (2021) Robust slam systems: are we there yet? In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 5320–5327.

13.

Burri

Nikolic

Gohl

, et al. (2016) The euroc micro aerial vehicle datasets. The International Journal of Robotics Research 35(10): 1157–1163. DOI: 10.1177/0278364915620033.

14.

Chang

Funkhouser

Guibas

, et al. (2015) ShapeNet: An Information-Rich 3D Model Repository. Stanford University — Princeton University — Toyota Technological Institute at Chicago.

15.

Chang

Dai

Funkhouser

, et al. (2017) Matterport3d: learning from rgb-d data in indoor environments. In: International Conference on 3D Vision (3DV), Oct. 10 2017, Qingdao, China.

16.

Chen

Yang

Mitchell

(2022) Ros-x-habitat: bridging the ros ecosystem with embodied ai. In: 2022 19th Conference on Robots and Vision (CRV), 24–31. IEEE.

17.

Deitke

VanderBilt

Herrasti

, et al. (2022) ProcTHOR: large-scale embodied AI using procedural generation. In: Oh

Agarwal

Belgrave

, et al. (eds) Advances in Neural Information Processing Systems. https://openreview.net/forum?id=4-bV1bi74M

18.

Denninger

Winkelbauer

Sundermeyer

, et al. (2023) Blenderproc2: a procedural pipeline for photorealistic rendering. Journal of Open Source Software 8(82): 4901. 10.21105/joss.04901.

19.

Downs

Francis

Koenig

, et al. (2022) Google scanned objects: a high-quality dataset of 3d scanned household items. In: 2022 International Conference on Robotics and Automation. IEEE, 2553–2560. DOI: 10.1109/ICRA46639.2022.9811809.

20.

Ebadi

Dhakad

Vishwakarma

, et al. (2022) Psp-hdri+: a synthetic dataset generator for pre-training of human-centric computer vision models. In: First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022, Mar 31 2022, Maryland, USA.

21.

Farley

Wang

Marshall

(2022) How to pick a mobile robot simulator: a quantitative comparison of coppeliasim, gazebo, morse and webots with a focus on accuracy of motion. Simulation Modelling Practice and Theory 120: 102629. DOI: 10.1016/j.simpat.2022.102629.

22.

Cai

Gao

, et al. (2021a) 3d-front: 3d furnished rooms with layouts and semantics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 10933–10942.

23.

Jia

Gao

, et al. (2021b) 3d-FUTURE: 3d furniture shape with TextURE. International Journal of Computer Vision 129(12): 3313–3337. DOI: 10.1007/s11263-021-01534-z.

24.

Furrer

Burri

Achtelik

, et al. (2016) Robot operating system (ROS): the complete reference (volume 1), chapter RotorS—a modular gazebo MAV simulator framework. Springer International Publishing, 595–625.

25.

Geiger

Lenz

Stiller

, et al. (2013) Vision meets robotics: the kitti dataset. The International Journal of Robotics Research 32(11): 1231–1237. DOI: 10.1177/0278364913491297.

26.

Greff

Belletti

Beyer

, et al. (2022) Kubric: a scalable dataset generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

27.

Hall

Talbot

Bista

, et al. (2022) Benchbot environments for active robotics (bear): simulated data for active scene understanding research. The International Journal of Robotics Research 41(3): 259–269. DOI: 10.1177/02783649211069404.

28.

Gkioxari

Dollar

, et al. (2017) Mask r-cnn. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE.

29.

Heming

Hang

, et al. (2020) Deep fashion3d: a dataset and benchmark for 3d garment reconstruction from single images. In: Computer Vision – ECCV 2020. Springer International Publishing, 512–530.

30.

Hornung

Wurm

Bennewitz

, et al. (2013) OctoMap: an efficient probabilistic 3d mapping framework based on octrees. Autonomous Robots 34(3): 189–206. DOI: 10.1007/s10514-012-9321-0.

31.

Hua

Pham

Nguyen

, et al. (2016) Scenenn: a scene meshes dataset with annotations. In: International Conference on 3D Vision (3DV). IEEE.

32.

Jacinto

Pinto

Patrikar

, et al. (2023) Pegasus simulator: an isaac sim framework for multiple aerial vehicles simulation. 2024 International Conference on Unmanned Aircraft Systems (ICUAS), Chania - Crete, Greece, 2024, pp. 917-922, doi: 10.1109/ICUAS60882.2024.10556959.

33.

Jocher

Ayush

Stoken

, et al. (2022) ultralytics/yolov5: v7.0 - yolov5 sota realtime instance segmentation. DOI:10.5281/ZENODO.7347926.

34.

Kamel

Stastny

Alexis

, et al. (2017) Model predictive control for trajectory tracking of unmanned aerial vehicles using robot operating system. In: Koubaa

(ed). Robot Operating System (ROS) The Complete Reference. Springer.

35.

Khanna

Mao

Jiang

, et al. (2024) Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16384-16393).

36.

Koenig

Howard

(2004) Design and use paradigms for gazebo, an open-source multi-robot simulator. In: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2149–2154.

37.

Kolve

Mottaghi

Han

, et al. (2017) AI2-THOR: an interactive 3D environment for visual AI. https://arxiv.org/abs/1712.05474

38.

Labbé

Michaud

(2019) Rtab-map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. Journal of Field Robotics 36(2): 416–446. DOI: 10.1002/rob.21831.

39.

Saeedi

McCormac

, et al. (2018) Interiornet: mega-scale multi-sensor photo-realistic indoor scenes dataset. In: British Machine Vision Conference (BMVC).

40.

Chen

, et al. (2021a) Hybrik: a hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3383–3393.

41.

Sang

, et al. (2021b) Openrooms: an open framework for photorealistic indoor scene datasets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7190–7199.

42.

Zhang

Wong

, et al. (2022) BEHAVIOR-1k: a benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In: 6th Annual Conference on Robot Learning. https://openreview.net/forum?id=_8DoIe8G3t

43.

Lin

Maire

Belongie

, et al. (2014) Microsoft coco: common objects in context. In: Computer Vision – ECCV 2014. Springer International Publishing, 740–755.

44.

Liu

, et al. (2022) Rgb-d inertial odometry for a resource-restricted robot in dynamic environments. IEEE Robotics and Automation Letters 7(4): 9573–9580. 10.1109/LRA.2022.3191193.

45.

Loper

Mahmood

Romero

, et al. (2015) Smpl: a skinned multi-person linear model. ACM Transactions on Graphics 34(6): 1–16. DOI: 10.1145/2816795.2818013.

46.

Mahmood

Ghorbani

Troje

, et al. (2019) AMASS: archive of motion capture as surface shapes. In: International Conference on Computer Vision. IEEE, 5442–5451.

47.

Michel

(2004) Webots: professional mobile robot simulation. Journal of Advanced Robotics Systems 1(1): 39–42. https://www.ars-journal.com/International-Journal-of-Advanced-Robotic-Systems/Volume-1/39-42.pdf

48.

Müller

Casser

Lahoud

, et al. (2018) Sim4cv: a photo-realistic simulator for computer vision applications. International Journal of Computer Vision 126(9): 902–919. DOI: 10.1007/s11263-018-1073-7.

49.

Mur-Artal

Tardós

(2017) ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras. IEEE Transactions on Robotics 33(5): 1255–1262. DOI: 10.1109/TRO.2017.2705103.

50.

Noori

Portugal

Rocha

, et al. (2017) On 3d simulators for multi-robot systems in ros: morse or gazebo? In: 2017 IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR). IEEE, 19–24. DOI: 10.1109/SSRR.2017.8088134.

51.

Pavlakos

Choutas

Ghorbani

, et al. (2019) Expressive body capture: 3D hands, face, and body from a single image. In: Proceedings IEEE Conf. On Computer Vision and Pattern Recognition (CVPR). IEEE, 10975–10985.

52.

Platt

Ricks

(2022) Comparative analysis of ros-unity3d and ros-gazebo for mobile ground robot simulation. Journal of Intelligent and Robotic Systems 106(4): 80. DOI: 10.1007/s10846-022-01766-2.

53.

Puig

Undersander

Szot

, et al. (2023) Habitat 3.0: a co-habitat for humans, avatars and robots.

54.

Pumarola

Sanchez

Choi

, et al. (2019) 3DPeople: modeling the geometry of dressed humans. In: International Conference in Computer Vision (ICCV). IEEE.

55.

Qin

Pan

Cao

, et al. (2019) A general optimization-based framework for local odometry estimation with multiple sensors. arXiv preprint arXiv:1901.03638.

56.

Quigley

Conley

Gerkey

, et al. (2009) Ros: an open-source robot operating system. In: ICRA Workshop on Open Source Software. IEEE.

57.

Raistrick

Mei

Kayan

, et al. (2024) Infinigen indoors: photorealistic indoor scenes using procedural generation. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 21783–21794.

58.

Ramakrishnan

Gokaslan

Wijmans

, et al. (2021) Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://openreview.net/forum?id=-v4OuqNs5P

59.

Roberts

Ramapuram

Ranjan

, et al. (2021) Hypersim: a photorealistic synthetic dataset for holistic indoor scene understanding. In: International Conference on Computer Vision (ICCV) 2021, Oct. 10 2021, Montreal, Canada.

60.

Roldão

de Charette

Verroust-Blondet

(2022) 3d semantic scene completion: a survey. International Journal of Computer Vision 130(8): 1978–2005. DOI: 10.1007/s11263-021-01504-5.

61.

Romero

Tzionas

Black

(2017) Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics 36(6): 1–17.

62.

Runz

Buffier

Agapito

(2018) Maskfusion: real-time recognition, tracking and reconstruction of multiple moving objects. In: 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 10–20.

63.

Saini

Bonetto

Price

, et al. (2022) Airpose: multi-view fusion network for aerial 3d human pose and shape estimation. IEEE Robotics and Automation Letters 7(2): 4805–4812. DOI: 10.1109/LRA.2022.3145494.

64.

Saito

Huang

Natsume

, et al. (2019) Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 2304-2314. 2019.

65.

Santos Pessoa de Melo

Gomes da Silva Neto

Jorge Lima da Silva

, et al. (2019) Analysis and comparison of robotics 3d simulators. In: 2019 21st Symposium on Virtual and Augmented Reality (SVR). IEEE, 242–251. DOI: 10.1109/SVR.2019.00049.

66.

Saputra

MRU

Markham

Trigoni

(2018) Visual slam and structure from motion in dynamic environments: a survey. ACM Computing Surveys 51(2): 1–36. DOI: 10.1145/3177853.

67.

Savva

Kadian*

Maksymets*

Zhao

Wijmans

Jain

Straub

Liu

Koltun

Malik

Parikh

Batra

(2019) Habitat: a platform for embodied AI research. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE.

68.

Scona

Jaimez

Petillot

Fallon

Cremers

(2018) Staticfusion: background reconstruction for dense rgb-d slam in dynamic environments. In: 2018 IEEE International Conference on Robotics and Automation. IEEE, 3849–3856. DOI: 10.1109/ICRA.2018.8460681.

69.

Shah

Dey

Lovett

Kapoor

(2017) Airsim: high-fidelity visual and physical simulation for autonomous vehicles. In: Field and Service Robotics. https://arxiv.org/abs/1705.05065

70.

Shao

Jarin-Lipschitz

Chaudhari

Kumar

(2024) Design and evaluation of motion planners for quadrotors in environments with varying complexities. In 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 10033–10039.

71.

Shen

Xia

Martín-Martín

Fan

Wang

Pérez-D’Arpino

Buch

Srivastava

Tchapmi

Vainio

Wong

Fei-Fei

Savarese

(2021) Igibson 1.0: a simulation environment for interactive tasks in large realistic scenes. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE.

72.

Straub

Whelan

Chen

Wijmans

Green

Engel

Mur-Artal

Ren

Verma

Clarkson

Yan

Budge

Yan

Pan

Yon

Zou

Leon

Carter

Briales

Gillingham

Mueggler

Pesqueira

Savva

Batra

Strasdat

Nardi

Goesele

Lovegrove

Newcombe

(2019) The Replica dataset: a digital replica of indoor spaces. arXiv preprint arXiv:1906.05797.

73.

Sturm

Engelhard

Endres

Burgard

Cremers

(2012) A benchmark for the evaluation of rgb-d slam systems. In: Proceedings of the International Conference on Intelligent Robot Systems. IEEE.

74.

Wang

Liu

(2023) Deepcloth: neural garment representation for shape and style editing. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(2): 1581–1593. DOI: 10.1109/TPAMI.2022.3168569.

75.

Szot

Clegg

Undersander

Wijmans

Zhao

Turner

Maestre

Mukadam

Chaplot

Maksymets

Gokaslan

Vondrus

Dharur

Meier

Galuba

Chang

Kira

Koltun

Malik

Savva

Batra

(2021) Habitat 2.0: training home assistants to rearrange their habitat. Proceedings of the 35th International Conference on Neural Information Processing Systems. Curran Associates Inc. p. 16.

76.

Talbot

Hall

Zhang

Bista

Smith

Dayoub

Sunderhauf

(2020) Benchbot: evaluating robotics research in photorealistic 3d simulation and on real robots. https://api.semanticscholar.org/CorpusID:220935709

77.

Varol

Romero

Martin

Mahmood

Black

Laptev

Schmid

(2017) Learning from synthetic humans. In: The IEEE / CVF Computer Vision and Pattern Recognition Conference. IEEE.

78.

Wang

Bovik

Sheikh

Simoncelli

(2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society 13(4): 600–612. DOI: 10.1109/TIP.2003.819861.

79.

Wang

Scherer

(2020) Tartanvo: a generalizable learning-based vo. Conference on Robot Learning. PMLR, 1761–1772.

80.

Wang

Yuan

Luo

Xie

Lin

Iqbal

Fidler

Khamis

(2023) Learning human dynamics in autonomous driving scenarios. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE, 20796–20806.

81.

Wang

Guo

Rong

Grigorev

Song

Zarate

Hilliges

(2024a) 4d-dress: a 4d dataset of real-world human clothing with semantic annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

82.

Wang

Xian

Chen

Wang

Fragkiadaki

Erickson

Held

Gan

(2024b) Robogen: towards unleashing infinite data for automated robot learning via generative simulation. In: Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org.

83.

Kirillov

Massa

Girshick

(2019). Detectron2. https://github.com/facebookresearch/detectron2

84.

Chen

Shi

Alonso-Mora

(2024) Decentralized multi-agent trajectory planning in dynamic environments with spatiotemporal occupancy grid maps. In 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 7208–7214. 10.1109/ICRA57147.2024.10610670.

85.

Xia

R Zamir

Sax

Malik

Savarese

(2018) Gibson env: real-world perception for embodied agents. In: Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on. IEEE.

86.

Xiu

Yang

Cao

Tzionas

Black

(2023) ECON: explicit clothed humans imized via normal integration. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.

87.

Bazavan

Zanfir

Freeman

Sukthankar

Sminchisescu

(2020) Ghum & ghuml: generative 3d human shape and articulated pose models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 6184–6193.

88.

Bonetto

Ahmad

(2025) Dynapix slam: a pixel-based dynamic visual slam approach. In: Cremers

Lähner

Moeller

Nießner

Ommer

Triebel

(eds) Pattern Recognition. Springer Nature Switzerland, 168–184.

89.

Yang

Cai

Mei

Liu

Chen

Xiao

Wei

Qing

Wei

Dai

Qian

Lin

Liu

Yang

(2023) Synbody: synthetic dataset with layered human models for 3d human perception and modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 20282–20292.

90.

Zhang

Zhen

Stevenson

(2020) A dataset for deep image deblurring aided by inertial sensor data. Electronic Imaging 2020: 379–1–379. DOI:doi:10.2352/ISSN.2470-1173.2020.14.COIMG-379.

91.

Zheng

Zhang

Tang

Gao

Zhou

(2020) Structured3d: a large photo-realistic dataset for structured 3d modeling. In: Proceedings of the European Conference on Computer Vision (ECCV). IEEE.

92.

Zhou

Zhang

Chen

Shen

(2021) Fuel: fast uav exploration using incremental frontier structure and hierarchical planning. IEEE Robotics and Automation Letters 6(2): 779–786.

93.

Zuffi

Kanazawa

Jacobs

Black

(2017) 3D menagerie: modeling the 3D shape and pose of animals. In: IEEE Conf. On Computer Vision and Pattern Recognition (CVPR). IEEE.

GRADE: Generating Realistic and Dynamic Environments for robotics research with Isaac Sim

Abstract

Keywords

1. Introduction

2. Related works

2.1. Robotics simulators

2.2. Indoor environments datasets

2.3. Simulated animated humans

3. Materials and methods

3.1. (Non-)Rigid assets preparation and placement

3.1.1. Environments

3.1.2. Objects

3.1.3. Pre-animated assets

3.1.4. Asset placement

3.2. Robot creation and control

3.2.1. Creation

3.2.2. Control

3.3. Simulation management

3.4. Post-processing tools

4. GRADE case studies

4.1. ROS-free simulation in a savanna environment

4.2. (Multi-)Robots and active SLAM

4.3. Experiment repetition and enhancement

5. Data generation

5.1. Summary of released data and code

6. Results

6.1. Experiment repetition evaluation

6.2. Syn-to-real transfer learning

6.2.1. Human detection with YOLOv5

6.2.2. Human detection and segmentation with mask R-CNN

6.3. Dynamic Visual SLAM

6.3.1. Visual SLAM performance

6.3.2. Dynamic V-SLAM and deep learning relation

7. Conclusions

Footnotes

Acknowledgments

Declaration of conflicting interests

Funding

ORCID iDs

Data Availability Statement

Notes

References