Abstract
The exponential growth of global data demands unprecedented reliability and efficiency of data center operations. Traditional manual inspection methods remain inefficient and error-prone. This article presents an autonomous mobile robot enhanced by large language models and hierarchical three-dimensional scene graphs for intelligent operation and maintenance. The core innovation lies in enabling semantic-aware navigation, allowing the robot to interpret high-level instructions like “inspect servers with the red alert light on in cold aisle B” by leveraging the three-dimensional scene graph for spatial-semantic reasoning and the large language model for task parsing. Deployed in multiple ultra-large data centers, the system has improved inspection efficiency by over 50%, autonomously patrolled over 10,000 km, and identified more than 1200 equipment anomalies. This work demonstrates a significant step towards fully autonomous, intelligent infrastructure management.
Introduction
The rapid expansion of digital services, cloud computing, and edge applications has significantly increased the scale and complexity of data centers. These facilities now house tens of thousands of servers, storage systems, and network devices, requiring continuous monitoring and maintenance to ensure high availability, energy efficiency, and operational safety.1,2 Traditional operation and maintenance (O&M) approaches, relying heavily on manual inspections and human intervention, are increasingly inadequate. They are labor-intensive, error-prone, and unable to meet 24/7 uptime demands. In response to these challenges, autonomous mobile robots have emerged as a promising solution to augment or replace human operators in performing repetitive, hazardous, or precision-critical tasks.3,4 Equipped with multi-modal sensors such as RGB cameras, thermal imagers, and LiDAR (light detection and ranging), these robots navigate complex indoor environments, collect data, and detect anomalies in real time.5–7 However, most existing robotic systems for data centers are limited in scope, focusing on specific tasks such as temperature monitoring,2,8 asset tracking,9 or simple anomaly detection.1,6 Their key limitation is the lack of a unified and intelligent framework capable of understanding high-level semantic instructions and autonomously planning complex tasks. They navigate to pre-defined coordinates rather than interpreting task-level commands such as “inspect servers with the red alert light on in cold aisle B.”
This article addresses this gap by presenting the design, implementation, and deployment of a large language model (LLM)-enhanced autonomous robot system tailored for intelligent O&M in large data centers. A key innovation is the integration of a hierarchical three-dimensional (3D) scene graph (3DSG) that provides rich spatial-semantic understanding of the environment, enabling the robot to reason about objects, regions, and their relationships. This is combined with an LLM that allows the robot to parse natural language instructions and convert them into actionable navigation as well as inspection plans. Our system supports a wide range of O&M tasks, including automated equipment inspection, environmental monitoring, and asset inventory management. It has been deployed across multiple data centers covering over 12,000 devices, accumulating more than 10,000 km of autonomous patrol mileage. To date, it has identified over 1200 equipment anomalies, demonstrating a more than 50% improvement in inspection efficiency compared to manual methods. This work represents a significant advancement toward fully autonomous, intelligent infrastructure management, with implications for future smart data centers and unmanned operations.
Related work
Robotic systems in data centers
The use of robotics in data centers has gained traction over the past decade,10 primarily for environmental monitoring and asset management.2,8 Early work by IBM researchers introduced mobile robots for thermal mapping and anomaly detection.1 These systems used basic sensors and navigation modules to collect temperature data and identify hotspots. Similarly, Mansley et al. 5 presented a mobile robotic system for temperature and humidity sampling, using interpolation methods to reconstruct thermal maps and identify cooling inefficiencies. While effective, these systems were limited to reactive monitoring and lacked semantic understanding of the environment. More recent efforts have explored cloud-based robotic platforms for scalable monitoring.6,11,12 For instance, Terrissa et al. 6 proposed a cloud robotics architecture using the Robot Operating System (ROS) for autonomous navigation and environmental sensing. Their system allowed remote users to dispatch robots to specific rooms for data collection. However, these platforms typically rely on predefined waypoints and cannot interpret high-level task descriptions or adapt to dynamic instructions.
Navigation and scene understanding
Accurate navigation and environment mapping are critical for autonomous operation in complex indoor environments. Traditional SLAM (simultaneous localization and mapping) techniques have been widely used in data center robots,13,14 but they often produce geometric maps that lack semantic information. Recent advances in 3D scene representation, such as 3DSGs, offer a promising direction by encoding both spatial and semantic relationships between objects and regions.15–17 These hierarchical representations enable robots to reason about their environment at multiple levels of abstraction, facilitating more intelligent task planning and execution.
Human–robot interaction and task planning
Most existing robotic systems for data centers rely on pre-programmed tasks or low-level command interfaces.6,10,18 This limits their flexibility and usability, especially in dynamic environments where tasks may change frequently. The integration of natural language processing (NLP) and LLMs into robotic systems has recently shown great potential for bridging this gap. LLMs can interpret ambiguous or context-rich instructions and translate them into structured task plans. 19 This capability is especially valuable in data center settings, where operators may issue commands such as “check the overheating switch in row 5” without specifying exact coordinates or device IDs.
Asset tracking and anomaly detection
Several studies have focused on automating asset tracking and fault detection using mobile robots. 10 For example, Nelson et al. 9 demonstrated a vision-based robot capable of reading LED indicators and tracking assets using active radio frequency identification (RFID) tags. Others have used thermal imaging and optical cameras to detect anomalies such as overheating or equipment failures.2,8 While these systems improve upon manual inspection, they often operate in isolation and lack integration with broader facility management systems.
System integration and deployment at scale
Few studies have reported large-scale, real-world deployments of autonomous robots in operational data centers. Most prototypes are tested in controlled lab environments or small-scale server rooms.5–7 The scalability of hardware, software, and data processing pipelines remains a significant challenge. Our work addresses this gap by presenting a fully integrated system deployed across multiple data centers, supporting thousands of devices and extensive autonomous operation.
System design of the data center intelligent O&M robot
The robot is designed as a modular system, integrating hardware for environmental interaction and software for intelligent decision-making, aligned with service robot design principles for industrial environments. A visual overview of the robot platform is provided in Figure 1.

Figure 1. Illustration of the autonomous mobile robot platform for intelligent data center operation and maintenance, showing its modular design including the sensor suite, robotic arm, and mobility base.
Hardware architecture
The inspection robot system is built upon a self-developed hardware platform, designed for high-precision autonomous operation in complex and equipment-dense server rooms. The hardware comprises three core modules: a differential-drive chassis for mobility, a 6-degree-of-freedom collaborative robotic arm for flexible manipulation, and a multi-modal sensor suite for environmental perception.
Main control unit
The system utilizes a high-performance X86 architecture industrial computer as the central processing core, running ROS along with the relevant navigation and positioning algorithm modules. The main control unit is responsible for multi-source sensor data fusion, task scheduling and management, robotic arm motion planning and control, inspection procedure execution, data communication with the cloud operations platform, and docking control with charging stations, enabling a fully intelligent workflow.
Motion unit
The two-wheel differential-drive chassis with hub motors enables flexible movement through narrow server aisles. High-precision closed-loop speed control ensures stability. The motion control board communicates in real-time with the main control unit via a CAN bus, receiving linear and angular velocity commands and providing feedback on the actual operational status of the chassis. Safety features include impact-sensitive strips for immediate stop on collision. Emergency stop buttons are located on both sides of the body. When pressed, they simultaneously lock the robotic arm and the drive wheels.
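To make the velocity interface concrete, the following minimal sketch shows how a linear and angular velocity command could be converted into wheel speeds using standard differential-drive kinematics. The chassis dimensions and all names are assumptions for illustration only; the article does not report the actual geometry or the firmware interface of the motion control board.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChassisGeometry:
    """Hypothetical chassis dimensions (not reported in the article)."""
    wheel_separation_m: float = 0.40  # distance between the two drive wheels
    wheel_radius_m: float = 0.08      # radius of each hub-motor wheel


def wheel_angular_speeds(linear_mps: float, angular_radps: float,
                         geom: ChassisGeometry = ChassisGeometry()) -> tuple:
    """Convert a body command (v, omega) into (left, right) wheel speeds in rad/s."""
    half_track = geom.wheel_separation_m / 2.0
    # Positive omega is a counter-clockwise turn, so the right wheel runs faster.
    v_left = linear_mps - angular_radps * half_track
    v_right = linear_mps + angular_radps * half_track
    return v_left / geom.wheel_radius_m, v_right / geom.wheel_radius_m
```

A pure forward command yields equal wheel speeds, while a pure rotation command yields equal and opposite wheel speeds, which is what allows turning in place inside a narrow aisle.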
Sensor unit
To meet the requirements for navigation, identification, and monitoring, the robot integrates various sensors to form a multi-layer perception system:
LiDAR (light detection and ranging) and IMU (9-axis inertial measurement unit): Two single-line LiDAR sensors are configured in a diagonal arrangement, enabling 360
RGB-D and HD cameras: Deployed at the end of the robotic arm or on the front of the chassis, these visual sensors support visual data collection, real-time video stream transmission, and scene perception.
Industrial camera: Mounted on the end of the robotic arm and combined with AI vision algorithms, it enables image enhancement and intelligent identification of equipment indicator light status, meter readings, and scenarios involving mesh occlusion.
Infrared thermal imager: Also located at the end of the robotic arm, it is used to capture the surface temperature distribution of equipment such as servers.
Robotic arm module
The robot is equipped with a 6-degree-of-freedom collaborative robotic arm installed on the top structure of the body. It possesses flexible operation and detection capabilities, allowing it to perform the following tasks:
Close-range image capture and visual identification.
High-power RFID tag reading and equipment identification.
Multi-modal equipment diagnosis (visual + infrared) by combining thermal imaging and visible-light cameras.
The aforementioned hardware modules are integrated through unified electrical and communication interfaces. Under the coordination of the main control system, they work together to support the comprehensive inspection functions of the robot in complex industrial scenarios.
Software architecture
The software architecture of the robot is designed for robust autonomy and is built upon a core onboard software system running on the ROS. This system integrates several key modules that execute directly on the X86 industrial computer of the robot. The navigation stack combines LiDAR-based SLAM 13 with real-time path planning and dynamic obstacle avoidance, enabling precise movement through narrow server aisles. The multi-modal perception module is responsible for synchronizing and processing data streams from the cameras, thermal imager, and LiDAR. It runs specialized AI models for tasks such as equipment status recognition (e.g. LED state classification) and visual meter reading. Central to the intelligence of the system is the semantic perception module, which maintains a local instance of the hierarchical 3DSG, performing real-time object recognition and spatial reasoning to understand environmental context. Finally, the task execution engine sequences and manages the behaviors of the robot, orchestrating navigation commands, robotic arm maneuvers, and inspection routines to complete assigned patrols and tasks autonomously.
To address the demands of large-scale deployments, this core system can be seamlessly upgraded to a premium cloud-edge-end architecture. This enhanced configuration is implemented upon customer request or in scenarios requiring centralized management and advanced computational services. In this setup, the end layer consists of the robots themselves, performing the aforementioned core functions. The edge layer, deployed locally within each data center zone, hosts low-latency services such as a local task execution engine and real-time anomaly filtering, which reduces bandwidth usage and ensures operational responsiveness. The cloud layer then provides high-level services, including the LLM-based instruction parser for natural language tasking, the central 3DSG engine for maintaining a unified spatial-semantic model across the entire facility, and large-scale data analytics for long-term trend analysis and model retraining. This scalable architecture ensures that the system can evolve from a standalone intelligent agent to a fully integrated, fleet-wide management solution for the most complex operational environments.
This modular hardware and software architecture provides the foundation for the core intelligent behaviors of the robot. The following section details the key technologies that enable these capabilities, focusing on the hierarchical 3DSGs for environmental understanding and the LLM for high-level task reasoning.
Key technologies for the data center intelligent O&M robot
Hierarchical 3D scene graphs for spatial-semantic perception
A holistic understanding of the complex and critical data center environment is fundamental to enabling the robot's autonomous inspection and interaction. To address the limitations of traditional geometric mapping and enable context-aware inspection tasks, we design a data center-adapted hierarchical 3DSG representation that seamlessly integrates geometric, semantic, and relational data. This approach is inspired by our previous work,15 but is specifically tailored to the unique challenges of data centers, such as the prevalence of structurally similar server racks, the importance of functional areas (e.g. hot aisle/cold aisle), the need to identify specific assets for maintenance tasks, and the need to respond to task-specific semantic requirements (e.g. server status monitoring and cabinet asset association). The 3DSG encodes both geometric precision and semantic context across hierarchical layers, enabling the robot to reason about its environment from individual devices to the entire data center.
Overall architecture of data center 3DSG
The hierarchical 3DSG for data center inspection is formally defined as

Figure 2. Block diagram illustrating the generation process of the hierarchical three-dimensional scene graph (3DSG) from raw multi-modal sensor data to structured spatial-semantic layers.
Layer-by-layer construction for data center environments
As the bottom-most layer,
The semantically labeled point cloud is integrated into a truncated signed distance field (TSDF) via ray-casting,15,16 generating a voxel-based map optimized for data centers. This map encodes free space by marking accessible paths and restricted zones to guide the robot’s path planning. Moreover, it includes semantic voxel labeling where each voxel stores probabilistic semantic labels to handle occlusions common in dense equipment environments. The voxel map is then converted into a 3D mesh via the marching cubes algorithm.15,16
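As a rough illustration of what a single voxel in such a map might store, the sketch below combines the usual weighted running average of truncated signed distances with a per-label observation count that is normalized into label probabilities. The class, its field names, the truncation distance, and the count-based label fusion are all assumptions chosen for brevity; the article does not specify how the probabilistic labels are actually fused.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Voxel:
    """Hypothetical voxel: fused TSDF value plus semantic label evidence."""
    tsdf: float = 0.0
    weight: float = 0.0
    label_counts: Counter = field(default_factory=Counter)

    def integrate(self, sdf: float, label: str,
                  truncation: float = 0.1, obs_weight: float = 1.0) -> None:
        """Fuse one ray-cast observation into the voxel."""
        # Truncate the signed distance, then update the weighted running average.
        clipped = max(-truncation, min(truncation, sdf))
        self.tsdf = (self.tsdf * self.weight + clipped * obs_weight) / (self.weight + obs_weight)
        self.weight += obs_weight
        # Simplified label fusion: count how often each label was observed.
        self.label_counts[label] += 1

    def label_probabilities(self) -> dict:
        """Normalize label counts into a probability distribution."""
        total = sum(self.label_counts.values())
        if total == 0:
            return {}
        return {label: n / total for label, n in self.label_counts.items()}
```

Keeping a distribution rather than a single label means that a voxel briefly mislabeled during an occluded view can still recover once clearer observations arrive.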
As depicted in Figure 2, each object node is augmented with two descriptors, namely a geometric descriptor and a visual-semantic descriptor to support semantic reasoning. A geometric descriptor is a fused point cloud
To enrich nodes with task-critical semantics, a large vision language model (LVLM) and LLM generate the following attributes:
Device state and operational status, derived from visual cues, for example, “Server-SN-12345: red LED lit (alarm state),” “PDU 2 (power distribution unit): voltage 220 V (normal).”
Spatial predicates, mainly relationships with other devices, for example, “Server-SN-12345 is located in Cabinet-B12-05, 3U position,” “Switch-SW-678 is mounted above Server-SN-12345.”
Affordances, indicating action possibilities for inspection, for example, “Cabinet-B12-05: door can be opened by robotic arm,” “LED of Server-SN-12345 is visible from aisle 1 without occlusion.”
Asset metadata, linked to the data center infrastructure management (DCIM) system via API, for example, “Server-SN-12345: last maintenance 2024-03-15; responsible team: Infrastructure Group 2.”
Attributes are updated in real time. When the robot re-inspects a device, new LVLM observations are fed to the LLM to refresh
Data centers are organized into functional areas rather than traditional rooms, forming
For large data center campuses, upper layers represent entire floors and buildings. These nodes are formed by clustering areas with similar heights and are annotated by the LLM based on the constituent area nodes. This provides a scalable representation essential for campus-wide tasking and logistics.
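The height-based grouping can be pictured with a very small one-dimensional clustering routine. This is only a sketch under our own assumptions: the tolerance value, the area names, and the gap-based rule are illustrative, and the article does not state which clustering method is used.

```python
def cluster_by_height(area_heights: dict, tolerance_m: float = 0.5) -> list:
    """Group area nodes into floor candidates by sorting on height and splitting at large gaps."""
    clusters = []
    last_height = None
    for name, height in sorted(area_heights.items(), key=lambda item: item[1]):
        # Start a new cluster when the gap to the previous area exceeds the tolerance.
        if last_height is None or height - last_height > tolerance_m:
            clusters.append([])
        clusters[-1].append(name)
        last_height = height
    return clusters


# Hypothetical area nodes with their floor-level heights in metres.
areas = {"cold_aisle_A": 0.0, "hot_aisle_A": 0.1, "cold_aisle_B": 4.2, "power_room": 4.0}
floors = cluster_by_height(areas)
```

Each resulting cluster would become one floor node, which the LLM could then annotate from the area nodes it contains.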
It is worth noting that this 3DSG representation is fundamentally different from traditional point cloud or occupancy grid maps. While those capture only geometry, the hierarchical structure of the 3DSG embeds semantic knowledge which forms the fundamental knowledge base that the LLM can reason over.
Data structure and computational analysis
To enable seamless interaction with the robot’s LLM-based task parser described in the next section and with data center management systems, the 3DSG is represented as a NetworkX graph serialized in JSON format.15,16 Key extensions for data centers include DCIM compatibility. Each node includes a “dcim-id” field, for example, “dcim-id: Server-SN-12345,” that links it to the DCIM system, enabling queries like “retrieve maintenance history of Server-SN-12345.” Furthermore, nodes are tagged with task relevance such as “alarm-prone” and “high-priority” to prioritize patrol routes. Figure 6 provides an example of the resulting hierarchical 3DSG data structure, illustrating the nodes (e.g. floors, rooms, and servers) and their attributes such as server status and inspection history. It is described in detail in the section presenting the experimental results.
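The sketch below illustrates what such a JSON-serialized graph might look like, using only the Python standard library so that it runs without extra dependencies. The node-and-link layout loosely mirrors the node-link style that NetworkX can export, but the exact schema, the node "Room-101", and the helper names are assumptions rather than the deployed format.

```python
import json


def build_scene_graph() -> dict:
    """Return a tiny, hypothetical 3DSG in a node-link style layout."""
    return {
        "nodes": [
            {"id": "Floor-1", "layer": "floor"},
            {"id": "Room-101", "layer": "room"},
            {
                "id": "Server-SN-12345",
                "layer": "device",
                "dcim-id": "Server-SN-12345",
                "tags": ["alarm-prone", "high-priority"],
            },
        ],
        "links": [
            {"source": "Floor-1", "target": "Room-101", "relation": "contains"},
            {"source": "Room-101", "target": "Server-SN-12345", "relation": "contains"},
        ],
    }


def nodes_with_tag(graph: dict, tag: str) -> list:
    """List the ids of nodes carrying a given task-relevance tag."""
    return [node["id"] for node in graph["nodes"] if tag in node.get("tags", [])]


graph = build_scene_graph()
serialized = json.dumps(graph, indent=2)  # what would be written to disk or sent to the parser
restored = json.loads(serialized)
```

Because the structure is plain JSON, a patrol scheduler could call `nodes_with_tag(restored, "high-priority")` to pick devices to visit first, and the "dcim-id" field could serve as the lookup key on the DCIM side.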
The construction and maintenance of the hierarchical 3DSG involve several computational stages, each with distinct resource requirements and scaling characteristics, as summarized in Table 1, measured on a per-frame basis during deployment (the main computing unit has an 11th Gen Intel Core i9-11900 processor (8 cores/16 threads, base frequency of 2.50 GHz), 32 GB of memory, and an NVIDIA Quadro RTX 4000 graphics card). Note that LLM and LVLM computations are offloaded to cloud APIs, imposing negligible local resource burden regardless of data center scale. The hierarchical architecture enables favorable scaling through two key mechanisms. First, incremental updates ensure that when new assets are added, only affected local regions require recomputation rather than the entire graph. As the 3DSG is stored as a structured JSON file, as mentioned above, updates are limited to node and attribute modifications, minimizing computing overhead for large-batch asset additions. Second, the asynchronous module design decouples processing stages: Layer 1 operations run continuously during deployment, while Layer 2 processing can be scheduled flexibly. Consequently, doubling the asset count increases the total computational load by less than 100%, demonstrating sub-linear scaling.
Table 1. Computational performance of 3DSG construction stages.
3DSG: three-dimensional scene graph; TSDF: truncated signed distance field; CPU: central processing unit; GPU: graphics processing unit.
3DSG-enabled multi-modal perception fusion for anomaly detection
The efficacy of data center operations relies on the availability of comprehensive, contextualized, and accurate observational data. The primary function of our robotic system is to serve as a highly efficient mobile sensing platform, systematically gathering rich multi-modal data from the environment. The core innovation that enables this is the integration of diverse sensory data within the unifying framework of the hierarchical 3DSG. This approach transforms raw, high-volume sensor streams into a structured, spatially grounded, and semantically annotated information base, which is then made available for review and analysis by data center operators.
The hierarchical 3DSG acts as the central repository that organizes these disparate data streams. Instead of storing sensor readings as isolated data points, the system fuses them as dynamic, time-stamped attributes of the relevant nodes within the graph. This spatial and semantic grounding is the key advantage. For instance, a thermal image is not just a picture; it is intrinsically linked to the specific server rack node in the object layer of the graph. Similarly, a visual observation of the color of an indicator light is attached directly to the corresponding server node, and a reading from an analog meter is associated with its power distribution unit node.
The outputs from specialized AI perception modules—such as those for cabinet segmentation, signal light identification, and meter reading—are used to populate this unified graph with structured observations. These modules do not function as autonomous anomaly detectors but as sophisticated data annotators. For example, a model identifies a “red LED” state, and this observation is recorded as a semantic attribute of the corresponding device node. Similarly, a meter reading is digitized and stored as a numerical value associated with its node.
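A minimal sketch of this annotation step is given below: each perception result is appended to the history of the node it belongs to, together with a timestamp. The dictionary layout, the attribute names, and the helper functions are hypothetical and serve only to illustrate the idea of time-stamped node attributes.

```python
from datetime import datetime, timezone


def record_observation(nodes: dict, node_id: str, attribute: str, value, timestamp=None) -> dict:
    """Attach a time-stamped observation to a node of the (hypothetical) scene graph."""
    if node_id not in nodes:
        raise KeyError(f"unknown node: {node_id}")
    stamp = timestamp if timestamp is not None else datetime.now(timezone.utc)
    entry = {"value": value, "timestamp": stamp.isoformat()}
    # Observations are kept as an append-only history per attribute.
    nodes[node_id].setdefault("observations", {}).setdefault(attribute, []).append(entry)
    return entry


def latest_value(nodes: dict, node_id: str, attribute: str):
    """Return the most recent value recorded for an attribute, or None if there is none."""
    history = nodes[node_id].get("observations", {}).get(attribute, [])
    return history[-1]["value"] if history else None


nodes = {"Server-SN-12345": {}, "PDU-2": {}}
record_observation(nodes, "Server-SN-12345", "led_state", "green")
record_observation(nodes, "Server-SN-12345", "led_state", "red")
record_observation(nodes, "PDU-2", "voltage_v", 220.0)
```

Keeping the full history rather than overwriting the last value is one way to support the trend analysis that operators perform later.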
This integrated approach, centered on the 3DSG, culminates in a powerful data collection and presentation system. It empowers human operators with a holistic, multi-modal view of the state of the data center. Instead of sifting through disconnected alarm logs and video footage, operators are presented with a semantically searchable and spatially organized knowledge base. This significantly reduces the cognitive load in situational assessment, allowing them to make more informed and timely decisions based on a comprehensive set of contextualized observations collected by the robot.
LLM-enhanced intelligent navigation and task planning
The hierarchical 3DSG described in the previous section serves as a rich, structured knowledge base of the data center environment. To translate this static representation into dynamic, intelligent behavior, we leverage LLMs as a core reasoning engine. This LLM-enhanced approach allows the inspection robot to interpret complex natural language commands, perform semantic searches over the 3DSG, and generate logically sound navigation and task plans, moving beyond simple waypoint following to true context-aware autonomy.
From 3DSG to executable plans
Our navigation and task planning framework, illustrated in Figure 3, follows a two-stage process inspired by SayPlan 19 and Cheng et al. 21 but tailored for industrial inspection tasks. It consists of a semantic search stage and an action planning stage.

Figure 3. Framework of the large language model (LLM)-enhanced intelligent navigation and task planning system, showing the two-stage process from natural language instruction to executable robot actions.
Given a high-level task instruction, for example, “Inspect the PDU in the server room adjacent to cold aisle A3,” the LLM is first tasked with a semantic search over a collapsed version of the 3DSG, as shown in Figure 3. This simplified graph contains only higher-level nodes. Then the LLM reasons about the task requirements to identify a relevant subgraph, a subset of the full 3DSG containing the nodes necessary to solve the task. For a data center, this involves understanding functional relationships, such as which server room contains the specified aisle and which PDU is associated with that room. Consequently, instead of processing the entire 3DSG at once, only the nodes belonging to the identified relevant subgraph pertinent to the task are provided in detail to the LLM for the action planning stage. This retrieval-augmented approach is key to applying our system in large-scale data centers.
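The collapse-then-expand idea behind the semantic search stage can be sketched as follows. In this illustration the LLM is replaced by a trivial keyword match so that the example runs offline; that stand-in, the toy graph, and all function names are assumptions and do not reflect the actual reasoning performed by the deployed system.

```python
# Hypothetical miniature 3DSG: each node stores its layer, parent, and any aisles it contains.
GRAPH = {
    "Floor-1": {"layer": "floor", "parent": None, "aisles": []},
    "Room-101": {"layer": "room", "parent": "Floor-1", "aisles": ["A3"]},
    "Room-102": {"layer": "room", "parent": "Floor-1", "aisles": ["B1"]},
    "PDU-7": {"layer": "device", "parent": "Room-101", "aisles": []},
    "Server-SN-12345": {"layer": "device", "parent": "Room-101", "aisles": []},
    "PDU-9": {"layer": "device", "parent": "Room-102", "aisles": []},
}


def collapse(graph: dict, upper_layers=("floor", "room")) -> dict:
    """Keep only higher-level nodes, hiding device-level detail."""
    return {nid: attrs for nid, attrs in graph.items() if attrs["layer"] in upper_layers}


def select_rooms(collapsed: dict, instruction: str) -> list:
    """Stand-in for the LLM: pick rooms whose aisle ids are mentioned in the instruction."""
    tokens = set(instruction.replace(",", " ").replace(".", " ").split())
    return [nid for nid, attrs in collapsed.items()
            if attrs["layer"] == "room" and tokens & set(attrs["aisles"])]


def expand(graph: dict, selected: list) -> dict:
    """Return the selected nodes together with all of their descendants."""
    keep = set(selected)
    grew = True
    while grew:
        grew = False
        for nid, attrs in graph.items():
            if nid not in keep and attrs["parent"] in keep:
                keep.add(nid)
                grew = True
    return {nid: graph[nid] for nid in graph if nid in keep}


def relevant_subgraph(graph: dict, instruction: str) -> dict:
    """Two-step retrieval: search the collapsed graph, then expand only the matches."""
    return expand(graph, select_rooms(collapse(graph), instruction))


subgraph = relevant_subgraph(GRAPH, "Inspect the PDU in the server room adjacent to cold aisle A3")
```

Only the expanded subgraph, here one room and its two devices, would be passed on in detail to the action planning stage, which keeps the prompt size bounded as the facility grows.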
Once the relevant subgraph is identified, the LLM is prompted to generate a sequence of actionable steps. This includes navigational actions such as “goto_room,” “goto_aisle,” “goto_node” and inspection-specific actions such as “scan_rack_label,” “check_temperature_sensor,” and “inspect_PDU_status.” The plan is grounded in the spatial relationships and node attributes encoded in the 3DSG, ensuring feasibility.
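One simple way to enforce this grounding is to check every generated step against the allowed action set and the nodes of the relevant subgraph before execution. The sketch below assumes a plan is a list of (action, target) pairs; that plan format, the node names, and the validator itself are illustrative assumptions, while the action names are those listed above.

```python
ALLOWED_ACTIONS = {
    "goto_room", "goto_aisle", "goto_node",
    "scan_rack_label", "check_temperature_sensor", "inspect_PDU_status",
}


def validate_plan(plan: list, known_nodes: set) -> list:
    """Return a list of human-readable problems; an empty list means the plan is grounded."""
    problems = []
    for index, (action, target) in enumerate(plan):
        if action not in ALLOWED_ACTIONS:
            problems.append(f"step {index}: unknown action '{action}'")
        if target not in known_nodes:
            problems.append(f"step {index}: unknown target '{target}'")
    return problems


# Hypothetical node ids taken from a relevant subgraph.
known = {"Room-101", "Aisle-A3", "PDU-7"}
good_plan = [("goto_room", "Room-101"), ("goto_node", "PDU-7"), ("inspect_PDU_status", "PDU-7")]
bad_plan = [("goto_room", "Room-999"), ("open_door", "PDU-7")]
```

A plan that fails validation could be returned to the LLM together with the problem list for another attempt instead of being sent to the robot.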
Prompt engineering and scoring mechanism for reliable data center operations
The reliability of LLM-based planning is crucial in a critical environment like a data center. We employ several strategies to enhance robustness.
First, the prompt templates are carefully engineered, as illustrated in Figure 7 in the performance evaluation section. The prompt template consists of several core components: a system role definition (“You are a data center O&M robot agent”), the current 3DSG context, the task instruction from the user, a set of allowed actions, strict output format constraints, and, crucially, examples of successful task decomposition. This design effectively constrains the general reasoning capabilities of the LLM to the specific domain of robot task planning, significantly improving the stability and correctness of the generated plans.
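The components listed above could be assembled into a single prompt along the following lines. Only the role sentence is taken from the article; the section headings, the output format wording, and the function signature are assumptions made for this sketch and are not the template shown in Figure 7.

```python
def build_prompt(scene_context: str, instruction: str,
                 allowed_actions: list, examples: list) -> str:
    """Assemble a planning prompt from its components (illustrative layout only)."""
    sections = [
        "You are a data center O&M robot agent.",
        "Scene graph context:\n" + scene_context,
        "Task instruction:\n" + instruction,
        "Allowed actions:\n" + "\n".join(f"- {action}" for action in allowed_actions),
        # Assumed output constraint; the real template may use a different format.
        "Output format: a JSON list of [action, target] pairs and nothing else.",
        "Examples:\n" + "\n".join(examples),
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    scene_context='{"Room-101": {"devices": ["PDU-7"]}}',
    instruction="Inspect the PDU in the server room adjacent to cold aisle A3",
    allowed_actions=["goto_room", "goto_node", "inspect_PDU_status"],
    examples=['[["goto_room", "Room-101"], ["inspect_PDU_status", "PDU-7"]]'],
)
```

Listing the allowed actions explicitly and fixing the output format makes the response machine-checkable before anything is executed.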
To mitigate the risk of hallucination common to LLMs, we employ a scoring mechanism 15 in which the LLM double-checks the rationality of its proposed action sequences. For instance, in the data center inspection scenario, before generating a plan to “inspect the backup PDU,” the LLM can be queried to validate whether the target PDU node indeed has an attribute like “function: backup.” This self-validation step enhances the credibility of the generated plans.
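In the article this check is carried out by querying the LLM itself. As a complementary illustration, the sketch below performs the same kind of verification deterministically against node attributes and reports the fraction of claims that hold. The claim format, node names, and scoring rule are our own assumptions and are not the scoring mechanism of the cited work.

```python
def plan_score(claims: list, nodes: dict) -> float:
    """Return the fraction of (node_id, attribute, expected_value) claims confirmed by the graph."""
    if not claims:
        return 1.0
    confirmed = sum(
        1 for node_id, attribute, expected in claims
        if nodes.get(node_id, {}).get(attribute) == expected
    )
    return confirmed / len(claims)


# Hypothetical node attributes from a relevant subgraph.
nodes = {"PDU-7": {"function": "backup"}, "PDU-9": {"function": "primary"}}
score_ok = plan_score([("PDU-7", "function", "backup")], nodes)
score_mixed = plan_score([("PDU-7", "function", "backup"), ("PDU-9", "function", "backup")], nodes)
```

A plan whose score falls below a chosen threshold could be rejected or regenerated before the robot moves.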
Tailoring for data center inspection
The general LLM-based planning framework is specifically adapted to meet the stringent demands of data center operations, ensuring both performance and practicality. This tailoring is achieved through several key adaptations. First, the action lexicon of the robot is composed of task-oriented verbs that directly support its inspection mission, such as “record_thermal_image” and “verify_power_led,” moving beyond generic navigation commands to enable precise operational tasking. Furthermore, the reasoning capabilities of the LLM are honed to understand data center-specific functional layouts. It is prompted to comprehend the logical relationships between critical areas, such as the airflow sequence from cold aisles to server racks and then to hot aisles. This spatial understanding enables the generation of highly efficient inspection paths, for example, a complete thermal survey route that minimizes redundant travel. Finally, the inherent scalability of the hierarchical 3DSG is fully leveraged. The system scales from managing a single server room to encompassing an entire data center hall or multi-building campus. The LLM can efficiently narrow its semantic search to a specific floor or functional zone within the graph, ensuring the planning process remains robust and computationally manageable even in vast, complex environments.
Deployment and performance validation
The deployment of the proposed autonomous robot spans a number of data centers featuring diverse structures with altogether over 12,000 devices, covering a total floor area of 50,000 m²
Deployment of the autonomous robot in a large data center
This section describes the deployment of the autonomous robot in a large, multi-room data center facility operated by one of our clients. The data center comprises three server rooms, each containing approximately 200 server cabinets. The facility required daily inspection across two shifts to ensure operational stability and asset security. Key challenges included the high density of equipment, the need for accurate asset tracking, and the labor-intensive nature of manual inspections.
The deployment process began with the robot performing an initial mapping round. Utilizing its LiDAR-inertial odometry pipeline and fine-tuned perception models, the robot constructed a centimeter-accurate geometric map while simultaneously segmenting and classifying critical objects. This raw perceptual data was then fused into the hierarchical 3DSG, establishing the layers as presented in previous sections. The semantically annotated voxel map of Layer-1 enabled collision-free navigation, while Layer-2 instantiated nodes for individual devices, enriching them with geometric and visual-semantic descriptors. An LLM was subsequently employed to annotate Layer-3 functional areas based on the objects contained within them, creating a comprehensive spatial-semantic model of the environment.
With the 3DSG serving as a dynamic knowledge base, the system demonstrated its capability for intelligent task execution. Operators could issue high-level natural language commands, which the LLM-based instruction parser would interpret. The parser performed a semantic search over the 3DSG to identify the relevant subgraph and reason about the relationships between areas and equipment. It then generated a grounded, executable plan comprising a sequence of navigational and inspection-specific actions.
During task execution, the robot autonomously navigated to the specified locations using the 3DSG for context-aware path planning. Upon reaching a target, it leveraged its multi-modal sensors to perform the required inspection actions. The robotic arm positioned the thermal imager and high-resolution camera to capture data. The entire process, from navigation to sensor deployment, is monitored in real-time through our cloud-edge platform, as illustrated in Figure 4 (The online platform used by the robot system is currently implemented in Chinese, as all deployments of the robot to date have been domestic. For the purpose of this article, English annotation has been included in the relevant example outputs presented in this section and in the Appendix.). Real-time sensor readings were processed by onboard AI modules, and the results—such as temperature readings and equipment statuses—were immediately fused back into the corresponding nodes in the 3DSG as updated time-stamped attributes.

Figure 4. Real-time monitoring interface of the cloud-edge platform, displaying the current task execution of the robot, navigation status, and sensor data streams during an inspection patrol.
This integrated workflow yielded significant operational results. The robots successfully identified a substantial number of equipment anomalies, such as overheating components and faulty indicator states, which were immediately flagged in the system for operator review. A screenshot exemplifying the final output, presenting structured anomaly reports and multi-modal data (e.g. thermal images and alarm light status) for operator review, is shown in Figure 5 (cf. the Appendix for more examples of the inspection results). The inspection frequency increased dramatically compared to manual methods. Prior to robot deployment, manual inspections were conducted twice per day. With the autonomous robot system, inspection frequency increased to a maximum of 12 times per day, that is, six times the manual frequency (a 500% increase in patrol coverage). This ensures more timely anomaly detection and continuous environmental monitoring. Furthermore, the automation of data collection and its direct integration into the semantically structured 3DSG eliminated manual report compilation, drastically reducing administrative overhead. Previously, compiling monthly statistical reports took approximately 4 days of manual effort. With the automated system, reports are now generated on demand and in real time, effectively reducing report preparation time by over 75%. Define the inspection efficiency gain as the percentage of reduction in time required to complete an inspection route and to compile the report compared to manual inspection, averaged over multiple trials across different data centers:
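One possible formalization of this verbal definition is sketched below. All symbols are our own assumed notation rather than the authors': N is the number of trials, and each time T is the sum of the route completion time and the report compilation time for trial i, under manual inspection and robot inspection respectively.

```latex
\eta = \frac{1}{N} \sum_{i=1}^{N}
       \frac{T_{\mathrm{manual},i} - T_{\mathrm{robot},i}}{T_{\mathrm{manual},i}} \times 100\%,
\qquad
T_{\cdot,i} = T^{\mathrm{route}}_{\cdot,i} + T^{\mathrm{report}}_{\cdot,i}
```

Under this reading, a gain above 50% means that, on average, the robot needs less than half of the manual time for the combined route and reporting work.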

Representative inspection results output, showing structured anomaly reports with corresponding thermal images and visual evidence for operator review (bottom).
LLM-enhanced task planning based on 3DSGs
To quantitatively evaluate the performance of our LLM-enhanced navigation and task planning system, we conducted extensive experiments across three representative data center scenarios. An example of the data structure of the hierarchical 3DSG is depicted in Figure 6. (A visualization of the 3DSG as a semantically annotated mesh cannot be included in this article due to confidentiality agreements and the sensitive nature of the data centers; Figure 6 provides a schematic representation of the data structure instead.) The graph expands from the “Floor” layer through the “Room” layer down to the “Device” layer. Each room node and each server equipment node has its own set of attributes. Room node attributes include the room name, serial number, location, the floor it belongs to, room area, and node information. The location is obtained from geometric mapping, whereas the node information is extracted by the LLM based on the characteristics of the room. Server node attributes include the device name, serial number, location, serial number of the room it belongs to, DCIM serial number, detection history record, latest detection details, and voltage status. The detection history record logs the inspection results of the server, where “0” represents normal operation and “1” represents abnormal operation. The latest detection details describe the last inspection of the device and consist of two elements. The first element is the anomaly level: “1” for a minor warning, “2” for a serious warning, and “3” for a critical warning. The second element is the processing status of the anomaly, updated in real time: “0” for unresolved and “1” for resolved. On the other hand, “

Schematic representation of the hierarchical three-dimensional scene graph (3DSG) data structure for data center environments, illustrating node relationships and attributes across floor, room, and device layers.
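The three-layer structure described above can be illustrated with a minimal Python sketch. The class names, field names, and the helper `unresolved_devices` are assumptions made for illustration and are not taken from the deployed system; in particular, the encoding of the voltage status is left open here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Position = Tuple[float, float, float]


@dataclass
class DeviceNode:
    name: str
    serial: str
    location: Position
    room_serial: str
    dcim_serial: str
    # Detection history: 0 = normal operation, 1 = abnormal operation.
    history: List[int] = field(default_factory=list)
    # Latest detection details: (anomaly level 1..3, status 0 = unresolved / 1 = resolved).
    latest: Optional[Tuple[int, int]] = None
    # Encoding of the voltage status is an assumption (free-form string here).
    voltage_status: Optional[str] = None


@dataclass
class RoomNode:
    name: str
    serial: str
    location: Position          # obtained from geometric mapping
    floor: str
    area_m2: float
    info: str = ""              # LLM-generated node information
    devices: List[DeviceNode] = field(default_factory=list)


@dataclass
class FloorNode:
    name: str
    rooms: List[RoomNode] = field(default_factory=list)


def unresolved_devices(floor: FloorNode, min_level: int = 1) -> List[str]:
    """Serial numbers of devices whose latest anomaly is still unresolved."""
    found = []
    for room in floor.rooms:
        for dev in room.devices:
            if dev.latest is None:
                continue
            level, status = dev.latest
            if status == 0 and level >= min_level:
                found.append(dev.serial)
    return found
```

A query such as `unresolved_devices(floor, min_level=2)` would then return the devices on that floor with an open serious or critical warning, which is the kind of attribute lookup the planner relies on.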
The prompt structure for task planning used in the experimental setup is illustrated in Figure 7. It comprises the role of the agent, the task instruction, the 3DSG, the robot location, the inspection action commands, the output format, and an example. The agent role defines the inspection tasks for the data center O&M robot agent. The task instruction is the inspection command issued by the user. The robot location is the real-time position of the data center O&M robot, obtained by a re-localization algorithm. The inspection action commands list the inspection actions that the LLM can plan, including going to a room or floor, approaching server equipment, scanning device labels, checking temperature, detecting power, and checking PDU status. The output format specifies the output of the LLM, which contains step-by-step inspection action commands and a chain of thought. The example provides a sample inspection task to guide the task reasoning of the LLM.

Example prompt structure used for LLM-based task planning, showing the components including agent role, task instruction, 3DSG context, and output format specification. LLM: large language model; 3DSG: three-dimensional scene graph.
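As a rough illustration of how the seven prompt components could be assembled, consider the following sketch. The section titles, the action identifiers in `ACTIONS`, and the function `build_prompt` are hypothetical; the actual prompt wording used in the experiments is the one shown in Figure 7.

```python
import json

# Hypothetical identifiers for the six inspection actions named in the text.
ACTIONS = [
    "goto(room_or_floor)",
    "approach(server)",
    "scan_label(server)",
    "check_temperature(server)",
    "detect_power(server)",
    "check_pdu(server)",
]


def build_prompt(instruction: str, scene_graph: dict,
                 robot_location: str, example: str) -> str:
    """Concatenate the prompt components in a fixed order."""
    sections = [
        ("Role", "You are the inspection agent of a data center O&M robot."),
        ("Task instruction", instruction),
        ("3DSG", json.dumps(scene_graph, ensure_ascii=False)),
        ("Robot location", robot_location),
        ("Inspection action commands", "\n".join(ACTIONS)),
        ("Output format",
         "Give a chain of thought, then numbered inspection action commands."),
        ("Example", example),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)
```

Keeping the 3DSG as a separate, serialized section makes it straightforward to substitute a retrieved subgraph for the full graph without touching the rest of the prompt.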
A key aspect of our validation was to assess the impact of enriched semantic node information on the planning reliability of the LLM. As summarized in Table 2, for each test scenario, the LLM was leveraged to analyze the 3DSG and generate distinctive node information for each room or area, such as the number of servers and their historical inspection patterns. This high-level summary enabled the LLM to reason about user intent and identify the most relevant areas for a given task.
Room node information generated by the large language model (LLM) for the three experimental scenarios, showing server counts and historical inspection patterns used for semantic reasoning.
The semantic search and task planning results across the three scenarios are presented in Figures 8 to 10, respectively.

Semantic search and task planning results for Scenario 1, comparing planning performance with and without access to semantic node information for batch inspection and historically contextual inspection commands.

Semantic search and task planning results for Scenario 2, demonstrating the capability of the system to handle both straightforward and historically contextual inspection commands.

Semantic search and task planning results for Scenario 3, showing robust performance in navigating to specific server locations and executing detailed inspection routines.
In Scenario 1 (Figure 8), for a batch inspection command covering the entire area, the LLM planner without access to node information frequently required multiple iterative searches or failed entirely by targeting incorrect rooms. In contrast, when provided with node information, the LLM quickly identified the room most relevant to the user’s command and generated a correct, efficient sequence of inspection actions. In Scenario 2 (Figure 9), the LLM could handle straightforward commands with or without node information, but the semantic information became critical for more specific instructions. For instance, commands targeting servers with particular historical records could only be planned and executed successfully when the LLM could reference the relevant node attributes to locate the correct devices. Finally, in Scenario 3 (Figure 10), the robot succeeded in all tasks by navigating to specific server locations and performing detailed inspections, demonstrating the robustness of the system when it fully leverages the semantically rich 3DSG.
These experiments conclusively demonstrate that the integration of the hierarchical 3DSG with the LLM-based planner is crucial for achieving reliable, context-aware autonomy. The semantic node information directly enables the understanding of complex, ambiguous user commands, allowing the system to fulfill diverse inspection needs that would be infeasible with a purely geometric map or a planner lacking semantic reasoning capabilities.
Furthermore, we compare our proposed hierarchical 3DSG-based method with a single-layer scene graph-based baseline.17 Using the data center scenario shown in Figure 8, we conducted experiments with the first two representative task instructions in this example. Each method executed each instruction 50 times. We measured three metrics: (i) average token consumption, reflecting computational efficiency and LLM usage cost; (ii) success rate, measuring task completion reliability; and (iii) average LLM processing time, indicating response latency.
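For concreteness, the three metrics can be aggregated from per-trial logs as in the sketch below. The `Trial` record and the function names are assumptions for illustration; they do not describe the logging format actually used in the experiments.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Trial:
    tokens: int          # tokens consumed by the LLM in this run
    success: bool        # whether the task was completed correctly
    llm_seconds: float   # LLM processing time in seconds


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Average token consumption, success rate, and average LLM time."""
    return {
        "avg_tokens": mean(t.tokens for t in trials),
        "success_rate": sum(t.success for t in trials) / len(trials),
        "avg_llm_seconds": mean(t.llm_seconds for t in trials),
    }


def token_reduction(baseline_avg: float, proposed_avg: float) -> float:
    """Percentage reduction in tokens of the proposed method vs. the baseline."""
    return (baseline_avg - proposed_avg) / baseline_avg * 100.0
```

With 50 runs per instruction and method, `summarize` would be called once per (method, instruction) pair, and `token_reduction` would then compare the two resulting averages.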
The corresponding results are summarized in Table 3. The proposed method consumes significantly fewer tokens (a 79% reduction for Instruction 1 and a 46% reduction for Instruction 2) thanks to the hierarchical structure, which enables iterative subgraph retrieval: only relevant nodes are expanded for detailed reasoning, so the LLM never has to process the entire graph. This makes the proposed approach especially suitable for large-scale data center scenarios. In addition, our method achieves substantially higher success rates, as shown in Table 3.
Performance comparison between the proposed hierarchical three-dimensional scene graph (3DSG)-based method and a single-layer scene graph-based baseline.
The hierarchical representation allows the LLM to perform a targeted semantic search across layers, whereas the single-layer scene graph overwhelms the LLM with flat, unstructured node relationships, leading to confusion and planning failures. The proposed method requires a slightly higher average processing time because of its multi-round iterative search, but this trade-off yields dramatically improved reliability and token efficiency. The response latency is acceptable for data center inspection tasks. It is also much lower than in the alternative workflow, in which inspection spots must be identified manually by searching through several data sources, including the data center layout and historical inspection records, and the resulting inspection tasks must then be assigned to the robot through further programming that takes spatial information into account.
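The idea of layer-by-layer retrieval can be sketched as follows. This is a simplified stand-in under stated assumptions: the graph is a nested dictionary, and a keyword matcher (`keyword_chooser`) plays the role that the LLM plays in the actual system when it selects the most relevant node at each layer. All names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, dict]  # node name -> {"info": str, "children": Graph}


def retrieve(graph: Graph,
             choose: Callable[[Dict[str, str]], str]) -> Tuple[List[str], int]:
    """Descend one layer at a time, exposing only that layer's summaries.

    Returns the selected path and the number of node summaries shown to the
    chooser, a rough proxy for prompt size.
    """
    path, seen = [], 0
    layer = graph
    while layer:
        summaries = {name: node["info"] for name, node in layer.items()}
        seen += len(summaries)
        picked = choose(summaries)
        path.append(picked)
        layer = layer[picked].get("children", {})
    return path, seen


def count_nodes(graph: Graph) -> int:
    """Total number of nodes, i.e. what a flat graph would expose at once."""
    return sum(1 + count_nodes(n.get("children", {})) for n in graph.values())


def keyword_chooser(keyword: str) -> Callable[[Dict[str, str]], str]:
    """Toy replacement for the LLM: pick the first node mentioning the keyword."""
    def choose(summaries: Dict[str, str]) -> str:
        for name, info in summaries.items():
            if keyword in info:
                return name
        return next(iter(summaries))
    return choose
```

Comparing the second return value of `retrieve` with `count_nodes` for the same graph gives an intuition for why the hierarchical method needs several rounds of interaction yet exposes far fewer nodes per query than a flat representation.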
Failure mode analysis and safety considerations
In this subsection, we present a systematic analysis of failure modes observed during 12 months of deployment across multiple data centers, covering over 1000 task executions. We group failures into three primary categories according to their origin. Perception failures, occurring when the robot misdetects or misclassifies objects, account for
Summary of failure cases.
Since planning failures are most directly relevant to the LLM-based components of our system, we provide a detailed breakdown of their subcategories. Action sequencing errors, where operations are performed in the wrong order, represent the most common planning failure at 1.6% of complex tasks, whereas missing steps in multi-step tasks account for 0.7% of failures. Target misidentification, where the robot selects the wrong device or location based on ambiguous instructions, occurs in 0.9% of tasks. Furthermore, hallucinations referencing non-existent devices or locations occur in 0.3% of cases. The scoring mechanism, which performs self-consistency checks through multiple independent LLM queries, achieves a 92% detection rate for planning failures overall.
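A self-consistency check of this kind can be sketched as below. The text only states that multiple independent LLM queries are compared; the majority-agreement score, the threshold, and the function names are assumptions made for this illustration, and `query_llm` is a placeholder for whatever call returns a list of action commands.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Plan = Tuple[str, ...]


def self_consistency_score(query_llm: Callable[[str], List[str]],
                           prompt: str,
                           n_queries: int = 5) -> Tuple[Plan, float]:
    """Query the planner several times and score the most frequent plan."""
    plans = [tuple(query_llm(prompt)) for _ in range(n_queries)]
    best, votes = Counter(plans).most_common(1)[0]
    return best, votes / n_queries


def accept_plan(query_llm: Callable[[str], List[str]],
                prompt: str,
                n_queries: int = 5,
                threshold: float = 0.6) -> Tuple[Optional[Plan], float]:
    """Return the majority plan, or None if agreement is below the threshold."""
    plan, score = self_consistency_score(query_llm, prompt, n_queries)
    return (plan if score >= threshold else None), score
```

A `None` result would be treated as a detected planning failure and handed to the fallback logic instead of being executed.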
In addition to planning, LLMs are employed in the generation of 3DSGs. Therefore, we specifically analyzed labeling accuracy during 3DSG construction. We evaluated 80 zones across multiple data centers by comparing LLM-generated labels against human-annotated ground truth. The system achieved an annotation accuracy of 92.5%. Thanks to the polling mechanism, zones with low confidence scores are identified and receive generic labels such as “aisle.” They are also flagged for operator review during initial mapping.
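The polling step could look roughly like the following sketch, which assumes that the LLM is asked several times for a zone label and that the confidence score is the share of votes for the most frequent answer. The threshold value and the function name `poll_label` are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def poll_label(candidates: List[str],
               min_confidence: float = 0.7,
               generic: str = "aisle") -> Tuple[str, float, bool]:
    """Return (label, confidence, needs_review) for one zone.

    Low-confidence zones fall back to a generic label and are flagged
    for operator review.
    """
    label, votes = Counter(candidates).most_common(1)[0]
    confidence = votes / len(candidates)
    if confidence < min_confidence:
        return generic, confidence, True
    return label, confidence, False
```

For example, four votes for "cold aisle" out of five would be accepted, whereas an evenly split poll would be replaced by the generic label and flagged.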
During execution, real-time monitoring provides continuous protection. LiDAR-based collision detection can trigger emergency stops if obstacles enter safety margins. Progress monitoring applies timeout and retry logic to each action, with failures triggering replanning or fallback. Localization verification continuously validates robot pose against the map, pausing for relocalization if confidence drops. Health monitoring checks sensor and actuator status, aborting missions if critical failures occur.
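The timeout-and-retry part of progress monitoring can be illustrated with a short sketch. It is deliberately simplified and rests on assumptions: each action is a callable that returns by itself, the elapsed time is checked after the action returns rather than by interrupting it, and the name `run_with_retry` is hypothetical.

```python
import time
from typing import Callable, Tuple


def run_with_retry(action: Callable[[], bool],
                   timeout_s: float,
                   max_retries: int,
                   clock: Callable[[], float] = time.monotonic) -> Tuple[bool, int]:
    """Run an action up to max_retries times.

    An attempt counts as successful only if the action reports success and
    finishes within timeout_s. Returns (succeeded, attempts_used); a False
    result is the signal for the caller to replan or fall back.
    """
    for attempt in range(1, max_retries + 1):
        start = clock()
        try:
            ok = bool(action())
        except Exception:
            ok = False  # treat any error raised by the action as a failed attempt
        elapsed = clock() - start
        if ok and elapsed <= timeout_s:
            return True, attempt
    return False, max_retries
```

Injecting the clock as a parameter keeps the logic testable without real waiting; on a robot, the action would additionally need to be cancellable so that a stalled motion is actually stopped at the deadline.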
When failures occur despite these preventive measures, our fallback protocols provide graceful degradation. If LLM plan validation fails, the system switches to a rule-based planner with predefined inspection patterns. If navigation is persistently blocked by dynamic obstacles, the system replans the path and, if unsuccessful, requests operator assistance. If sensors fail during inspection, the system skips affected tasks, continues with remaining tasks, and logs the issue for maintenance. If communication with the cloud is lost, the system completes the current task using cached patterns before returning to home. For critical safety violations, the system executes an immediate emergency stop and notifies the operator.
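The fallback protocols listed above amount to a mapping from failure type to an ordered list of recovery steps. The sketch below encodes that mapping; the enum members, step names, and the `handle` function are hypothetical labels for the behaviors described in the text, and `execute` stands in for the component that actually carries out a step.

```python
from enum import Enum, auto
from typing import Callable, Dict, List


class Failure(Enum):
    PLAN_VALIDATION = auto()
    NAVIGATION_BLOCKED = auto()
    SENSOR_FAULT = auto()
    CLOUD_LINK_LOST = auto()
    SAFETY_VIOLATION = auto()


FALLBACKS: Dict[Failure, List[str]] = {
    Failure.PLAN_VALIDATION: ["switch_to_rule_based_planner"],
    Failure.NAVIGATION_BLOCKED: ["replan_path", "request_operator_assistance"],
    Failure.SENSOR_FAULT: ["skip_affected_tasks", "continue_remaining_tasks",
                           "log_for_maintenance"],
    Failure.CLOUD_LINK_LOST: ["finish_current_task_from_cache", "return_home"],
    Failure.SAFETY_VIOLATION: ["emergency_stop", "notify_operator"],
}


def handle(failure: Failure, execute: Callable[[str], bool]) -> List[str]:
    """Run the fallback steps for a failure and return the steps attempted.

    For blocked navigation the operator is only contacted if replanning
    fails; for all other failures every step is carried out in order.
    """
    attempted = []
    for step in FALLBACKS[failure]:
        ok = execute(step)
        attempted.append(step)
        if failure is Failure.NAVIGATION_BLOCKED and ok:
            break
    return attempted
```

Keeping the policy in a plain table makes it easy to audit and to extend with new failure types without changing the dispatch logic.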
Conclusion
This article has detailed the design, key technologies, and real-world validation of an LLM-enhanced autonomous robot system for intelligent data center O&M. We demonstrated that the integration of a hierarchical 3DSG with LLMs creates a powerful foundation for context-aware robotic autonomy in complex, mission-critical environments. The 3DSG provides a unified spatial-semantic knowledge base, while the LLM serves as a cognitive engine to interpret high-level human instructions as well as to generate robust and executable plans. The effectiveness of the system has been proven through extensive large-scale deployment. Operating across data centers with more than 12,000 devices, our robot fleet successfully navigated over 10,000 km autonomously, detected and reported more than 1200 anomalies, and increased inspection efficiency by over 50%. As quantitatively analyzed in the previous section, the semantic reasoning enabled by the 3DSG was crucial for reliably completing a wide range of inspection tasks, from specific device checks to area-wide patrols.
This work validates the potential of the integration of LLMs and 3DSGs to transform traditional data center O&M practices, paving the way for truly intelligent, adaptive, and scalable infrastructure management systems. Future work will focus on multi-robot collaboration, predictive maintenance using the collected time-series data, and closing the loop with automated remediation actions. In addition, further evaluations will be conducted on perception robustness across diverse data center environments, strengthening overall system reliability for mission-critical deployments.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Key R&D Program of Shandong Province, China, under Grant No. 2024CXGC010213 and Grant No. 2023CXPT094.
Declaration of competing interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Appendix: Further examples of inspection results
To complement the results presented in the main text, we include additional examples of the inspection outputs generated by the autonomous robot system. Figure 11 illustrates the task management interface, where operators can monitor real-time patrol progress and review task execution logs. Figure 12 displays a sample alarm report, highlighting anomalies such as overheating components and faulty indicator states, which were automatically flagged and logged in the system. Figure 13 shows an asset inventory snapshot, demonstrating the ability of the robot to track and update equipment status in real time. These examples illustrate the system’s capability to generate actionable insights, which underpins the operational efficiencies and anomaly detection performance described in the main text.
