Design and deployment of a large language model-enhanced autonomous robot for intelligent operation and maintenance in data centers

Abstract

The exponential growth of global data demands unprecedented reliability and efficiency of data center operations. Traditional manual inspection methods remain inefficient and error-prone. This article presents an autonomous mobile robot enhanced by large language models and hierarchical three-dimensional scene graphs for intelligent operation and maintenance. The core innovation lies in enabling semantic-aware navigation, allowing the robot to interpret high-level instructions like “inspect servers with the red alert light on in cold aisle B” by leveraging the three-dimensional scene graph for spatial-semantic reasoning and the large language model for task parsing. Deployed in multiple ultra-large data centers, the system has improved inspection efficiency by over 50%, autonomously patrolled over 10,000 km, and identified more than 1200 equipment anomalies. This work demonstrates a significant step towards fully autonomous, intelligent infrastructure management.

Keywords

Service robots, autonomous mobile robots, data center operation and maintenance, three-dimensional scene graphs, large language models, semantic navigation

Introduction

The rapid expansion of digital services, cloud computing, and edge applications has significantly increased the scale and complexity of data centers. These facilities now house tens of thousands of servers, storage systems, and network devices, requiring continuous monitoring and maintenance to ensure high availability, energy efficiency, and operational safety.^1,2 Traditional operation and maintenance (O&M) approaches, relying heavily on manual inspections and human intervention, are increasingly inadequate. They are labor-intensive, error-prone, and unable to meet 24/7 uptime demands. In response to these challenges, autonomous mobile robots have emerged as a promising solution to augment or replace human operators in performing repetitive, hazardous, or precision-critical tasks.^3,4 Equipped with multi-modal sensors such as RGB cameras, thermal imagers, and LiDAR (Light Detection and Ranging), these robots navigate complex indoor environments, collect data, and detect anomalies in real time.^5–7 However, most existing robotic systems for data centers are limited in scope—focusing on specific tasks such as temperature monitoring,^2,8 asset tracking,⁹ or simple anomaly detection.^1,6 Their key limitation is the lack of a unified and intelligent framework capable of understanding high-level semantic instructions and autonomously planning complex tasks. They navigate to pre-defined coordinates rather than interpreting commands like “go and do something.”

This article addresses this gap by presenting the design, implementation, and deployment of a large language model (LLM)-enhanced autonomous robot system tailored for intelligent O&M in large data centers. A key innovation is the integration of a hierarchical three-dimensional (3D) scene graph (3DSG) that provides rich spatial-semantic understanding of the environment, enabling the robot to reason about objects, regions, and their relationships. This is combined with an LLM that allows the robot to parse natural language instructions and convert them into actionable navigation as well as inspection plans. Our system supports a wide range of O&M tasks, including automated equipment inspection, environmental monitoring, and asset inventory management. It has been deployed across multiple data centers covering over 12,000 devices, accumulating more than 10,000 km of autonomous patrol mileage. To date, it has identified over 1200 equipment anomalies, demonstrating a more than 50% improvement in inspection efficiency compared to manual methods. This work represents a significant advancement toward fully autonomous, intelligent infrastructure management, with implications for future smart data centers and unmanned operations.

Related work

Robotic systems in data centers

The use of robotics in data centers has gained traction over the past decade,¹⁰ primarily for environmental monitoring and asset management.^2,8 Early work by IBM researchers introduced mobile robots for thermal mapping and anomaly detection.¹ These systems used basic sensors and navigation modules to collect temperature data and identify hotspots. Similarly, Mansley et al.⁵ presented a mobile robotic system for temperature and humidity sampling, using interpolation methods to reconstruct thermal maps and identify cooling inefficiencies. While effective, these systems were limited to reactive monitoring and lacked semantic understanding of the environment. More recent efforts have explored cloud-based robotic platforms for scalable monitoring.^6,11,12 For instance, Terrissa et al.⁶ proposed a cloud robotics architecture using ROS (robot operating system) for autonomous navigation and environmental sensing. Their system allowed remote users to dispatch robots to specific rooms for data collection. However, these platforms typically rely on predefined waypoints and lack the ability to interpret high-level task descriptions or adapt to dynamic instructions.

Navigation and scene understanding

Accurate navigation and environment mapping are critical for autonomous operation in complex indoor environments. Traditional SLAM (simultaneous localization and mapping) techniques have been widely used in data center robots,^13,14 but they often produce geometric maps that lack semantic information. Recent advances in 3D scene representation, such as 3DSGs, offer a promising direction by encoding both spatial and semantic relationships between objects and regions.^15–17 These hierarchical representations enable robots to reason about their environment at multiple levels of abstraction, facilitating more intelligent task planning and execution.

Human–robot interaction and task planning

Most existing robotic systems for data centers rely on pre-programmed tasks or low-level command interfaces.^6,10,18 This limits their flexibility and usability, especially in dynamic environments where tasks may change frequently. The integration of natural language processing (NLP) and LLMs into robotic systems has recently shown great potential for bridging this gap. LLMs can interpret ambiguous or context-rich instructions and translate them into structured task plans.¹⁹ This capability is especially valuable in data center settings, where operators may issue commands such as “check the overheating switch in row 5” without specifying exact coordinates or device IDs.

Asset tracking and anomaly detection

Several studies have focused on automating asset tracking and fault detection using mobile robots.¹⁰ For example, Nelson et al.⁹ demonstrated a vision-based robot capable of reading LED indicators and tracking assets using active radio frequency identification (RFID) tags. Others have used thermal imaging and optical cameras to detect anomalies such as overheating or equipment failures.^2,8 While these systems improve upon manual inspection, they often operate in isolation and lack integration with broader facility management systems.

System integration and deployment at scale

Few studies have reported large-scale, real-world deployments of autonomous robots in operational data centers. Most prototypes are tested in controlled lab environments or small-scale server rooms.^5–7 The scalability of hardware, software, and data processing pipelines remains a significant challenge. Our work addresses this gap by presenting a fully integrated system deployed across multiple data centers, supporting thousands of devices and extensive autonomous operation.

System design of the data center intelligent O&M robot

The robot is designed as a modular system, integrating hardware for environmental interaction and software for intelligent decision-making, aligned with service robot design principles for industrial environments. A visual overview of the robot platform is provided in Figure 1.

Figure 1.

Illustration of the autonomous mobile robot platform for intelligent data center operation and maintenance, showing its modular design including the sensor suite, robotic arm, and mobility base.

Hardware architecture

The inspection robot system is built upon a self-developed hardware platform, designed for high-precision autonomous operation in complex and equipment-dense server rooms. The hardware comprises three core modules: a differential-drive chassis for mobility, a 6-degree-of-freedom collaborative robotic arm for flexible manipulation, and a multi-modal sensor suite for environmental perception.

Main control unit

The system utilizes a high-performance X86 architecture industrial computer as the central processing core, running the ROS along with relevant navigation and positioning algorithm modules. The main control unit is responsible for multi-source sensor data fusion, task scheduling and management, robotic arm motion planning and control, inspection procedure execution, data communication with the cloud operations platform, and docking control with charging stations, enabling a fully intelligent workflow.

Motion unit

The two-wheel differential-drive chassis with hub motors enables flexible movement through narrow server aisles. High-precision closed-loop speed control ensures stability. The motion control board communicates in real-time with the main control unit via a CAN bus, receiving linear and angular velocity commands and providing feedback on the actual operational status of the chassis. Safety features include impact-sensitive strips for immediate stop on collision. Emergency stop buttons are located on both sides of the body. When pressed, they simultaneously lock the robotic arm and the drive wheels.
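As a minimal sketch of how the linear and angular velocity commands mentioned above map to individual wheel speeds on a two-wheel differential-drive base, the standard kinematic relation can be written as follows. The class name `DiffDrive` and the wheel base and wheel radius values are hypothetical placeholders, not the actual dimensions or firmware interface of the robot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffDrive:
    wheel_base_m: float    # distance between the two drive wheels (hypothetical value)
    wheel_radius_m: float  # drive wheel radius (hypothetical value)

    def wheel_speeds(self, v, omega):
        """Map a body command (v in m/s, omega in rad/s) to (left, right) wheel speeds in rad/s."""
        v_left = v - omega * self.wheel_base_m / 2.0
        v_right = v + omega * self.wheel_base_m / 2.0
        return v_left / self.wheel_radius_m, v_right / self.wheel_radius_m

    def body_velocity(self, w_left, w_right):
        """Inverse mapping, e.g. for reporting the actual chassis state back to the controller."""
        v_left = w_left * self.wheel_radius_m
        v_right = w_right * self.wheel_radius_m
        return (v_left + v_right) / 2.0, (v_right - v_left) / self.wheel_base_m


chassis = DiffDrive(wheel_base_m=0.4, wheel_radius_m=0.08)
left, right = chassis.wheel_speeds(v=0.5, omega=0.0)  # straight line: both wheels turn equally
```

Driving straight yields equal wheel speeds, while a nonzero angular velocity slows one wheel and speeds up the other by the same amount.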

Sensor unit

To meet the requirements for navigation, identification, and monitoring, the robot integrates various sensors to form a multi-layer perception system:

LiDAR (light detection and ranging) and IMU (9-axis inertial measurement unit): Two single-line LiDAR sensors are configured in a diagonal arrangement, enabling 360° environmental scanning and supporting high-precision SLAM mapping as well as real-time obstacle avoidance. An IMU communicates with the main controller via a serial port, providing attitude information to assist the navigation system with motion error compensation and trajectory correction.

RGB-D and HD cameras: Deployed at the end of the robotic arm or on the front of the chassis, these visual sensors support visual measurement collection, real-time video stream transmission, and scene perception.

Industrial camera: Mounted on the end of the robotic arm, and combined with AI vision algorithms, it enables image enhancement and intelligent identification of equipment indicator light status, meter readings, and scenarios involving mesh occlusion.

Infrared thermal imager: Also located at the end of the robotic arm, it is used to capture the surface temperature distribution of equipment like servers.

Robotic arm module

The robot is equipped with a 6-degree-of-freedom collaborative robotic arm installed on the top structure of the body. It possesses flexible operation and detection capabilities, allowing it to perform the following tasks:

Close-range image capture and visual identification.

High-power RFID tag reading and equipment identification.

Multi-modal equipment diagnosis (visual + infrared) by combining thermal imaging and visible-light cameras.

The aforementioned hardware modules are integrated through unified electrical and communication interfaces. Under the coordination of the main control system, they work together to support the comprehensive inspection functions of the robot in complex industrial scenarios.

Software architecture

The software architecture of the robot is designed for robust autonomy and is built upon a core onboard software system running on the ROS. This system integrates several key modules that execute directly on the X86 industrial computer of the robot. The navigation stack combines LiDAR-based SLAM¹³ with real-time path planning and dynamic obstacle avoidance, enabling precise movement through narrow server aisles. The multi-modal perception module is responsible for synchronizing and processing data streams from the cameras, thermal imager, and LiDAR. It runs specialized AI models for tasks such as equipment status recognition (e.g. LED state classification) and visual meter reading. Central to the intelligence of the system is the semantic perception module, which maintains a local instance of the hierarchical 3DSG, performing real-time object recognition and spatial reasoning to understand environmental context. Finally, the task execution engine sequences and manages the behaviors of the robot, orchestrating navigation commands, robotic arm maneuvers, and inspection routines to complete assigned patrols and tasks autonomously.

To address the demands of large-scale deployments, this core system can be seamlessly upgraded to a premium cloud-edge-end architecture. This enhanced configuration is implemented upon customer request or in scenarios requiring centralized management and advanced computational services. In this setup, the end layer consists of the robots themselves, performing the aforementioned core functions. The edge layer, deployed locally within each data center zone, hosts low-latency services such as a local task execution engine and real-time anomaly filtering, which reduces bandwidth usage and ensures operational responsiveness. The cloud layer then provides high-level services, including the LLM-based instruction parser for natural language tasking, the central 3DSG engine for maintaining a unified spatial-semantic model across the entire facility, and large-scale data analytics for long-term trend analysis and model retraining. This scalable architecture ensures that the system can evolve from a standalone intelligent agent to a fully integrated, fleet-wide management solution for the most complex operational environments.

This modular hardware and software architecture provides the foundation for the core intelligent behaviors of the robot. The following section details the key technologies that enable these capabilities, focusing on the hierarchical 3DSGs for environmental understanding and the LLM for high-level task reasoning.

Key technologies for the data center intelligent O&M robot

Hierarchical 3D scene graphs for spatial-semantic perception

A holistic understanding of the complex and critical data center environment is fundamental to enabling autonomous inspection and interaction by the robot. To address the limitations of traditional geometric mapping and enable context-aware inspection tasks, we design a data center-adapted hierarchical 3DSG representation that seamlessly integrates geometric, semantic, and relational data. This approach is inspired by our previous work,¹⁵ but is specifically tailored to address the unique challenges of data centers, such as the prevalence of structurally similar server racks, the importance of functional areas (e.g. hot aisle/cold aisle), as well as the need to identify specific assets for maintenance tasks and to respond to task-specific semantic requirements (e.g. server status monitoring and cabinet asset association). The 3DSG encodes both geometric precision and semantic context across hierarchical layers, enabling the robot to reason about its environment from individual devices to the entire data center.

Overall architecture of data center 3DSG

The hierarchical 3DSG for data center inspection is formally defined as $G = (V, E)$,^15,16 where $V = \bigcup_{k = 1}^{K} V_{k}$ denotes the nodes across $K = 5$ layers: fundamental, object, room, floor, and building. Edges connecting nodes within the same layer or adjacent layers are denoted by $E$. Each node $v_{k, j}$ (the $j$th node of the $k$th layer) is annotated with data center-specific attributes $C_{k, j}$ generated via multi-modal sensing and LLM reasoning. This architecture is tailored to two core needs of data center inspection: spatial-semantic alignment linking geometric positions to equipment semantics for precise task execution, and scalability accommodating both micro-tasks and macro-tasks (e.g. “patrol all server rooms on the third floor”) via layer-wise aggregation. The process for generating this hierarchical graph from raw sensor data is illustrated in the block diagram of Figure 2.

Figure 2.

Block diagram illustrating the generation process of the hierarchical three-dimensional scene graph (3DSG) from raw multi-modal sensor data to structured spatial-semantic layers.
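To make the formal definition $G = (V, E)$ more concrete, the following is a minimal sketch of a layered graph in which every node carries a layer index and an attribute set, and in which a macro-task such as "patrol all server rooms on the third floor" reduces to collecting descendants of one node. The class `SceneGraph`, its method names, and the node identifiers are illustrative assumptions. Only inter-layer edges are modeled here; same-layer edges are omitted for brevity.

```python
from dataclasses import dataclass, field

# Layer indices follow the five layers named in the text.
LAYERS = {1: "fundamental", 2: "object", 3: "room", 4: "floor", 5: "building"}


@dataclass
class SceneGraph:
    layer_of: dict = field(default_factory=dict)   # node id -> layer index k
    attrs: dict = field(default_factory=dict)      # node id -> attribute set C_{k,j}
    children: dict = field(default_factory=dict)   # node id -> ids one layer below

    def add_node(self, node_id, layer, **attributes):
        if layer not in LAYERS:
            raise ValueError(f"unknown layer {layer}")
        self.layer_of[node_id] = layer
        self.attrs[node_id] = dict(attributes)
        self.children.setdefault(node_id, set())

    def add_edge(self, parent, child):
        # Inter-layer edges are only allowed between adjacent layers.
        if self.layer_of[parent] - self.layer_of[child] != 1:
            raise ValueError("edge must link adjacent layers, parent above child")
        self.children[parent].add(child)

    def descendants(self, node_id, layer):
        """Collect all nodes of `layer` reachable downward from `node_id`."""
        found, stack = [], [node_id]
        while stack:
            current = stack.pop()
            if self.layer_of[current] == layer:
                found.append(current)
            elif self.layer_of[current] > layer:
                stack.extend(self.children[current])
        return sorted(found)


g = SceneGraph()
g.add_node("Building-1", 5)
g.add_node("Floor-3", 4)
g.add_node("server_room_301", 3, label="server_room")
g.add_node("server_room_302", 3, label="server_room")
g.add_edge("Building-1", "Floor-3")
g.add_edge("Floor-3", "server_room_301")
g.add_edge("Floor-3", "server_room_302")

# Macro-task scope: every room-layer node under the third floor.
rooms_on_floor_3 = g.descendants("Floor-3", layer=3)
```

Restricting edges to adjacent layers keeps the aggregation a simple downward traversal, which is one plausible reading of the layer-wise aggregation described above.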

Layer-by-layer construction for data center environments

As the bottom-most layer, Layer-1 establishes a high-precision spatial-semantic baseline for the robot, integrating geometric mapping with data center-specific semantic labels to enable collision-free navigation and upper-layer construction. The robot leverages its on-board multi-modal sensing unit including a 16-line LiDAR, an RGB-D camera, and an IMU to collect data tailored to data center constraints. A LiDAR-inertial odometry pipeline¹³ is adopted to estimate the robot’s pose with centimeter-level accuracy. This is critical for navigating through narrow cabinet aisles (typically 0.8–1.2 m wide) where small deviations could cause collisions with equipment. For semantic annotation, we fine-tuned the YOLOv8 network²⁰ on a custom data center dataset including over 10,000 images of servers, cabinets, switches, LED indicators, fire suppression devices, and so on. These images consist of open-source data and on-site captures from multiple data centers. The resulting model achieved a mean average precision of 92.6% on our test set, enabling reliable identification of data center-specific classes such as “1U server,” “red alarm LED,” and “UPS (uninterruptible power supply) unit.” These semantic labels and depth measurements from the RGB-D camera are fused into a semantically annotated 3D point cloud, which is transformed from the camera coordinate system to the global data center coordinate system and is also aligned with the data center infrastructure management (DCIM) system.
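The coordinate transformation at the end of this step can be sketched as a chain of homogeneous transforms. The example below assumes, purely for illustration, a planar robot pose from the odometry pipeline, a fixed camera mounting offset, and a camera frame whose axes are aligned with the robot frame; all numeric values and function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def planar_pose(x, y, yaw):
    """Transform for a robot standing at (x, y) with heading yaw on a flat floor."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, y], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def transform_point(t, p):
    """Apply a 4x4 transform to a 3D point."""
    x, y, z = p
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3] for i in range(3))


# Hypothetical values: robot pose in the global frame and camera mount in the robot frame.
T_global_robot = planar_pose(10.0, 5.0, math.pi / 2)
T_robot_camera = [[1.0, 0.0, 0.0, 0.2],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 1.1],
                  [0.0, 0.0, 0.0, 1.0]]
T_global_camera = matmul(T_global_robot, T_robot_camera)

# A labeled point observed 1 m in front of the camera keeps its label; only its coordinates change.
labeled_point = {"xyz": (1.0, 0.0, 0.0), "label": "red alarm LED"}
global_xyz = transform_point(T_global_camera, labeled_point["xyz"])
```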

The semantically labeled point cloud is integrated into a truncated signed distance field (TSDF) via ray-casting,^15,16 generating a voxel-based map optimized for data centers. This map encodes free space by marking accessible paths and restricted zones to guide the robot’s path planning. Moreover, it includes semantic voxel labeling where each voxel stores probabilistic semantic labels to handle occlusions common in dense equipment environments. The voxel map is then converted into a 3D mesh via the marching cubes algorithm.^15,16
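The voxel update behind such a map can be illustrated with the widely used weighted running-average TSDF rule, reduced here to a single ray for readability. This is a sketch under assumptions: the count-based label probabilities, the class `Voxel`, and all parameter values are ours and are not taken from the deployed pipeline.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Voxel:
    tsdf: float = 1.0      # truncated signed distance, normalized to [-1, 1]
    weight: float = 0.0    # accumulated observation weight
    label_counts: Counter = field(default_factory=Counter)

    def integrate(self, signed_distance_m, truncation_m, label=None, obs_weight=1.0):
        # Clamp to the truncation band, then fold into the weighted running average.
        d = max(-1.0, min(1.0, signed_distance_m / truncation_m))
        self.tsdf = (self.weight * self.tsdf + obs_weight * d) / (self.weight + obs_weight)
        self.weight += obs_weight
        if label is not None:
            self.label_counts[label] += 1

    def label_probabilities(self):
        total = sum(self.label_counts.values())
        return {k: v / total for k, v in self.label_counts.items()} if total else {}


def integrate_ray(voxels, voxel_size_m, measured_depth_m, truncation_m, label):
    """Update voxels along one ray; voxel i is centered at (i + 0.5) * voxel_size_m."""
    for i, voxel in enumerate(voxels):
        center = (i + 0.5) * voxel_size_m
        sd = measured_depth_m - center  # positive in front of the surface
        if sd < -truncation_m:
            break  # far behind the surface: occluded, leave untouched
        near_surface = abs(sd) <= truncation_m
        voxel.integrate(sd, truncation_m, label if near_surface else None)


ray = [Voxel() for _ in range(10)]
for observed_label in ("server", "server", "switch"):
    integrate_ray(ray, voxel_size_m=0.1, measured_depth_m=0.55,
                  truncation_m=0.2, label=observed_label)
surface_probs = ray[5].label_probabilities()
```

Voxels well in front of the surface converge to free space, voxels behind it stay unobserved, and conflicting labels near the surface are kept as a distribution rather than overwritten, which is one way to tolerate occlusion-induced mislabeling.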

Layer-2 models individual data center devices as graph nodes, enabling the robot to associate geometric features with inspection-relevant semantics such as server status and maintenance history. For servers, switches, cabinets, LED indicators, and other objects identified as critical devices, Layer-2 nodes are initialized by extracting point clouds from segmentation masks of Layer-1, with key data center-specific optimizations. Since servers are housed in standard 42U cabinets, point clouds are clustered by cabinet U-position to avoid merging adjacent devices. High-resolution RGB imaging is employed to segment small but critical objects like 5 mm-diameter LED indicators. Status details, such as the red color indicating an alarm and green corresponding to a normal state, are captured as well. Each node is assigned a unique identifier linked to the DCIM system for asset traceability.
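The U-position clustering can be sketched by binning points on their height within a cabinet, using the standard rack unit of 1.75 in (44.45 mm). The function names and the cabinet base height are hypothetical, and a device spanning several rack units would additionally need its slots merged, for instance using the device height recorded in the DCIM system.

```python
from collections import defaultdict

RACK_UNIT_M = 0.04445  # one rack unit (1U) is 1.75 in = 44.45 mm


def u_position(z_m, cabinet_base_z_m):
    """1-based U slot for a height measured from the lowest mounting position."""
    return int((z_m - cabinet_base_z_m) // RACK_UNIT_M) + 1


def cluster_by_u(points, cabinet_base_z_m, total_u=42):
    """Group (x, y, z) points by U slot so vertically adjacent devices are not merged."""
    clusters = defaultdict(list)
    for x, y, z in points:
        u = u_position(z, cabinet_base_z_m)
        if 1 <= u <= total_u:  # discard points below or above the mounting range
            clusters[u].append((x, y, z))
    return dict(clusters)


# Hypothetical points on the front face of one cabinet whose first slot starts at z = 0.10 m.
points = [(0.0, 0.0, 0.12), (0.1, 0.0, 0.13), (0.0, 0.0, 0.16), (0.0, 0.0, 5.0)]
clusters = cluster_by_u(points, cabinet_base_z_m=0.10)
```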

As depicted in Figure 2, each object node is augmented with two descriptors, namely a geometric descriptor and a visual-semantic descriptor to support semantic reasoning. A geometric descriptor is a fused point cloud $P_{2, j}^{(o)}$ aggregated from multiple mapping rounds to capture fine-grained device geometry. A visual-semantic descriptor, on the other hand, is generated via the CLIP model to describe device appearance.

To enrich nodes with task-critical semantics, a large vision language model (LVLM) and an LLM generate $C_{2, j}$, the attribute set for each node. Specifically for data centers, the following attributes are presented as examples:

Device state and operational status derived from visual cues, for example, “Server-SN-12345: red LED lit (alarm state),” “PDU 2 (power distribution unit): voltage 220 V (normal).”

Spatial predicates, mainly relationships with other devices, for example, “Server-SN-12345 is located in Cabinet-B12-05, 3U position,” “Switch-SW-678 is mounted above Server-SN-12345.”

Affordances, indicating action possibilities for inspection, for example, “Cabinet-B12-05: door can be opened by robotic arm,” “LED of Server-SN-12345 is visible from aisle 1 without occlusion.”

Asset metadata, linked to DCIM via API, for example, “Server-SN-12345: last maintenance 2024-03-15; responsible team: Infrastructure Group 2.”

Attributes are updated in real time. When the robot re-inspects a device, new LVLM observations are fed to the LLM to refresh $C_{2, j}$, ensuring up-to-date semantics. In addition, temperature, humidity, and noise levels captured by the onboard environmental sensors serve as continuous quantitative monitoring of ambient conditions and are incorporated into the respective node information as well.
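One possible in-memory shape for the attribute set $C_{2, j}$, covering the four attribute categories listed above plus the environmental readings, is sketched below. The field names and the `refresh` method are assumptions made for illustration; the example values reuse the examples given in the list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ObjectAttributes:
    state: dict = field(default_factory=dict)              # device state from visual cues
    spatial_predicates: list = field(default_factory=list) # relations to other devices
    affordances: list = field(default_factory=list)        # action possibilities
    asset_metadata: dict = field(default_factory=dict)     # fields mirrored from DCIM
    environment: dict = field(default_factory=dict)        # temperature, humidity, noise
    updated_at: str = ""

    def refresh(self, state=None, environment=None, now=None):
        """Overwrite only the fields touched by the latest re-inspection."""
        if state:
            self.state.update(state)
        if environment:
            self.environment.update(environment)
        self.updated_at = (now or datetime.now(timezone.utc)).isoformat()


server = ObjectAttributes(
    state={"led": "red", "alarm": True},
    spatial_predicates=["located in Cabinet-B12-05, 3U position"],
    affordances=["LED visible from aisle 1 without occlusion"],
    asset_metadata={"last_maintenance": "2024-03-15",
                    "responsible_team": "Infrastructure Group 2"},
)

# A later re-inspection reports a cleared alarm and an ambient temperature reading.
server.refresh(state={"led": "green", "alarm": False},
               environment={"temperature_c": 24.5},
               now=datetime(2024, 6, 1, tzinfo=timezone.utc))
```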

Data centers are organized into functional areas rather than traditional rooms, forming Layer-3 nodes of the 3DSG. We segment these areas by clustering the graph of places from Layer-1 using persistent homology.^15,16 The innovation lies in using an LLM to classify and annotate these areas based on the object nodes they contain. For a data center, typical labels include “server_room,” “cold_aisle,” “hot_aisle,” “power_room,” and “network_operation_center.” To ensure robust and accurate labeling, we employ a polling mechanism.¹⁵ The LLM is queried over multiple rounds with the set of objects in an area and a list of typical data center labels. An area node $v_{3, \ell}$ is only annotated with a label if the LLM consistently selects it across all polling rounds.¹⁵ This high-confidence strategy prevents mislabeling and effectively identifies multi-functional or transitional spaces (e.g. a corridor between aisles), which are instead described with a concise textual summary for downstream tasks.
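The polling rule can be sketched as a unanimity check over repeated queries. In the sketch below, `ask_llm` is a hypothetical callable standing in for the actual LLM query, and the number of rounds is an illustrative choice.

```python
from itertools import cycle

AREA_LABELS = ["server_room", "cold_aisle", "hot_aisle", "power_room",
               "network_operation_center"]


def poll_area_label(ask_llm, objects, candidate_labels, rounds=5):
    """Return a label only if every polling round selects the same valid label."""
    if rounds < 1:
        raise ValueError("at least one polling round is required")
    votes = [ask_llm(objects, candidate_labels) for _ in range(rounds)]
    first = votes[0]
    if first in candidate_labels and all(vote == first for vote in votes):
        return first
    return None  # ambiguous area: fall back to a textual summary instead


# Stub standing in for a model that answers consistently.
def consistent_llm(objects, labels):
    return "cold_aisle"


# Stub standing in for a model that wavers on a transitional space.
_wavering = cycle(["cold_aisle", "hot_aisle"])


def wavering_llm(objects, labels):
    return next(_wavering)


aisle_label = poll_area_label(consistent_llm, ["1U server", "cabinet"], AREA_LABELS)
corridor_label = poll_area_label(wavering_llm, ["cabinet"], AREA_LABELS)
```

Requiring agreement across all rounds trades recall for precision: an area with any disagreement is left unlabeled rather than risk a wrong functional label.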

For large data center campuses, upper layers represent entire floors and buildings. These nodes are formed by clustering areas with similar heights and are annotated by the LLM based on the constituent area nodes. This provides a scalable representation essential for campus-wide tasking and logistics.
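Grouping areas of similar height into floors can be illustrated with a one-dimensional gap clustering. The source does not specify the clustering method, so the gap threshold, the function name, and the area names and heights below are assumptions.

```python
def cluster_floors(area_heights, max_gap_m=1.0):
    """Group areas into floors: a new floor starts when the height jump exceeds max_gap_m."""
    ordered = sorted(area_heights.items(), key=lambda item: item[1])
    floors = []
    last_height = None
    for area, height in ordered:
        if last_height is None or height - last_height > max_gap_m:
            floors.append([])  # open a new floor cluster
        floors[-1].append(area)
        last_height = height
    return floors


# Hypothetical floor-plane heights (in meters) of four area nodes.
heights = {"cold_aisle_A": 0.0, "hot_aisle_A": 0.05,
           "power_room": 4.5, "server_room_2": 4.48}
floors = cluster_floors(heights)
```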

It is worth noting that this 3DSG representation is fundamentally different from traditional point cloud or occupancy grid maps. While those capture only geometry, the hierarchical structure of the 3DSG embeds semantic knowledge which forms the fundamental knowledge base that the LLM can reason over.

Data structure and computational analysis

To enable seamless interaction with the robot’s LLM-based task parser described in the next section and data center management systems, the 3DSG is represented as a NetworkX graph serialized in JSON format.^15,16 Key extensions for data centers include DCIM compatibility. Each node includes a “dcim-id” field, for example, “dcim-id: Server-SN-12345,” to link to the DCIM system, enabling queries like “retrieve maintenance history of Server-SN-12345.” Furthermore, nodes are tagged with task relevance such as “alarm-prone” and “high-priority” to prioritize patrol routes. Figure 6 provides an example of the resulting hierarchical 3DSG data structure, illustrating the nodes (e.g. floors, rooms, and servers) and their attributes such as server status and inspection history. It is described in detail in the subsequent section presenting experimental results.
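A JSON layout of this kind might look as follows. The sketch uses plain dictionaries in a node-link style (a list of nodes and a list of links) so that it runs with the standard library alone; the exact schema of the deployed system is not given in the text, so every key other than “dcim-id” and the two tag values is an assumption.

```python
import json

# Hypothetical node-link document: nodes carry attributes, links carry hierarchy.
graph = {
    "nodes": [
        {"id": "Floor-3", "layer": 4},
        {"id": "server_room_301", "layer": 3, "label": "server_room"},
        {"id": "Server-SN-12345", "layer": 2, "dcim-id": "Server-SN-12345",
         "tags": ["alarm-prone", "high-priority"],
         "status": "red LED lit (alarm state)"},
        {"id": "Switch-SW-678", "layer": 2, "dcim-id": "Switch-SW-678", "tags": []},
    ],
    "links": [
        {"source": "Floor-3", "target": "server_room_301"},
        {"source": "server_room_301", "target": "Server-SN-12345"},
        {"source": "server_room_301", "target": "Switch-SW-678"},
    ],
}


def find_by_dcim_id(doc, dcim_id):
    """Return the node linked to a given DCIM asset, or None."""
    return next((n for n in doc["nodes"] if n.get("dcim-id") == dcim_id), None)


def nodes_with_tag(doc, tag):
    """List node ids carrying a task-relevance tag, e.g. to prioritize a patrol route."""
    return [n["id"] for n in doc["nodes"] if tag in n.get("tags", [])]


serialized = json.dumps(graph, indent=2)  # what would be written to disk or sent to the parser
restored = json.loads(serialized)
priority_nodes = nodes_with_tag(restored, "high-priority")
```

Because the document is plain JSON, adding or updating an asset only touches the affected node and link entries.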

The construction and maintenance of the hierarchical 3DSG involve several computational stages, each with distinct resource requirements and scaling characteristics, as summarized in Table 1, measured on a per-frame basis during deployment. The main computing unit has an 11th Gen Intel Core i9-11900 processor (8 cores/16 threads, base frequency of 2.50 GHz), 32 GB of memory, and an NVIDIA Quadro RTX 4000 graphics card. Note that LLM and LVLM computations are offloaded to cloud APIs, imposing negligible local resource burden regardless of data center scale. The hierarchical architecture enables favorable scaling through two key mechanisms. First, incremental updates ensure that when new assets are added, only affected local regions require recomputation rather than the entire graph. As the 3DSG is stored as the structured JSON file mentioned above, updates are limited to node/attribute modifications, minimizing computing overhead for large-batch asset additions. Second, the asynchronous module design decouples processing stages: Layer 1 operations run continuously during deployment, while Layer 2 processing can be scheduled flexibly. Consequently, doubling the asset count increases the total computational load by less than 100%, demonstrating sub-linear scaling.

Table 1.

Computational performance of 3DSG construction stages.

Stage	Average time per frame	Resource utilization	Update
Semantic extraction + TSDF reconstruction	400 ms	CPU: 16.6%; Memory: 18 MB; GPU: 1104 MB	Incremental
Semantic voxelization + clustering	95 ms	CPU: 517.6%; Memory: 894 MB; GPU: 0 MB	On-demand and incremental
Sensor data processing (YOLOv8 during inspection)	40 ms	CPU: 16.6%; Memory: 18 MB; GPU: 866 MB	Extensible via fine-tuning

3DSG: three-dimensional scene graph; TSDF: truncated signed distance field; CPU: central processing unit; GPU: graphics processing unit.
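The sub-linear scaling statement preceding Table 1 can be read as follows. Suppose, purely as an illustrative assumption and not as a fitted model, that the total computational load follows a power law $L(N) = c N^{\alpha}$ in the asset count $N$. An increase of less than 100% when $N$ doubles means $L(2N) < 2 L(N)$, which constrains the exponent:

```latex
\frac{L(2N)}{L(N)} = \frac{c\,(2N)^{\alpha}}{c\,N^{\alpha}} = 2^{\alpha} < 2
\quad\Longleftrightarrow\quad \alpha < 1 .
```

Under this reading, any exponent below one is consistent with the reported behavior; the text does not state a specific value of $\alpha$.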

3DSG-enabled multi-modal perception fusion for anomaly detection

The efficacy of data center operations relies on the availability of comprehensive, contextualized, and accurate observational data. The primary function of our robotic system is to serve as a highly efficient mobile sensing platform, systematically gathering rich multi-modal data from the environment. The core innovation that enables this is the integration of diverse sensory data within the unifying framework of the hierarchical 3DSG. This approach transforms raw, high-volume sensor streams into a structured, spatially grounded, and semantically annotated information base, which is then made available for review and analysis by data center operators.

The hierarchical 3DSG acts as the central repository that organizes these disparate data streams. Instead of storing sensor readings as isolated data points, the system fuses them as dynamic, time-stamped attributes of the relevant nodes within the graph. This spatial and semantic grounding is the key advantage. For instance, a thermal image is not just a picture; it is intrinsically linked to the specific server rack node in the object layer of the graph. Similarly, a visual observation of the color of an indicator light is attached directly to the corresponding server node, and a reading from an analog meter is associated with its power distribution unit node.

The outputs from specialized AI perception modules—such as those for cabinet segmentation, signal light identification, and meter reading—are used to populate this unified graph with structured observations. These modules do not function as autonomous anomaly detectors but as sophisticated data annotators. For example, a model identifies a “red LED” state, and this observation is recorded as a semantic attribute of the corresponding device node. Similarly, a meter reading is digitized and stored as a numerical value associated with its node.
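The idea of recording perception outputs as time-stamped observations on graph nodes, and then searching them, can be sketched as below. The class names, the observation keys, and the second server identifier are hypothetical; ISO 8601 timestamps in a single format are assumed so that string comparison orders them correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    timestamp: str   # ISO 8601, same format for all records
    modality: str    # e.g. "visual", "thermal", "meter"
    key: str         # e.g. "led_state", "surface_temp_c"
    value: object


class ObservationStore:
    """Keeps the observation history of every node, keyed by node id."""

    def __init__(self):
        self._by_node = {}

    def record(self, node_id, observation):
        self._by_node.setdefault(node_id, []).append(observation)

    def latest(self, node_id, key):
        matches = [o for o in self._by_node.get(node_id, []) if o.key == key]
        return max(matches, key=lambda o: o.timestamp) if matches else None

    def search(self, key, predicate):
        """Node ids whose most recent value for `key` satisfies `predicate`."""
        hits = []
        for node_id in self._by_node:
            observation = self.latest(node_id, key)
            if observation is not None and predicate(observation.value):
                hits.append(node_id)
        return sorted(hits)


store = ObservationStore()
store.record("Server-SN-12345", Observation("2024-06-01T08:00:00", "visual", "led_state", "red"))
store.record("Server-SN-12345", Observation("2024-06-01T12:00:00", "visual", "led_state", "green"))
store.record("Server-SN-67890", Observation("2024-06-01T09:00:00", "visual", "led_state", "red"))
store.record("Server-SN-67890", Observation("2024-06-01T09:00:05", "thermal", "surface_temp_c", 71.5))

currently_red = store.search("led_state", lambda value: value == "red")
```

The store only annotates; deciding whether a red LED or a high temperature constitutes an anomaly is left to the operator, in line with the role of the perception modules described above.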

This integrated approach, centered on the 3DSG, culminates in a powerful data collection and presentation system. It empowers human operators with a holistic, multi-modal view of the state of the data center. Instead of sifting through disconnected alarm logs and video footage, operators are presented with a semantically searchable and spatially organized knowledge base. This significantly reduces the cognitive load in situational assessment, allowing them to make more informed and timely decisions based on a comprehensive set of contextualized observations collected by the robot.

LLM-enhanced intelligent navigation and task planning

The hierarchical 3DSG described in the previous section serves as a rich, structured knowledge base of the data center environment. To translate this static representation into dynamic, intelligent behavior, we leverage LLMs as a core reasoning engine. This LLM-enhanced approach allows the inspection robot to interpret complex natural language commands, perform semantic searches over the 3DSG, and generate logically sound navigation and task plans, moving beyond simple waypoint following to true context-aware autonomy.

From 3DSG to executable plans

Our navigation and task planning framework, illustrated in Figure 3, follows a two-stage process inspired by SayPlan¹⁹ and Cheng et al.²¹ but tailored for industrial inspection tasks. It consists of a semantic search stage and an action planning stage.

Figure 3.

Framework of the large language model (LLM)-enhanced intelligent navigation and task planning system, showing the two-stage process from natural language instruction to executable robot actions.

Given a high-level task instruction, for example, “Inspect the PDU in the server room adjacent to cold aisle A3,” the LLM is first tasked with a semantic search over a collapsed version of the 3DSG, as shown in Figure 3. This simplified graph contains only higher-level nodes. Then the LLM reasons about the task requirements to identify a relevant subgraph, a subset of the full 3DSG containing the nodes necessary to solve the task. For a data center, this involves understanding functional relationships, such as which server room contains the specified aisle and which PDU is associated with that room. Consequently, instead of processing the entire 3DSG at once, only the nodes belonging to the identified relevant subgraph pertinent to the task are provided in detail to the LLM for the action planning stage. This retrieval-augmented approach is key to applying our system in large-scale data centers.
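The two steps of the semantic search stage, collapsing the graph and then expanding only the selected nodes, can be sketched as follows. Here `select_nodes` is a hypothetical stand-in for the LLM call, and the graph layout, node names, and layer threshold are illustrative assumptions.

```python
def collapse(graph, min_layer):
    """Keep only higher-level nodes (layer >= min_layer) for the first LLM pass."""
    return {node: data for node, data in graph.items() if data["layer"] >= min_layer}


def expand(graph, selected):
    """Return the selected nodes together with all of their descendants."""
    keep, stack = set(), list(selected)
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(graph[node]["children"])
    return {node: graph[node] for node in keep}


def semantic_search(graph, instruction, select_nodes, min_layer=3):
    collapsed = collapse(graph, min_layer)
    # Discard ids that are not in the collapsed graph (guards against hallucinated nodes).
    chosen = [n for n in select_nodes(instruction, collapsed) if n in collapsed]
    return expand(graph, chosen)


graph = {
    "floor_1": {"layer": 4, "children": ["server_room_A", "server_room_B"]},
    "server_room_A": {"layer": 3, "children": ["PDU-1", "Server-SN-12345"]},
    "server_room_B": {"layer": 3, "children": ["PDU-2"]},
    "PDU-1": {"layer": 2, "children": []},
    "PDU-2": {"layer": 2, "children": []},
    "Server-SN-12345": {"layer": 2, "children": []},
}


def stub_selector(instruction, collapsed):
    # Stand-in for the LLM: one valid room id and one id that does not exist.
    return ["server_room_A", "server_room_Z"]


relevant = semantic_search(
    graph, "Inspect the PDU in the server room adjacent to cold aisle A3", stub_selector)
```

Only the nodes in `relevant` would be passed in detail to the action planning stage, which keeps the prompt size tied to the task rather than to the size of the facility.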

Once the relevant subgraph is identified, the LLM is prompted to generate a sequence of actionable steps. This includes navigational actions such as “goto_room,” “goto_aisle,” “goto_node” and inspection-specific actions such as “scan_rack_label,” “check_temperature_sensor,” and “inspect_PDU_status.” The plan is grounded in the spatial relationships and node attributes encoded in the 3DSG, ensuring feasibility.
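A generated plan can be checked against the allowed action set and the relevant subgraph before execution. The following sketch assumes a plan encoded as a list of action and target pairs; that encoding and the function name are hypothetical, while the action names are those listed above.

```python
ALLOWED_ACTIONS = {
    "goto_room", "goto_aisle", "goto_node",
    "scan_rack_label", "check_temperature_sensor", "inspect_PDU_status",
}


def validate_plan(plan, known_nodes):
    """Return a list of problems; an empty list means the plan is grounded."""
    errors = []
    for index, step in enumerate(plan):
        action, target = step.get("action"), step.get("target")
        if action not in ALLOWED_ACTIONS:
            errors.append(f"step {index}: unknown action {action!r}")
        if target not in known_nodes:
            errors.append(f"step {index}: target {target!r} not in subgraph")
    return errors


known_nodes = {"server_room_A", "PDU-1"}
good_plan = [
    {"action": "goto_room", "target": "server_room_A"},
    {"action": "goto_node", "target": "PDU-1"},
    {"action": "inspect_PDU_status", "target": "PDU-1"},
]
bad_plan = [{"action": "open_door", "target": "PDU-9"}]

good_errors = validate_plan(good_plan, known_nodes)
bad_errors = validate_plan(bad_plan, known_nodes)
```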

Prompt engineering and scoring mechanism for reliable data center operations

The reliability of LLM-based planning is crucial in a critical environment like a data center. We employ several strategies to enhance robustness.

First, the prompt templates are carefully engineered, as illustrated in Figure 7 of the section presenting performance evaluation results. It can be seen that the prompt template consists of several core components: a system role definition, that is, “You are a data center O&M robot agent,” the current 3DSG context, the task instruction from the user, a set of allowed actions, strict output format constraints, and, crucially, examples of successful task decomposition. This design effectively constrains the general reasoning capabilities of the LLM to the specific domain of robot task planning, significantly improving the stability and correctness of the generated plans.
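How the listed components might be assembled into a single prompt is sketched below. The actual template is the one shown in Figure 7; the section wording, ordering, and output format here are assumptions, and only the system role sentence is taken from the text.

```python
import json


def build_prompt(scene_context, instruction, allowed_actions, examples):
    """Assemble the prompt sections in a fixed order; `examples` is a list of (task, plan) pairs."""
    example_lines = [
        f"Task: {task}\nPlan: {json.dumps(plan)}" for task, plan in examples
    ]
    sections = [
        "You are a data center O&M robot agent.",
        "3DSG context (JSON):\n" + json.dumps(scene_context, sort_keys=True),
        "Allowed actions: " + ", ".join(allowed_actions),
        'Output format: a JSON list of {"action": ..., "target": ...} objects and nothing else.',
        "Examples:\n" + "\n\n".join(example_lines),
        "Task: " + instruction,
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    scene_context={"server_room_A": {"children": ["PDU-1"]}},
    instruction="Inspect the PDU in the server room adjacent to cold aisle A3",
    allowed_actions=["goto_room", "goto_node", "inspect_PDU_status"],
    examples=[("Check PDU-2", [{"action": "goto_node", "target": "PDU-2"},
                               {"action": "inspect_PDU_status", "target": "PDU-2"}])],
)
```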

To mitigate the risk of hallucination, a common issue for LLMs, we employ a scoring mechanism¹⁵ in which the LLM double-checks the rationality of its proposed action sequences. For instance, in the data center inspection scenario, before generating a plan to “inspect the backup PDU,” the LLM can be queried to validate whether the target PDU node indeed has an attribute like “function: backup.” This self-validation step enhances the credibility of the generated plans.
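A simplified sketch of such a check is given below. It combines a deterministic attribute lookup on the subgraph with a score returned by `score_fn`, a hypothetical callable standing in for the LLM self-assessment; the threshold value and the `requires` field are likewise assumptions and not part of the described system.

```python
def attribute_check(subgraph, node_id, required):
    """True if the node exists and carries every required attribute value."""
    if node_id not in subgraph:
        return False
    node = subgraph[node_id]
    return all(node.get(key) == value for key, value in required.items())


def accept_plan(plan, subgraph, score_fn, threshold=0.8):
    """Reject a plan whose preconditions fail or whose plausibility score is too low."""
    for step in plan:
        if not attribute_check(subgraph, step["target"], step.get("requires", {})):
            return False
    return score_fn(plan, subgraph) >= threshold


subgraph = {
    "PDU-1": {"function": "primary"},
    "PDU-2": {"function": "backup"},
}
plan_backup = [{"action": "inspect_PDU_status", "target": "PDU-2",
                "requires": {"function": "backup"}}]
plan_wrong_target = [{"action": "inspect_PDU_status", "target": "PDU-1",
                      "requires": {"function": "backup"}}]

accepted = accept_plan(plan_backup, subgraph, score_fn=lambda plan, graph: 0.9)
rejected = accept_plan(plan_wrong_target, subgraph, score_fn=lambda plan, graph: 0.9)
```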

Tailoring for data center inspection

The general LLM-based planning framework is specifically adapted to meet the stringent demands of data center operations, ensuring both performance and practicality. This tailoring is achieved through several key adaptations. First, the action lexicon of the robot is composed of task-oriented verbs that directly support its inspection mission, such as “record_thermal_image” and “verify_power_led,” moving beyond generic navigation commands to enable precise operational tasking. Furthermore, the reasoning capabilities of the LLM are honed to understand data center-specific functional layouts. It is prompted to comprehend the logical relationships between critical areas, such as the airflow sequence from cold aisles to server racks and then to hot aisles. This spatial understanding enables the generation of highly efficient inspection paths, for example, a complete thermal survey route that minimizes redundant travel. Finally, the inherent scalability of the hierarchical 3DSG is fully leveraged. The system scales from managing a single server room to encompassing an entire data center hall or multi-building campus. The LLM can efficiently narrow its semantic search to a specific floor or functional zone within the graph, ensuring that the planning process remains robust and computationally manageable even in vast, complex environments.
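To make the notion of a route with little redundant travel concrete, the sketch below orders inspection stops with a greedy nearest-neighbor heuristic. This is a stand-in chosen for brevity, not the route generation used by the system (which is produced by the LLM), and it is not guaranteed to be optimal; the stop names and coordinates are hypothetical.

```python
import math


def greedy_route(start, stops):
    """Order stops by repeatedly visiting the nearest unvisited one (heuristic, not optimal)."""
    remaining = dict(stops)
    position, route = start, []
    while remaining:
        name = min(remaining, key=lambda n: math.dist(position, remaining[n]))
        position = remaining.pop(name)
        route.append(name)
    return route


# Hypothetical 2D positions in metres.
stops = {
    "cold_aisle_B": (10.0, 5.0),
    "hot_aisle_A": (0.0, 7.0),
    "cold_aisle_A": (0.0, 5.0),
}
route = greedy_route((0.0, 0.0), stops)
```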

Deployment and performance validation

The deployment of the proposed autonomous robot spans a number of data centers with diverse structures, housing over 12,000 devices in total and covering a total floor area of 50,000 m$^{2}$. The system has been operational for 12 months, during which the robot fleet accumulated over 10,000 km of autonomous patrol mileage. In the remainder of this section, we first present the deployment of the robot in a data center belonging to one of our biggest clients as an example, demonstrating a complete operational loop, from environment perception and model building to intelligent task planning, physical execution, and result integration. We then present extensive results of LLM-enhanced navigation and task planning based on 3DSGs.

Deployment of the autonomous robot in a large data center

This section describes the deployment of the autonomous robot in a large, multi-room data center facility of a client of ours. The data center comprises three server rooms, each containing approximately 200 server cabinets. The facility required daily inspection across two shifts to ensure operational stability and asset security. Key challenges included the high density of equipment, the need for accurate asset tracking, and the labor-intensive nature of manual inspections.

The deployment process began with the robot performing an initial mapping round. Utilizing its LiDAR-inertial odometry pipeline and fine-tuned perception models, the robot constructed a centimeter-accurate geometric map while simultaneously segmenting and classifying critical objects. This raw perceptual data was then fused into the hierarchical 3DSG, establishing the layers as presented in previous sections. The semantically annotated voxel map of Layer-1 enabled collision-free navigation, while Layer-2 instantiated nodes for individual devices, enriching them with geometric and visual-semantic descriptors. An LLM was subsequently employed to annotate Layer-3 functional areas based on the objects contained within them, creating a comprehensive spatial-semantic model of the environment.

With the 3DSG serving as a dynamic knowledge base, the system demonstrated its capability for intelligent task execution. Operators could issue high-level natural language commands, which the LLM-based instruction parser would interpret. The parser performed a semantic search over the 3DSG to identify the relevant subgraph and reason about the relationships between areas and equipment. It then generated a grounded, executable plan comprising a sequence of navigational and inspection-specific actions.

During task execution, the robot autonomously navigated to the specified locations using the 3DSG for context-aware path planning. Upon reaching a target, it used its multi-modal sensors to perform the required inspection actions. The robotic arm positioned the thermal imager and high-resolution camera to capture data. The entire process, from navigation to sensor deployment, was monitored in real time through our cloud-edge platform, as illustrated in Figure 4. (The online platform used by the robot system is currently implemented in Chinese, as all deployments of the robot to date have been domestic. For the purpose of this article, English annotations have been added to the example outputs presented in this section and in the Appendix.) Real-time sensor readings were processed by onboard AI modules, and the results, such as temperature readings and equipment statuses, were immediately fused back into the corresponding nodes in the 3DSG as updated time-stamped attributes.

Figure 4.

Real-time monitoring interface of the cloud-edge platform, displaying the current task execution of the robot, navigation status, and sensor data streams during an inspection patrol.

This integrated workflow yielded significant operational results. The robots successfully identified a substantial number of equipment anomalies, such as overheating components and faulty indicator states, which were immediately flagged in the system for operator review. A screenshot exemplifying the final output, presenting structured anomaly reports and multi-modal data (e.g. thermal images and alarm light status) for operator review, is shown in Figure 5 (cf. the Appendix for more examples of the inspection results). The inspection frequency increased dramatically compared to manual methods. Prior to robot deployment, manual inspections were conducted twice per day. With the autonomous robot system, inspection frequency increased to a maximum of 12 times per day, a sixfold increase in patrol frequency. This ensures more timely anomaly detection and continuous environmental monitoring. Furthermore, the automation of data collection and its direct integration into the semantically structured 3DSG eliminated manual report compilation, drastically reducing administrative overhead. Previously, compiling monthly statistical reports took approximately 4 days of manual effort. With the automated system, reports are now generated on demand and in real time, reducing report preparation time by over 75%. We define the inspection efficiency gain as the percentage reduction in the time required to complete an inspection route and compile the report relative to manual inspection, averaged over multiple trials across different data centers:

$$
\mathrm{Efficiency\;Gain} = \frac{1}{N_{trials}} \sum_{i = 1}^{N_{trials}} \frac{T_{manual}^{(i)} - T_{robot}^{(i)}}{T_{manual}^{(i)}} \times 100\% \qquad (1)
$$

where $T_{manual}^{(i)}$ and $T_{robot}^{(i)}$ represent the time required to complete the inspection route and report compilation for the $i$th inspection trial by manual inspection and by our proposed data center inspection robot, respectively. Based on data observed and retrieved from $N_{trials} = 120$ inspection trials across three data centers, an efficiency gain of 51.6% is obtained.
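Equation (1) is a mean of per-trial relative savings, not the relative saving of the total time, so short and long routes carry equal weight. A minimal sketch of the computation is given below; the timing values are made up for illustration and are not taken from the 120 trials reported here.

```python
def efficiency_gain(manual_times, robot_times):
    """Mean per-trial relative time saving in percent, following Equation (1)."""
    if len(manual_times) != len(robot_times) or not manual_times:
        raise ValueError("need two non-empty lists of equal length")
    savings = [(m - r) / m for m, r in zip(manual_times, robot_times)]
    return 100.0 * sum(savings) / len(savings)


# Illustrative times in minutes for two trials (not measured data).
manual = [100.0, 200.0]
robot = [50.0, 80.0]
gain = efficiency_gain(manual, robot)  # (0.5 + 0.6) / 2 * 100
```

For these two illustrative trials the per-trial savings are 50% and 60%, giving a gain of 55%, whereas the saving on the pooled time would be (300 − 130)/300 ≈ 56.7%.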

Figure 5.

Representative inspection results output, showing structured anomaly reports with corresponding thermal images and visual evidence for operator review (bottom).

LLM-enhanced task planning based on 3DSGs

To quantitatively evaluate the performance of our LLM-enhanced navigation and task planning system, we conducted extensive experiments across three representative data center scenarios. An example of the data structure of the hierarchical 3DSGs is depicted in Figure 6 (a visualization of the 3DSG in the form of a semantically annotated mesh cannot be included in this article due to confidentiality agreements and the sensitive nature of the data centers; Figure 6 provides a schematic representation of the data structure instead). The structure expands from the “Floor” layer through the “Room” layer down to the “Device” layer. Each room node and server equipment node in the 3DSG has its own node attributes. Room node attributes include the room name, serial number, location, parent floor, room area, and node information. The location information is obtained from geometric mapping, and the node information is extracted by the LLM based on the characteristics of the room. Server node attributes include the device name, serial number, location, serial number of the parent room, DCIM serial number, detection history record, latest detection details, and voltage status. The detection history record logs the inspection results for the server, where “0” represents normal operation and “1” represents abnormal operation. The latest detection details contain two elements describing the last inspection of the device. The first element is the anomaly level: “1” for minor warning, “2” for serious warning, and “3” for critical warning. The second element is the processing status of the anomaly, updated in real time: “0” for unresolved and “1” for resolved. An empty list “[]” indicates that the device was operating normally during the last inspection. The voltage status records the server voltage during the last inspection, where “T” means normal voltage and “F” means abnormal voltage.
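The server node encoding described above can be sketched as a small data class. The field names, the example values, and the helper methods are hypothetical; only the codes (0/1 history entries, anomaly levels 1 to 3, processing status 0/1, the empty list for a normal device, and “T”/“F” for voltage) follow the description.

```python
from dataclasses import dataclass, field

ANOMALY_LEVELS = {1: "minor warning", 2: "serious warning", 3: "critical warning"}
PROCESSING_STATUS = {0: "unresolved", 1: "resolved"}


@dataclass
class ServerNode:
    device_name: str
    serial_number: str
    location: tuple
    room_serial_number: str
    dcim_serial_number: str
    detection_history: list = field(default_factory=list)  # 0 = normal, 1 = abnormal
    latest_detection: list = field(default_factory=list)   # [] or [level, status]
    voltage_status: str = "T"                              # "T" normal, "F" abnormal

    def describe_latest(self):
        """Decode the latest-detection pair into a readable phrase."""
        if not self.latest_detection:
            return "normal at last inspection"
        level, status = self.latest_detection
        return f"{ANOMALY_LEVELS[level]}, {PROCESSING_STATUS[status]}"

    def abnormal_count(self):
        """Number of past inspections that found the server abnormal."""
        return sum(self.detection_history)


# Hypothetical server used only for illustration.
server = ServerNode("Server A1", "SN-0001", (12.5, 3.0, 1.2), "R-01", "DCIM-0001",
                    detection_history=[0, 0, 1, 0], latest_detection=[2, 0])
healthy = ServerNode("Server A2", "SN-0002", (12.5, 4.0, 1.2), "R-01", "DCIM-0002")
```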

Figure 6.

Schematic representation of the hierarchical three-dimensional scene graph (3DSG) data structure for data center environments, illustrating node relationships and attributes across floor, room, and device layers.

The prompt structure for the task planning used in the experimental setup is illustrated in Figure 7. It encompasses the role of the agent, the task instruction, the 3DSG, the robot location, inspection action commands, the output format, and an example. The agent role defines the inspection tasks for the data center O&M robot agent. The task instruction is the inspection command issued by the user. The location is the real-time position of the data center O&M robot, obtained by a re-localization algorithm. The inspection action commands showcase the inspectable actions that the LLM can plan, including going to a room or floor, approaching server equipment, scanning device labels, checking temperature, detecting power, and checking PDU status. The output format defines the specification for the output of the LLM, containing step-by-step inspection action commands and a chain of thought. The example provides a sample inspection task to guide the task reasoning of the LLM.

Figure 7.

Example prompt structure used for LLM-based task planning, showing the components including agent role, task instruction, 3DSG context, and output format specification. LLM: large language model; 3DSG: three-dimensional scene graph.
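A prompt with the components listed for Figure 7 could be assembled along the following lines. This is a sketch under assumptions: the section labels, the wording of the output format instruction, and the function `build_prompt` are hypothetical, and the actual template wording is the one shown in Figure 7.

```python
import json

# Order of the prompt components; the label names are illustrative.
PROMPT_SECTIONS = ("role", "task", "scene_graph", "robot_location",
                   "allowed_actions", "output_format", "example")


def build_prompt(task, scene_graph, robot_location, allowed_actions, example):
    """Join the prompt components into one labelled text block."""
    sections = {
        "role": "You are a data center O&M robot agent.",
        "task": task,
        "scene_graph": json.dumps(scene_graph, indent=2),
        "robot_location": robot_location,
        "allowed_actions": ", ".join(allowed_actions),
        "output_format": "Give a chain of thought, then numbered action commands.",
        "example": example,
    }
    return "\n\n".join(f"[{name}]\n{sections[name]}" for name in PROMPT_SECTIONS)


prompt = build_prompt(
    task="Inspect the PDU in server_room_1.",
    scene_graph={"id": "floor_1", "rooms": ["server_room_1", "server_room_2"]},
    robot_location="server_room_2",
    allowed_actions=["goto_room", "goto_node", "inspect_PDU_status"],
    example="1. goto_room server_room_1",
)
```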

A key aspect of our validation was to assess the impact of enriched semantic node information on the planning reliability of the LLM. As summarized in Table 2, for each test scenario, the LLM was leveraged to analyze the 3DSG and generate distinctive node information for each room or area, such as the number of servers and their historical inspection patterns. This high-level summary enabled the LLM to reason about user intent and identify the most relevant areas for a given task.

Table 2.

Room node information generated by the large language model (LLM) for the three experimental scenarios, showing server counts and historical inspection patterns used for semantic reasoning.

Scene	Room	No. of servers	Node information
Scenario 1	server_room_1	48	Historical inspection records show four inspections. This room has the highest number of servers.
	server_room_2	38	Historical inspection records show five inspections. Server count is between the other two rooms.
	server_room_3	8	Historical inspection records show eight inspections. This room has the fewest servers.
Scenario 2	server_room_west	8	Contains eight servers and has undergone six historical inspections.
	server_room_east	48	Contains 48 servers and has undergone seven historical inspections.
	server_room_center	12	Contains 12 servers, with the fewest historical inspections (four).
Scenario 3	server_room_north	36	Contains server models ending in 1–6 (e.g. Server A1 and Server B2); seven historical inspection records.
	server_room_south	36	Contains server models ending in 7–12 (e.g. Server A7 and Server B8); five historical inspection records.

The semantic search and task planning results across the three scenarios are presented in Figures 8 to 10, respectively.

Figure 8.

Semantic search and task planning results for Scenario 1, comparing planning performance with and without access to semantic node information for batch inspection and historically contextual inspection commands.

Figure 9.

Semantic search and task planning results for Scenario 2, demonstrating the capability of the system to handle both straightforward and historically contextual inspection commands.

Figure 10.

Semantic search and task planning results for Scenario 3, showing robust performance in navigating to specific server locations and executing detailed inspection routines.

In Scenario 1, as shown in Figure 8, for a batch inspection command across the entire area, the LLM planner without access to node information frequently required multiple iterative searches or failed entirely by targeting incorrect rooms. In contrast, when provided with node information, the LLM swiftly identified the room most closely associated with the user’s command and generated a correct, efficient sequence of inspection actions. In Scenario 2 (cf. Figure 9), the LLM could handle straightforward commands with or without node information, but the semantics became critical for more specific instructions. For instance, commands targeting servers with particular historical records could only be successfully planned and executed when the LLM could reference the relevant node attributes to locate the correct devices. Finally, in Scenario 3 (cf. Figure 10), the robot succeeded in all tasks by navigating to specific server locations and performing detailed inspections, demonstrating the robustness of the system when fully leveraging the semantically rich 3DSG.

These experiments conclusively demonstrate that the integration of the hierarchical 3DSG with the LLM-based planner is crucial for achieving reliable, context-aware autonomy. The semantic node information directly enables the understanding of complex, ambiguous user commands, allowing the system to fulfill diverse inspection needs that would be infeasible with a purely geometric map or a planner lacking semantic reasoning capabilities.

Furthermore, we compare our proposed hierarchical 3DSG-based method with a single-layer scene graph-based baseline.¹⁷ Using the data center scenario shown in Figure 8, we have conducted experiments with the first two representative task instructions in this example. Each method executed each instruction 50 times. We measured three metrics:

Average token consumption: Reflecting computational efficiency and LLM usage cost.

Success rate: Measuring task completion reliability.

Average LLM processing time: Indicating response latency.

The corresponding results are summarized in Table 3. The proposed method consumes significantly fewer tokens (79% reduction for Instruction 1, 46% reduction for Instruction 2) thanks to the hierarchical structure, which enables iterative subgraph retrieval: only relevant nodes are expanded for detailed reasoning, so the entire graph never has to be processed. This feature makes the proposed approach especially suitable for large-scale data center scenarios. In addition, our method achieves substantially higher success rates (100% versus 80% for Instruction 1, and 90% versus 30% for Instruction 2).

Table 3.

Performance comparison between the proposed hierarchical three-dimensional scene graph (3DSG)-based method and a single-layer scene graph-based baseline.

Method	Instruction 1 of Figure 8			Instruction 2 of Figure 8		
	Average tokens	Average time (s)	Success rate (%)	Average tokens	Average time (s)	Success rate (%)
Single-layer scene graph¹⁷	6025	12.029	80	6032	11.368	30
Our method	1245	12.556	100	3278	14.564	90

The hierarchical representation allows the LLM to perform targeted semantic search across layers, whereas the single-layer scene graph overwhelms the LLM with flat and unstructured node relationships, leading to confusion and planning failures. Because of its multi-round iterative search, the proposed method has a higher average processing time (by about 0.5 s for Instruction 1 and 3.2 s for Instruction 2), but this trade-off yields markedly improved reliability and token efficiency. The response latency is acceptable for data center inspection tasks. It is also much lower than in the alternative workflow, in which an operator identifies inspection spots manually by searching several data sources, including the data center layout and historical inspection records, and then programs the resulting inspection tasks into the robot with the spatial information taken into account.
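The token reductions and latency differences quoted for Table 3 can be reproduced directly from the table entries, as the short calculation below shows. The numbers are copied from Table 3; the helper name `reduction_percent` is ours.

```python
def reduction_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# (baseline, proposed) pairs taken from Table 3.
tokens = {"Instruction 1": (6025, 1245), "Instruction 2": (6032, 3278)}
times_s = {"Instruction 1": (12.029, 12.556), "Instruction 2": (11.368, 14.564)}

token_savings = {name: round(reduction_percent(b, p), 1)
                 for name, (b, p) in tokens.items()}
extra_latency_s = {name: round(p - b, 3) for name, (b, p) in times_s.items()}
```

The unrounded token savings are 79.3% and 45.7%, consistent with the 79% and 46% stated in the text, and the added latency is 0.527 s and 3.196 s for the two instructions.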

Failure mode analysis and safety considerations

In this subsection, we present a systematic analysis of the failure modes observed during 12 months of deployment across multiple data centers, covering over 1000 task executions. We group failures into three primary categories according to their origin. Perception failures, which occur when the robot misdetects or misclassifies objects, account for approximately 4.2% of inspection cycles. Planning failures, in which the LLM generates invalid or suboptimal action sequences, occur in about 3.5% of task executions. Navigation failures, in which the robot cannot reach its target due to dynamic obstacles or localization errors, form the smallest category at 1.8% of patrol missions. Table 4 summarizes the three categories together with their observed frequencies and primary impacts.

Table 4.

Summary of failure cases.

Failure category	Description	Observed frequency	Primary impact
Perception failures	Errors in object detection, classification, or state estimation	4.2% of inspection cycles	Incorrect or missed anomaly detection
Planning failures	Large language model (LLM)-generated invalid, inefficient, or unsafe action sequences	3.5% of inspection cycles	Task interruption or manual intervention required
Navigation failures	Inability to reach target location due to dynamic obstacles, localization errors, or path planning issues	1.8% of inspection cycles	Incomplete inspection routes

Since planning failures are most directly relevant to the LLM-based components of our system, we provide a detailed breakdown of their subcategories. Action sequencing errors, in which operations are performed in the wrong order, are the most common planning failure at 1.6% of complex tasks, whereas missing steps in multi-step tasks account for 0.7%. Target misidentification, in which the robot selects the wrong device or location based on ambiguous instructions, occurs in 0.9% of tasks. Furthermore, hallucinations referencing non-existent devices or locations occur in 0.3% of cases. Together, these four subcategories add up to the overall planning failure rate of 3.5%. The scoring mechanism, which performs self-consistency checks through multiple independent LLM queries, achieves a 92% detection rate for planning failures overall.
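A self-consistency check over several independent LLM queries can be sketched as a simple agreement score. The aggregation rule below (share of samples that match the most frequent plan) and the acceptance threshold are assumptions for illustration; the text does not specify how the scoring mechanism combines the queries. The sampled plans are hypothetical.

```python
from collections import Counter


def self_consistency_score(candidate_plans):
    """Return (most_common_plan, agreement), agreement being the share of matching samples."""
    if not candidate_plans:
        raise ValueError("need at least one candidate plan")
    counts = Counter(tuple(plan) for plan in candidate_plans)
    best, votes = counts.most_common(1)[0]
    return list(best), votes / len(candidate_plans)


# Three hypothetical plans sampled for the same instruction.
plan_a = [("goto_room", "server_room_1"), ("inspect_PDU_status", "pdu_01")]
plan_b = [("inspect_PDU_status", "pdu_01"), ("goto_room", "server_room_1")]
samples = [plan_a, plan_b, plan_a]

consensus, agreement = self_consistency_score(samples)
ACCEPT_THRESHOLD = 0.6  # hypothetical; below this the plan would be flagged
accepted = agreement >= ACCEPT_THRESHOLD
```

In this example the sample with the wrong step order (`plan_b`) is outvoted, which is how a sequencing error in a single query would be caught.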

In addition to planning, LLMs are employed in the generation of 3DSGs. Therefore, we specifically analyzed labeling accuracy during 3DSG construction. We evaluated 80 zones across multiple data centers by comparing LLM-generated labels against human-annotated ground truth. The system achieved an annotation accuracy of 92.5%. Thanks to the polling mechanism, zones with low confidence scores are identified and receive generic labels such as “aisle.” They are also flagged for operator review during initial mapping.

During execution, real-time monitoring provides continuous protection. LiDAR-based collision detection can trigger emergency stops if obstacles enter safety margins. Progress monitoring applies timeout and retry logic to each action, with failures triggering replanning or fallback. Localization verification continuously validates robot pose against the map, pausing for relocalization if confidence drops. Health monitoring checks sensor and actuator status, aborting missions if critical failures occur.
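The timeout and retry logic applied to each action can be sketched as below. This is a simplified illustration with hypothetical names: the duration of an attempt is checked only after the action returns, so the sketch does not interrupt a hanging action, which a real executor would have to do.

```python
import time


def run_with_retry(action, timeout_s, max_attempts):
    """Call `action()` up to `max_attempts` times.

    An attempt counts as successful only if it returns True within `timeout_s`.
    Returns ("ok", attempt_number) or ("replan", max_attempts).
    """
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            succeeded = bool(action())
        except Exception:
            succeeded = False  # treat any error as a failed attempt
        elapsed = time.monotonic() - started
        if succeeded and elapsed <= timeout_s:
            return "ok", attempt
    return "replan", max_attempts


# Hypothetical action that succeeds on its third call.
calls = {"n": 0}


def flaky_scan():
    calls["n"] += 1
    return calls["n"] >= 3


recovered = run_with_retry(flaky_scan, timeout_s=5.0, max_attempts=3)
exhausted = run_with_retry(lambda: False, timeout_s=5.0, max_attempts=2)
```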

When failures occur despite these preventive measures, our fallback protocols provide graceful degradation. If LLM plan validation fails, the system switches to a rule-based planner with predefined inspection patterns. If navigation is persistently blocked by dynamic obstacles, the system replans the path and, if unsuccessful, requests operator assistance. If sensors fail during inspection, the system skips affected tasks, continues with remaining tasks, and logs the issue for maintenance. If communication with the cloud is lost, the system completes the current task using cached patterns before returning to home. For critical safety violations, the system executes an immediate emergency stop and notifies the operator.
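The fallback protocols above amount to a mapping from failure type to an ordered list of responses, which can be sketched as a lookup table. The identifiers are hypothetical labels for the behaviors described in the text, and defaulting to an emergency stop for an unrecognized failure is our own conservative assumption.

```python
FALLBACKS = {
    "plan_validation_failed": ["switch_to_rule_based_planner"],
    "navigation_blocked": ["replan_path", "request_operator_assistance"],
    "sensor_failure": ["skip_affected_tasks", "continue_remaining_tasks",
                       "log_for_maintenance"],
    "cloud_connection_lost": ["complete_current_task_from_cache", "return_home"],
    "critical_safety_violation": ["emergency_stop", "notify_operator"],
}


def fallback_for(failure):
    """Return the ordered fallback steps; unknown failures get the safest response."""
    return FALLBACKS.get(failure, FALLBACKS["critical_safety_violation"])


blocked = fallback_for("navigation_blocked")
unknown = fallback_for("unexpected_condition")
```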

Conclusion

This article has detailed the design, key technologies, and real-world validation of an LLM-enhanced autonomous robot system for intelligent data center O&M. We demonstrated that the integration of a hierarchical 3DSG with LLMs creates a powerful foundation for context-aware robotic autonomy in complex, mission-critical environments. The 3DSG provides a unified spatial-semantic knowledge base, while the LLM serves as a cognitive engine to interpret high-level human instructions as well as to generate robust and executable plans. The effectiveness of the system has been proven through extensive large-scale deployment. Operating across data centers with more than 12,000 devices, our robot fleet successfully navigated over 10,000 km autonomously, detected and reported more than 1200 anomalies, and increased inspection efficiency by over 50%. As quantitatively analyzed in the previous section, the semantic reasoning enabled by the 3DSG was crucial for reliably completing a wide range of inspection tasks, from specific device checks to area-wide patrols.

This work validates the potential of the integration of LLMs and 3DSGs to transform traditional data center O&M practices, paving the way for truly intelligent, adaptive, and scalable infrastructure management systems. Future work will focus on multi-robot collaboration, predictive maintenance using the collected time-series data, and closing the loop with automated remediation actions. In addition, further evaluations will be conducted on perception robustness across diverse data center environments, strengthening overall system reliability for mission-critical deployments.

Footnotes

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by Key R&D Program of Shandong Province, China, under Grant 2024CXGC010213, and Key R&D Program of Shandong Province, China, under Grant No.2023CXPT094.

Declaration of competing interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

ORCID iD

Yao Cheng

Appendix: Further examples of inspection results

To complement the results presented in the main text, we include additional examples of the inspection outputs generated by the autonomous robot system. Figure 11 illustrates the task management interface, where operators can monitor real-time patrol progress and review task execution logs. Figure 12 displays a sample alarm report, highlighting anomalies such as overheating components and faulty indicator states, which were automatically flagged and logged in the system. Figure 13 shows an asset inventory snapshot, demonstrating the ability of the robot to track and update equipment status in real time. These examples illustrate the system’s capability to generate actionable insights, which underlies the operational efficiencies and anomaly detection performance described in the main text.

References

1. McIntosh, Kephart, Lenchner, et al. Semi-automated data center hotspot diagnosis. In: 7th international conference on network and service management.

2. Chen, Tan, Wang, et al. A high-fidelity temperature distribution forecasting system for data centers. In: IEEE 33rd real-time systems symposium.

3. Licardo, Domjan, Orehovački. Intelligent robotics: A systematic review of emerging technologies and trends. Electronics 13.

4. Gao. Research progress on autonomous operation of industrial robots based on big data and machine vision. Comput Aided Des Appl 2025; 22(S9): 236–249.

5. Mansley, Connell, Isci, et al. Robotic mapping and monitoring of data centers. In: IEEE international conference on robotics and automation.

6. Terrissa, Ayad, Zerhouni. Robotics based solution for data center e-monitoring. In: International conference on advanced systems and emergent technologies (ICASET).

7. Hong, Sarantopoulos, Hogg, et al. Self-maintaining [networked] systems: The rise of datacenter robotics! In: 23rd ACM workshop on hot topics in networks.

8. Qin, Fang, Wang. A mobile robotic system for data center thermal environment measurement and reconstruction. In: 2021 China Automation Congress (CAC).

9. Nelson, Santala, Lenchner, et al. Locating and tracking data center assets using active RFID tags and a mobile robot. In: 10th international conference and expo on emerging technologies for a smarter world (CEWIT).

10. Levy, Subburaj. Emerging trends in data center management automation. In: IEEE 11th annual computing and communication workshop and conference (CCWC).

11. Warabino, Suzuki, Miyazawa. ROS-based robot development toward fully automated network management. In: 20th Asia-Pacific network operations and management symposium (APNOMS).

12. Warabino, Suzuki, Otani. Robotic assistance operation for effective on-site network maintenance works. In: 22nd Asia-Pacific network operations and management symposium (APNOMS).

13. Hess, Kohler, Rapp, et al. Real-time loop closure in 2D LIDAR SLAM. In: Proceedings of the IEEE international conference on robotics and automation (ICRA).

14. Shan, Englot, Meyers, et al. LIO-SAM: Tightly-coupled lidar inertial odometry via smoothing and mapping. In: IEEE/RSJ international conference on intelligent robots and systems (IROS).

15. Cheng, Han, Jiang, et al. Intelligent spatial perception by building hierarchical 3D scene graphs for indoor scenarios with the help of LLMs. In: World robot conference symposium on advanced robotics and automation (WRC SARA).

16. Hughes, Chang, et al. Foundations of spatial perception for robotics: Hierarchical representations and real-time systems. Int J Rob Res 43: 1457–1505.

17. Kuwajerwala, Morin, et al. ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning. In: 2024 IEEE international conference on robotics and automation (ICRA). IEEE, pp. 5021–5028.

18. Wang, et al. Research on omnidirectional movement and environment detection function of inspection robot. In: 5th international conference on mechanical, control and computer engineering (ICMCCE).

19. Rana, Haviland, Garg, et al. SayPlan: Grounding large language models using 3D scene graphs for scalable task planning. In: Conference on robot learning.

20. https://github.com/ultralytics/ultralytics

21. Cheng, Jiang, Han, et al. Robot navigation based on 3D scene graphs with the LLM tooling. In: World robot conference symposium on advanced robotics and automation (WRC SARA).