Sage Journals: Discover world-class research

Abstract

Graph neural networks (GNNs) and graph transformers (GTs) have shown significant potential in handling graph-structured data. However, GNNs face challenges with over-squashing and over-smoothing, hindering their ability to capture long-range dependencies. GTs can address this through a global attention mechanism, but suffer from high computational overhead due to their quadratic complexity. The selective state space model (SSM), Mamba, known for its linear complexity and excellent performance, offers an attractive alternative. However, Mamba lacks graph inductive biases and handles only sequential data. To ress hese challenges, we propose a new SSM framework with global receptive fields and structure-aware capabilities. We address Mamba’s limitations by repeating node sequences and incorporating a structural encoder to enhance inductive bias. Experiments on eight benchmarks demonstrate competitive accuracy as well as superior speed and scalability over GTs, underscoring the potential of SSMs for graph learning.

Keywords

graph neural networks selective state space model graph representation learning structure-aware mamba

1. Introduction

Graph-structured data are a crucial and ubiquitous form in real-world applications. Social networks (Yang et al., 2017), traffic networks (Bai et al., 2020), and protein molecules (Wang et al., 2023) can all be effectively modeled as graphs. Consequently, the processing and analysis of graph-structured data constitute a significant and widely studied research field.

In recent years, graph neural networks (GNNs) (Kipf & Welling, 2017) have been extensively researched and widely applied. These deep learning techniques are specifically designed for graph-structured data, enabling them to effectively capture the complex relationships between nodes and edges. A classic GNN framework is the message passing neural network (MPNN) (Gilmer et al., 2017). MPNNs iteratively update node features, enabling each node to progressively perceive a broader context. However, due to issues such as over-smoothing (Chen et al., 2020) and over-squashing (Topping et al., 2022), GNNs are primarily suited for extracting local features and perform poorly in capturing long-range dependencies between nodes.

The Transformer (Vaswani et al., 2017), with its global attention mechanism, has achieved remarkable results in natural language processing and computer vision. Applying it to graph learning tasks to address the limitations of GNNs is a natural idea. Graph transformers (GTs) have emerged as a popular alternative to MPNNs (He et al., 2023; Ma et al., 2023; Rampášek et al., 2022; Yun et al., 2019), demonstrating remarkable performance across various graph tasks. The global attention mechanism facilitates direct modeling of long-range interactions between nodes by enabling attention to any node pair within the graph. This mechanism effectively mitigates issues such as over-smoothing and over-squashing by adaptively learning interaction relationships(Diao & Loynd, 2023; Kreuzer et al., 2021). However, the global attention mechanism overlooks the inherent inductive bias in graphs, namely the structural information, which leads to suboptimal performance of GTs on certain datasets (Liu et al., 2024a; Xing et al., 2024).

Furthermore, because GTs calculate attention for each pair of nodes with a complexity of $O (N^{2})$ , it becomes impractical for GTs to scale to larger graphs. To reduce the high computational costs, linear attention methods have emerged. The core step of linear attention involves approximating or restructuring the attention mechanism to reduce the complexity from $O (N^{2})$ to $O (N)$ . Inspired by Big Bird’s linear attention in natural language processing(Zaheer et al., 2020), related linear attention models have also appeared in the graph domain(Rampášek et al., 2022; Shirzad et al., 2023). Although these linear attention models significantly reduce the computational complexity of attention on graphs, they inevitably suffer from performance loss. Therefore, current research efforts focus on designing models with low complexity that do not compromise performance compared to GTs.

In recent years, state space models (SSMs) have witnessed significant advancements. Numerous SSM-based methods, such as linear state-space layers (LSSL) (Gu et al., 2021), structured state space sequence model (S4) (Gu et al., 2022b), and S4D (Gu et al., 2022a), excel at handling sequential data across various tasks and modes, particularly in modeling long-range dependencies (Zhu et al., 2024). These methods exhibit near-linear computational complexity, which is lower than that of Transformers, making them more efficient for processing long sequences. However, these SSM methods compress the context into smaller states, which restricts their memory capacity and consequently limits their performance. Furthermore, the parameters of SSMs remain static across different inputs, which means they lack the ability to perform input-specific reasoning. As a result, SSMs often exhibit inferior performance compared to Transformers.

To address this, Gu and Dao (2023) proposed Mamba, a selective state space model (SSM). Unlike traditional SSMs that compress all historical information, Mamba incorporates a simple selection mechanism by parameterizing the inputs of the SSM. This allows the model to selectively process information, focusing on or ignoring specific inputs. Mamba can filter out irrelevant information while retaining relevant information for an extended period. It is characterized by a balance between efficiency and effectiveness, outperforming Transformers of the same scale in text sequence tasks. In addition, its computational complexity scales linearly with sequence length.

Given Mamba’s exceptional performance in text modeling, applying it to the graph domain presents a promising strategy. However, graph-structured data differ significantly from text sequences, posing two primary challenges for using Mamba on graphs.First, it is necessary to transform graph data into sequence data similar to text to fit Mamba’s ordered input requirements. Second, Mamba must learn and utilize the structural information inherent to graphs. Graph-Mamba (Wang et al., 2024) addresses these challenges by sorting node sequences based on node degree, thereby converting the graph structure into a sequence structure, and by replacing the Transformer with Mamba in the GraphGPS framework to apply Mamba to graphs. Additionally, Graph Mamba (Behrouz & Hashemi, 2024) enhances the structural information of input sequences through multiple random walks and uses a bidirectional Mamba to process unordered graph data.

However, these methods do not adequately address the two primary challenges Mamba encounters on graphs: unidirectional modeling and the lack of structural awareness. To tackle these challenges, we propose the structure-aware Mamba (SAM) framework. As shown in Figure 1, this framework combines a structure-aware selective mechanism for leveraging graph structural features and a symmetric input mechanism for global context modeling. First, we input the graph data into the Structural Information Encoder (SAMEncoder) to extract structural features. Next, we input the node features, edge features, and structural features of the graph into multiple SAMLayers. Finally, the learned graph node representations are fed into the prediction layer for downstream tasks. Each SAMLayer integrates both a structure-aware selective mechanism and a symmetric input mechanism. Additionally, each SAMLayer includes both an MPNN and an MLP to enhance model performance.

Figure 1.

The overview of structural-aware Mamba (SAM) framework. We first extract the structural features of the graph through the SAMEncoder. Then, we input the structural features, node features, and edge features into multiple SAMLayers to obtain the graph representation. Each SAMLayer consists of a structure-aware selection SSM, a symmetric input mechanism (SYM), an MPNN, and an MLP.

Compared to GNNs and GTs, SAM is a method featuring a global receptive field, low computational cost, and high scalability. Unlike direct fusion, which simply sums forward and backward Mamba outputs without distinction, the symmetric input mechanism enables each node to perceive the entire graph before entering the Mamba module. Leveraging Mamba’s selective gating, the model adaptively integrates signals from multiple directions. This design mitigates the performance degradation observed in direct fusion (see Table 1). The structure-aware selective mechanism leverages the structural information of the graph to focus on or filter context information, enabling SAM to concentrate on important nodes with strong inductive biases. Compared to other methods, the SAM shows competitive results across all datasets and outperforms others on multiple datasets. Additionally, our method is more cost-effective than GTs, exhibiting linear scalability and the capability to scale to graphs with thousands of nodes.

Table 1.

Comparison of Direct Fusion and Symmetric Input Mechanism.

	PascalVOC-SP	Peptides-func	CIFAR10	PATTERN	MNIST	Computer	Photo	Physics
Model	F1 score $↑$	AP $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$
Direct fusion	0.3910 $\pm$ 0.0130	0.6654 $\pm$ 0.0121	0.7523 $\pm$ 0.0041	0.8670 $\pm$ 0.0012	0.9813 $\pm$ 0.0009	0.9109 $\pm$ 0.0005	0.9497 $\pm$ 0.0040	0.9675 $\pm$ 0.0009
SYM (ours)	0.4105 $\pm$ 0.0050	0.6796 $\pm$ 0.0073	0.7605 $\pm$ 0.0040	0.8697 $\pm$ 0.0023	0.9829 $\pm$ 0.0005	0.9161 $\pm$ 0.0051	0.9559 $\pm$ 0.0037	0.9722 $\pm$ 0.0019

Our main contributions can be summarized as follows:

We propose a SAM framework that applies Mamba to graph-related tasks. This framework integrates a structure-aware selective mechanism, a symmetrical input mechanism, and MPNN to enhance the modeling capability for graph-structured data.

Our method demonstrates low computational complexity and high scalability, with a lower computational cost than GTs. The runtime of the proposed method is similar to that of linear attention with achieving higher accuracy.

We perform extensive experiments across multiple datasets. The results demonstrate that the proposed method consistently and significantly outperforms GNNs and GTs in terms of accuracy and precision across different datasets.

2. Related Works

2.1. Message Passing Neural Networks

Traditional GNNs, such as MPNN (Gilmer et al., 2017) and GCN (Kipf & Welling, 2017), use message passing to aggregate neighbor information. However, they face limitations like over-smoothing (Chen et al., 2020) and over-squashing (Topping et al., 2022), which hinder the effectiveness of stacking multiple layers and restrict their receptive fields. To tackle these issues, expanding the receptive field of GNNs is crucial. One strategy is altering the graph’s topology. For instance, Half-hop (Azabou et al., 2023) introduces slow nodes to each edge, effectively upsampling the graph and mitigating over-smoothing and over-squashing by decelerating message passing. Another strategy is graph rewiring (Abboud et al., 2022; Barbero et al., 2024; Deac et al., 2022; Di Giovanni et al., 2023; Gutteridge et al., 2023; Karhadkar et al., 2023; Qian et al., 2024; Sonthalia et al., 2023), which enables direct communication between originally non-adjacent nodes, thus breaking information bottlenecks and enhancing learning efficiency.

The essence of these methods is to modify the scope of message passing while adhering to traditional messaging constraints. Traditional mechanisms compel each node to pass messages indiscriminately, without assessing the importance of the messages. Consequently, novel paradigms have been proposed (Finkelshtein et al., 2024), offering more flexible and dynamic strategies. In these paradigms, nodes autonomously determine their update strategies based on their states, thereby exploring the graph’s topological structure more effectively. Busch et al. (2020), Errica et al. (2023), Faber and Wattenhofer (2024), Park et al. (2023) also address this issue by introducing new mechanisms to enhance message passing flexibility.

2.2. Graph Transformers

GTs establish a fully connected graph through a global attention mechanism, providing a comprehensive receptive field and addressing issues like over-squashing and over-smoothing inherent in GNNs. However, GTs often overlook the inductive biases intrinsic to graphs, such as structural information, making it challenging to effectively utilize this information. Hence, recent research has focused on integrating graph structural information into GTs to enhance their effectiveness.

Current methods to achieve this can be categorized into three main approaches (Min et al., 2022): (a) Using MPNNs as auxiliary modules to encode local information. MPNNs capture local structural features and can serve as inputs or enhance the outputs of GTs (Diao & Loynd, 2023; Kreuzer et al., 2021; Rampášek et al., 2022). For example, GraphGPS (Rampášek et al., 2022) combines MPNN modules with GT outputs to improve performance. (b) Proposing various position and structural encodings and concatenating them with node features. Embedding structural information into node features enables GTs to better incorporate graph structures (Dwivedi et al., 2022a; Rampášek et al., 2022). (c) Modifying the attention mechanism to enhance its bias towards graph structures. Liu et al. argue that GTs’ over-globalization shifts attention excessively to distant nodes, neglecting nearby nodes with useful information (Liu et al., 2024a; Ma et al., 2023; Xing et al., 2024). Xing et al. (2024) propose a dual-level global Graph Transformer with inter-cluster and intra-cluster Transformers to mitigate over-globalization. Gradformer (Liu et al., 2024a) combines GTs with inherent inductive biases by applying an exponential decay mask to the attention matrix, balancing the focus on distant and local nodes.

2.2.1. Transformers Costs

Transformers often face challenges related to high complexity and overhead, particularly GTs compared to MPNNs. One solution is reducing attention mechanism complexity or the number of nodes involved in attention computation (Chen et al., 2023; Kong et al., 2023; Rampášek et al., 2022; Shirzad et al., 2023; Wu et al., 2022). However, these methods often lead to performance loss or compromise the global receptive field, highlighting the trade-off between computational cost and effectiveness.

2.3. SSMs

Recently, Structured State Space Models (SSMs) have emerged as promising architectures in sequence modeling (Gu et al., 2022a, 2022b, 2021). A notable example is Mamba (Gu & Dao, 2023), known for its strong and efficient performance. Unlike traditional SSMs, Mamba uses a selection mechanism to filter irrelevant information, retaining relevant data for longer periods. This efficiency and effectiveness allow Mamba to outperform similarly scaled Transformers on text sequence tasks, with linear computational complexity with respect to sequence length. Consequently, researchers are exploring Mamba’s applications in various fields, including vision (Liu et al., 2024b; Zhu et al., 2024), video understanding (Li et al., 2024a), point cloud analysis (Liang et al., 2024), spatio-temporal graph learning (Li et al., 2024b), and graph representation learning (Behrouz & Hashemi, 2024; Wang et al., 2024).

Applying Mamba to graph-structured data presents two challenges: converting graph data into sequence-like data and utilizing structural information. Wang et al. (2024) integrate Mamba into the GraphGPS framework, replacing its attention mechanism with a degree-based ordering to handle Mamba’s unidirectional input. Behrouz and Hashemi (2024) adopts a similar approach by tokenizing node subgraphs and using bidirectional Mamba. However, these methods do not fully resolve Mamba’s unidirectional input problem. Bidirectional Mamba increases sensitivity to input order, but simply concatenating directions does not leverage information from both effectively. Additionally, these methods do not integrate graph structural information into the state space model to guide node updates.

3. Method

The goal of SAM is to incorporate the SSM, Mamba, into graph structures. This section first introduces the foundational concepts of the state space model. Next, an overview of the SAM framework is provided. Then, we describe in detail how our method leverages the structure-aware selective mechanism and the symmetric input mechanism to process graph data, and continue to explain the architectural details of the SAM framework. Finally, this section analyzes the efficiency of the proposed SAM framework.

3.1. Preliminaries

State space models, inspired by continuous systems, utilize an implicit latent state $h (t) \in R^{N}$ to map a 1-dimensional function or sequence $x (t) \in R \mapsto y (t) \in R$ , $h (t) \in R^{N}$ . This system uses $A_{s} \in R^{N \times N}$ as the evolution parameter and $B \in R^{N \times 1}$ , $C \in R^{1 \times N}$ as the projection parameters. Note that here $A_{s}$ denotes the state transition matrix of the SSM; later we use $A$ to denote the adjacency matrix of a graph. The continuous system works as follows:

\begin{aligned} h^{'} (t) & = A_{s} h (t) + B x (t), \\ y (t) & = C h (t) . \end{aligned}

(1)Here,

h^{'} (t)

denotes the derivative of the latent state with respect to continuous time

t

. Thus, Equation (1) models the instantaneous rate of change of the state at time

t

, where both

h (t)

and

x (t)

are naturally indexed by the same time variable.

Mamba is an enhancement of the S4 architecture. The SSM S4 represents a discrete version of continuous systems, converting continuous parameters into discrete parameters via a time step parameter $Δ$ to adapt to real-world discrete input data:

\begin{aligned} \bar{A_{s}} & = \exp (Δ A_{s}), \\ \bar{B} & = (Δ A_{s})^{- 1} (\exp (Δ A_{s}) - I) \cdot Δ B . \end{aligned}

(2)

After discretization transformation, given an input sequence $x$ , the dynamics are no longer expressed in terms of derivatives but as a recursive update:

\begin{aligned} h_{t} & = \bar{A_{s}} h_{t - 1} + \bar{B} x_{t}, \\ y_{t} & = C h_{t} . \end{aligned}

(3)The new state

h_{t}

depends on the previous state

h_{t - 1}

and the current input

x_{t}

A key challenge for sequence models such as RNNs and SSMs is the compression of context into a reduced state, which limits their effectiveness based on the extent of context compression. Mamba’s selective mechanism aims to retain a compact yet informative state, balancing efficiency and performance. Specifically, Mamba learns various transformations and applies them to $x$ to compute the parameters $Δ$ , $B$ , and $C$ , thus making these parameters functions of the input $x$ . The selective mechanism enables the model to adaptively select relevant information from the context.

3.2. Overview of Structural-Aware Mamba Framework

An overview of the proposed SAM framework is shown in Figure 1. The original Mamba is designed for one-dimensional sequential data. To handle graph-structured data, this paper proposes the SAM framework. The SAM framework includes a structural information encoder, a structure-aware selective mechanism, a symmetric input mechanism, as well as MPNN and final MLP modules. The SAM framework first inputs the graph into the SAM Encoder to extract structural features. The symmetric input mechanism addresses the unidirectional memory problem of Mamba (Figure 2(a)), enabling it to comprehensively consider information from all nodes in the graph. The structure-aware selective mechanism enhances Mamba’s inductive bias, giving it structural awareness to consider the graph’s structural information. The MPNN and MLP modules are used to further enhance SAM. To simplify the presentation, we omit the normalization and residual connections after each module. As shown in Figure 1, in the framework, we stack multiple SAMLayers. Each SAMLayer consists of a structure-aware selection SSM, a symmetric input mechanism(SYM), an MPNN, and an MLP.

Figure 2.

Comparison of input mechanisms for state space models. (a) Standard sequential processing follows the order $x_{1}^{l} \to x_{N}^{l}$ , where node $x_{i}^{l}$ can only access preceding nodes ${x_{1}^{l}, \dots, x_{i}^{l}}$ (gray arrows indicate blocked information flow from subsequent nodes $x_{i + 1}^{l}$ to $x_{N}^{l}$ ). (b) Our proposed method constructs a symmetric sequence $[x_{1}^{l}, \dots, x_{N}^{l}, x_{1}^{l}, \dots, x_{N}^{l}]$ through bidirectional concatenation, enabling each node to aggregate features from all graph nodes via forward and backward state transitions. This design eliminates positional bias while preserving the permutation invariance of graph data. For simplicity, the diagram omits the reversing operations of the symmetric input mechanism.

3.3. Structure-Aware Selection Mechanism

The original Mamba is designed for 1-dimensional sequential data, aligning well with inputs for NLP tasks. However, graph-structured data is inherently unordered, lacking a specific sequence, and node sequences do not contain structural information. Mamba is thus unsuitable for graph tasks that require structure-aware understanding. To address graph-structured data, we proposes a structure-aware module. The structure-aware module consists of a structural aware encoder and a structure-aware transformation function. The structural aware encoder is a trainable module used to extract structural information from the graph. As show in Figure 1, the input to the structural aware encoder is the entire graph, and the output is the structural information contained in each node of the graph.

\begin{aligned} S = SAMEncoder (X, E, A), \end{aligned}

(4)where

A \in R^{N \times N}

is the adjacency matrix of a graph with N nodes and E edges;

X \in R^{N \times d_{n}}

and

E \in R^{E \times d_{e}}

are the

d_{n}

-dimensional node and

d_{e}

-dimensional edge features, respectively;

SAMEncoder

is structural aware encoder with corresponding learnable parameters;

S \in R^{N \times d_{s}}

is the structural information output by the structure-aware encoder.

To enable the model to incorporate structural information from the graph to focus on or filter context information, we modified Mamba’s selective mechanism. The structure-aware transformation is a learnable weight that allows $Δ$ , $B$ , and $C$ to be functions of the node features $X$ and the node structural information $S$ . Specifically, the input to the structure-aware transformation is $X$ and the node structural information $S$ , and the output is $Δ$ , $B$ , and $C$ , as follows:

\begin{aligned} Δ, B, C = S_{SAM} (X, S), \end{aligned}

(5)where

S_{SAM}

is structure-aware transformation.

Then, according to Equations (2) and (3), the discretization and SSM processes yield the output $Y$ . At this point, these three parameters have structural awareness, allowing the model to adaptively select relevant information from the context by combining the features of the nodes themselves with the structural features contained in the nodes. Through the structure-aware module, the model possesses a global receptive field and has the capability of graph structure awareness.

3.4. Symmetric Input Mechanism

Since graph data does not have a fixed input order, directly inputting the node features of the graph into Mamba will result in most nodes being unable to utilize information from the entire graph. This is because nodes inputted earlier cannot use information from nodes inputted later. As illustrated in Figure 2(a), suppose the input to the $l$ -th layer of the model is $X^{l} = {x_{1}^{l}, x_{2}^{l}, \dots, x_{i}^{l}, x_{i + 1}^{l}, \dots, x_{N}^{l}}$ , where $x_{i}^{l}$ is the input of node $i$ at layer $l$ . When calculating the output for the $i$ -th node, it will be unable to utilize information from $x_{i + 1}^{l}$ to $x_{n}^{l}$ , leading to biases in the model’s performance. Therefore, we proposes a symmetric input mechanism, enabling the state space model to consider information from all nodes in the graph when computing the output for each node. The symmetric input module transforms the model’s input by reversing the input and concatenating it to the end of the original input sequence, forming a symmetric structure.

\begin{aligned} X_{r e v}^{l} & = reverse (X^{l}), \\ X_{n e w}^{l} & = X^{l} | | X_{r e v}^{l}, \\ X_{S}^{l + 1} & = reverse ({SAM}^{l} (X_{n e w}^{l}) [N + 1 : 2 N]), \end{aligned}

(6)where

reverse

is the reversing operation,

| |

is the concatenation operation, SAM is the Structure Aware Selection SSM, and

[N + 1 : 2 N]

denotes selecting the output features starting from

N + 1

to the end.

After reversing and concatenating, the sequence $X_{new}^{l}$ has a length of $2 N$ . This sequence is processed by the SAM. We extract the latter half of its output, restore the original order, and use it as input for the next part of the model. According to Equation (6), $X_{new}^{l} = {x_{0}^{l}, \dots, x_{n}^{l}, x_{n}^{l}, \dots, x_{0}^{l}} .$ According to Equation (3), when the state space model calculates $x_{i + 1}^{(l + 1)}$ , it considers the previous states, which include the graph information { $x_{0}^{l}, \dots, x_{n}^{l}, x_{n}^{l}, \dots, x_{i + 1}^{l}$ }. Therefore, each output of node can consider the global context information.

Flexible Symmetric Input. While the symmetric input mechanism enables bidirectional context integration and ensures global graph awareness by allowing all nodes to participate in target node computation, it inevitably doubles the input sequence length, leading to increased computational overhead and potential risks of overfitting and convergence difficulties. To address these limitations, we design a flexible symmetric input strategy that selectively activates the mechanism only in early layers while employing standard sequential input in subsequent layers. This approach significantly reduces computational costs while maintaining contextual information integration. As illustrated in Figure 2(b), our implementation activates the symmetric input solely in the first layer, followed by standard sequential processing in later layers (e.g., Layers 2-3). Crucially, this design preserves global graph context propagation across layers — for instance, node $x_{i}^{3}$ in Layer 3 can still access information from all nodes in Layer 1 through the bidirectional information flow established by the first layer symmetric processing.

3.5. Instances and Additional Details

The structural information encoder can be any neural network. In this paper, we use GatedGCN as an example of the structural information encoder in Equation (4). GatedGCN can aggregate and extract structural information from nodes and their neighborhoods, capturing the local structural features of the nodes. For structure-aware transformation $S_{SAM}$ in Equation (5), we specifically choose

\begin{aligned} S_{B} (X, S) & = Linear (X, S), \\ S_{C} (X, S) & = Linear (X, S), \\ S_{Δ} (X, S) & = softplus (Linear (X, S) + P a r a m e t e r_{Δ}), \end{aligned}

(7)as examples.

According to previous research (Dwivedi et al., 2022a; Rampášek et al., 2022), positional encodings (PE) provide positional information to the model, whereas structural encodings (SEs) supply structural information. To enhance the structure-aware framework, we consider an optional step to incorporate structural and positional encodings into the initial features of the nodes. Following the approach of GPS (Rampášek et al., 2022), we concatenate PE or SE with the node features. Additionally, at each layer, we aggregate the outputs of the MPNN layer with those of the global attention layer to update the features:

\begin{aligned} X_{M}^{l + 1}, E^{l + 1} & = MPNN (X^{l}, E^{l}, A), \\ X^{l + 1} & = ML P^{l} (X_{S}^{l + 1} + X_{M}^{l + 1}) . \end{aligned}

(8)

3.6. Comparison with Message Passing

In SAM, converting graph nodes into sequential inputs is a key design choice aimed at overcoming the limitations of traditional MPNN in terms of receptive field and computational efficiency. Although this serialization process may intuitively seem to weaken the structural inductive bias inherent to graphs, the model appears less ”aligned” with the graph structure compared to message passing; in fact, it offers several advantages. Both SAM and a range of GT-based methods adopt sequence modeling techniques that possess strong capabilities for capturing long-range dependencies, enabling each node to directly access information from any other node without suffering from the inefficiency of multi-hop propagation. To compensate for the potential loss of structural information during serialization, we design a structure encoder that leverages the strengths of message passing to explicitly inject structural constraints. This allows SSMs like Mamba to become aware of nodes’ relative positions and structural roles within the graph, thereby enhancing their overall representational capacity.

4. Experiment

We evaluated our framework on eight datasets, covering scenarios with long-range dependencies, small-scale (tens of thousands of graphs), large-scale (60,000 graphs), and large-graph (ten thousands of nodes) cases. Our framework demonstrated state-of-the-art performance in many situations. Finally, we conducted ablation studies on three datasets to evaluate the contributions of each module, including the symmetric input mechanism, SAM mechanism, MPNN, and structural or positional encoding.

4.1. Experiment Details

We evaluated our framework on datasets from various sources to ensure diversity. Initially, we conducted tests on two datasets from Benchmarking GNNs (Dwivedi et al., 2023), which are classical graph benchmarks intended to assess model performance on various graph prediction tasks, including graph classification (CIFAR10) and node classification (PATTERN). Next, we evaluated our method on the Long-Range Graph Benchmark (LRGB) (Dwivedi et al., 2022b), encompassing datasets designed to test the model’s ability to capture long-range dependencies in input graphs. We chose two datasets that cover graph classification (Peptides-func) and node classification (PascalVOC-SP). Furthermore, we evaluated the scalability and efficiency of our method using the Coauthor Physics dataset (Shchur et al., 2018), a large-graph dataset containing graphs with up to 34493 nodes each. In this evaluation, we compared our approach with Graph-Mamba, GatedGCN and traditional graph attention mechanisms and linear attention mechanisms under the GraphGPS framework. We assessed the training and inference times per epoch and the final accuracy of each method. Finally, we conducted ablation studies on three datasets to evaluate the contributions of the symmetric input mechanism, the SAM mechanism, the Graph-to-Sequence conversion and MPNN. Moreover, we performed a controlled comparison between our symmetric input mechanism and the direct fusion strategy adopted in bidirectional Mamba, further clarifying their distinctions and highlighting the advantages of our adaptive fusion design. Results are averaged over three runs with random seeds 0, 1, and 2, and are reported as mean and standard deviation. Additional details are provided as follows.

4.1.1. Dataset Description

The descriptions and details of the datasets used in our experiments are provided below. Table 2 shows the statistical data of the datasets.

Table 2.
Dataset Statistics.

Dataset Graphs Avg. nodes Avg. edges Prediction level Prediction task Metric

CIFAR10 60,000 117.6 941.1 graph 10-class classif. Accuracy

MNIST 70,000 70.6 564.5 graph 10-class classif. Accuracy

PATTERN 14,000 118.9 3,039.3 inductive node binary classif. Accuracy

Coauthor Physics 1 34493 247962 node 5-class classif. Accuracy

Amazon Computer 1 13381 245778 node 10-class classif. Accuracy

Amazon Photo 2 7487 119043 node 8-class classif. Accuracy

PascalVOC-SP 11,355 479.4 2,710.5 inductive node 21-class classif. F1

Peptides-func 15,535 150.9 307.3 graph 10-task classif. Average Precision

Dataset	Graphs	Avg. nodes	Avg. edges	Prediction level	Prediction task	Metric
CIFAR10	60,000	117.6	941.1	graph	10-class classif.	Accuracy
MNIST	70,000	70.6	564.5	graph	10-class classif.	Accuracy
PATTERN	14,000	118.9	3,039.3	inductive node	binary classif.	Accuracy
Coauthor Physics	1	34493	247962	node	5-class classif.	Accuracy
Amazon Computer	1	13381	245778	node	10-class classif.	Accuracy
Amazon Photo	2	7487	119043	node	8-class classif.	Accuracy
PascalVOC-SP	11,355	479.4	2,710.5	inductive node	21-class classif.	F1
Peptides-func	15,535	150.9	307.3	graph	10-task classif.	Average Precision

CIFAR10 and MNIST (Dwivedi et al., 2023) Image datasets adapted for graph classification via 8-NN graphs from SLIC superpixels, solving 10-class classification tasks. The standard dataset splits follow the original image classification datasets, i.e., for MNIST 55K/5K/10K and for CIFAR10 45K/5K/10K train/validation/test graphs.

PATTERN (Dwivedi et al., 2023) A synthetic node classification dataset based on the Stochastic Block Model, where the goal is to identify nodes belonging to specific subgraph patterns. We follow the splits proposed in Dwivedi et al. (2023), with 10,000 graphs for training, 2,000 for validation, and 2,000 for testing, and use them consistently across all models to ensure fair comparison.

PascalVOC-SP (Dwivedi et al., 2022b) A node classification dataset derived from Pascal VOC images via superpixelization, used for semantic segmentation of object classes. Following Dwivedi et al. (2022b), we adopt the predefined split of 8,498 graphs for training, 1,428 for validation, and 1,429 for testing, which is consistently applied across all models.

Peptides-func (Dwivedi et al., 2022b) A multi-label graph classification dataset of peptide molecular graphs from SATPdb, with long-range dependencies and large graph diameters. Following Dwivedi et al. (2022b), we adopt the predefined 70%-15%-15% split for training, validation, and testing, which is kept consistent across all models.

Physics (Shchur et al., 2018) A node classification dataset of co-authorship graphs from Microsoft Academic Graph, where nodes represent authors and edges represent co-authorship.We adopt the fixed split from Shirzad et al. (2023), with 60% for training, 20% for validation, and 20% for testing.

Amazon datasets (Shchur et al., 2018) Node classification datasets from Amazon co-purchase graphs, including Amazon Computer and Amazon Photo, where nodes are products and edges indicate frequently co-purchased items.We adopt the fixed split from Shirzad et al. (2023), with 60% for training, 20% for validation, and 20% for testing.

4.1.2. Baselines

We compared SAM with a set of popular MPNNs (GCN, GIN, GatedGCN), Graph Transformers (GraphGPS, SAN, Exphormer), and Graph-Mamba, which is related to our work.

GCN (Kipf & Welling, 2017) applies convolutional operations on graphs by aggregating neighborhood features to update node representations.

GIN (Xu et al., 2019) uses sum aggregation and MLPs to achieve high expressive power, matching the Weisfeiler-Lehman test.

GatedGCN (Bresson & Laurent, 2017) integrates gating mechanisms into graph convolutions for improved feature propagation and control.

GraphGPS (Rampášek et al., 2022) combines message passing and graph Transformers in a modular framework for rich representation learning.

SAN (Kreuzer et al., 2021) uses learned positional encodings and conditional attention to enhance structure-aware representation.

Exphormer (Shirzad et al., 2023) employs sparse attention via virtual nodes and expander graphs to scale efficiently to large graphs.

Graph-Mamba (Wang et al., 2024) replaces attention with the Mamba module in GraphGPS and uses degree-based sorting to process graphs effectively.

4.1.3. Hyperparameters

Given the large number of hyperparameters and datasets, we did not conduct an exhaustive or a grid search. We adopted several hyperparameter settings from GraphGPS(Rampášek et al., 2022). The final hyperparameter configurations are summarized in Table 3.

Table 3.
SAM Hyperparameters for Eight Datasets.

Hyperparameter CIFAR10 MNIST PATTERN PascalVOC-SP Peptides-func Physics Computer Photo

# SAM Layers 4 4 5 4 4 4 4 4

Hidden dim 52 52 96 96 96 72 80 64

Positional Encoding LapPE LapPE LapPE LapPE LapPE – – –

PE dim 8 8 16 16 16 – – –

Batch size 64 128 64 64 256 – – –

Learning Rate 0.005 0.005 0.001 0.006 0.01 0.001 0.01 0.001

# Epochs 100 100 100 300 200 70 150 100

# Warmup epochs 5 5 5 10 5 5 5 5

Weight decay 0.1 0.1 0.1 0.1 0.15 0.001 0.001 0.001

Optimizer AdamW AdamW AdamW AdamW AdamW AdamW AdamW AdamW

Hyperparameter	CIFAR10	MNIST	PATTERN	PascalVOC-SP	Peptides-func	Physics	Computer	Photo
# SAM Layers	4	4	5	4	4	4	4	4
Hidden dim	52	52	96	96	96	72	80	64
Positional Encoding	LapPE	LapPE	LapPE	LapPE	LapPE	–	–	–
PE dim	8	8	16	16	16	–	–	–
Batch size	64	128	64	64	256	–	–	–
Learning Rate	0.005	0.005	0.001	0.006	0.01	0.001	0.01	0.001
# Epochs	100	100	100	300	200	70	150	100
# Warmup epochs	5	5	5	10	5	5	5	5
Weight decay	0.1	0.1	0.1	0.1	0.15	0.001	0.001	0.001
Optimizer	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW	AdamW

4.2. Benchmarking GNNs

We evaluated our method on two datasets from the Benchmarking GNNs (i.e., CIFAR10, and PATTERN). The results are shown in Table 4. We observe that our method outperforms all baseline methods across these datasets, achieving the highest performance. This indicates that our method outperforms attention mechanisms in both node and graph classification tasks, highlighting the strong modeling capability of our structure-aware SSM in effectively handling common graph-based prediction tasks.

Table 4.
Test Performance on Benchmarking GNNs.

CIFAR10 PATTERN

Model Accuracy $↑$ Accuracy $↑$

GCN (Kipf & Welling, 2017) 0.5571 $\pm$ 0.0038 0.7189 $\pm$ 0.0033

GIN (Xu et al., 2019) 0.5526 $\pm$ 0.0153 0.8539 $\pm$ 0.0014

GatedGCN (Bresson & Laurent, 2017) 0.6731 $\pm$ 0.0031 0.8557 $\pm$ 0.0009

GPS (Rampášek et al., 2022) 0.7230 $\pm$ 0.0036 0.8669 $\pm$ 0.0006

SAN (Kreuzer et al., 2021) - 0.8658 $\pm$ 0.0004

Exphormer (Shirzad et al., 2023) 0.7475 $\pm$ 0.0019 0.8673 $\pm$ 0.0001

Graph-Mamba (Wang et al., 2024) 0.7370 $\pm$ 0.0034 0.8671 $\pm$ 0.0005

SAM(ours) 0.7605 $\pm$ 0.0040 0.8697 $\pm$ 0.0023

Highlighted are the top first, and second results.

	CIFAR10	PATTERN
GCN (Kipf & Welling, 2017)	0.5571 $\pm$ 0.0038	0.7189 $\pm$ 0.0033
GIN (Xu et al., 2019)	0.5526 $\pm$ 0.0153	0.8539 $\pm$ 0.0014
GatedGCN (Bresson & Laurent, 2017)	0.6731 $\pm$ 0.0031	0.8557 $\pm$ 0.0009
GPS (Rampášek et al., 2022)	0.7230 $\pm$ 0.0036	0.8669 $\pm$ 0.0006
SAN (Kreuzer et al., 2021)	-	0.8658 $\pm$ 0.0004
Exphormer (Shirzad et al., 2023)	0.7475 $\pm$ 0.0019	0.8673 $\pm$ 0.0001
Graph-Mamba (Wang et al., 2024)	0.7370 $\pm$ 0.0034	0.8671 $\pm$ 0.0005
SAM(ours)	0.7605 $\pm$ 0.0040	0.8697 $\pm$ 0.0023

4.3. Long-Range Graph Benchmark

We evaluated our method on two datasets from the Long-Range Graph Benchmark (LRGB). The evaluation results of our method on these datasets are shown in Table 5. Our method achieved competitive performance on these datasets, outperforming all MPNN and GT-based methods on the PascalVOC-SP, and Peptides-func datasets. This indicates that our framework possesses a significant advantage in capturing long-range dependencies in graph data compared to MPNN and GT. This advantage is attributable to our symmetric input mechanism and structure-aware mechanism, which endow Mamba with a global receptive field. This enables it to fully consider the structural information of the graph, thereby better capturing long-range dependencies in graph data.

Table 5.
Test Performance on Long-Range Graph Benchmarks (LRGBs).

PascalVOC-SP Peptides-func

Model F1 score $↑$ AP $↑$

GCN 0.1268 $\pm$ 0.0060 0.5930 $\pm$ 0.0023

GINE 0.1265 $\pm$ 0.0076 0.5498 $\pm$ 0.0079

GatedGCN 0.2873 $\pm$ 0.0219 0.5864 $\pm$ 0.0077

GPS 0.3748 $\pm$ 0.0109 0.6535 $\pm$ 0.0041

SAN 0.3230 $\pm$ 0.0039 0.6384 $\pm$ 0.0121

Exphormer 0.3966 $\pm$ 0.0027 0.6527 $\pm$ 0.0043

Graph-Mamba 0.4191 $\pm$ 0.0126 0.6739 $\pm$ 0.0087

SAM(ours) 0.4105 $\pm$ 0.0050 0.6796 $\pm$ 0.0073

The first and second best results are highlighted.

	PascalVOC-SP	Peptides-func
GCN	0.1268 $\pm$ 0.0060	0.5930 $\pm$ 0.0023
GINE	0.1265 $\pm$ 0.0076	0.5498 $\pm$ 0.0079
GatedGCN	0.2873 $\pm$ 0.0219	0.5864 $\pm$ 0.0077
GPS	0.3748 $\pm$ 0.0109	0.6535 $\pm$ 0.0041
SAN	0.3230 $\pm$ 0.0039	0.6384 $\pm$ 0.0121
Exphormer	0.3966 $\pm$ 0.0027	0.6527 $\pm$ 0.0043
Graph-Mamba	0.4191 $\pm$ 0.0126	0.6739 $\pm$ 0.0087
SAM(ours)	0.4105 $\pm$ 0.0050	0.6796 $\pm$ 0.0073

4.4. Larger Graphs

Traditional GT methods, owing to their quadratic complexity, exhibit poor scalability on large graphs and are typically restricted to molecular graphs with relatively few nodes. In contrast, our approach has linear complexity, prompting the question of whether it can scale to larger graphs.We evaluated our method on three large datasets, the largest of which is Physics, containing 34,493 nodes and 247,962 edges. GraphGPS and SAN, both of which also rely on quadratic attention mechanisms, encountered out-of-memory (OOM) errors on Physics (34,493 nodes) and Computer (13,381 nodes). Since each of these datasets consists of a single large graph, reducing the batch size was not feasible. We therefore attempted to lower the hidden dimension to 32, yet the quadratic memory growth still resulted in OOM, whereas Photo (7,487 nodes) could be processed successfully. As shown in Table 6, SAM scales to these large graphs and achieves the best performance among all baselines, demonstrating its ability to handle large-scale graphs with low resource overhead.

Table 6.
Accuracy of Models on Transductive Graph Datasets.

Model Computer Photo Physics

GPS OOM 0.9506 $\pm$ 0.0013 OOM

SAN 0.8983 $\pm$ 0.0016 0.9486 $\pm$ 0.0010 OOM

Exphormer 0.9159 $\pm$ 0.0031 0.9527 $\pm$ 0.0042 0.9716 $\pm$ 0.0003

Graph-Mamba 0.9066 $\pm$ 0.0061 0.9469 $\pm$ 0.0013 0.9695 $\pm$ 0.0020

SAM(ours) 0.9161 $\pm$ 0.0051 0.9559 $\pm$ 0.0037 0.9722 $\pm$ 0.0019

Model	Computer	Photo	Physics
GPS	OOM	0.9506 $\pm$ 0.0013	OOM
SAN	0.8983 $\pm$ 0.0016	0.9486 $\pm$ 0.0010	OOM
Exphormer	0.9159 $\pm$ 0.0031	0.9527 $\pm$ 0.0042	0.9716 $\pm$ 0.0003
Graph-Mamba	0.9066 $\pm$ 0.0061	0.9469 $\pm$ 0.0013	0.9695 $\pm$ 0.0020
SAM(ours)	0.9161 $\pm$ 0.0051	0.9559 $\pm$ 0.0037	0.9722 $\pm$ 0.0019

4.5. Efficiency Evaluation

We report the efficiency evaluation of our method on two large-scale graph datasets: Physics (34,493 nodes and 247,962 edges) and Photo (7,487 nodes and 119,043 edges). The efficiency and accuracy on both datasets are shown in Figure 3. To ensure fairness, we kept the number of layers, feature dimensions, and other parameters consistent across all models and conducted experiments on an A100 (40GB) GPU. We compare the models in terms of parameters, training time and accuracy, and evaluate their performance at 3, 5, and 8 layers. In the figure, line graphs connect scatter points for the same model with different layer counts. Since the traditional quadratic complexity Transformer used in GraphGPS causes out-of-memory issues, the GPS models shown in the figure all use lower-complexity alternatives.

Figure 3.

Evaluation of efficiency and accuracy of SAM and baseline methods. We plot model performance for 3, 5, and 8 layers, connecting scatter points for the same model with different layer configurations. The size of each scatter point corresponds to the number of parameters in the model. (a) Physics dataset and (b) Photo dataset.

Figure 3 shows that GatedGCN, despite having the shortest training time, attains lower accuracy and exhibits limited improvement as the number of layers increases on the Physics and Photo datasets. Our proposed SAM method outperforms all other methods in terms of accuracy. Its training speed is comparable to Graph-Mamba and other methods, and as the number of layers increases, accuracy improves. Moreover, as the number of layers grows, the training time of our method does not increase significantly compared to GPS:Bigbird. Compared to GPS:Performer, our method requires fewer parameters, demonstrating its good scalability and higher accuracy when handling large-scale graphs, outperforming attention-based mechanisms.

4.6. Ablation Study

4.6.1. Component Ablation

To ensure diversity, we evaluated the three main components of our framework—symmetric input mechanism, SAS mechanism, and MPNN—on datasets from Benchmarking GNNs (CIFAR10, PATTERN, and MNIST) and LRGB(Peptides-func). Additionally, we investigated the effect of flipping node features within the symmetric input mechanism. The results are shown in the Table 7: the first row presents the performance of our complete framework, while subsequent rows show the performance after removing individual components—flipping node features in the symmetric input mechanism, the symmetric input mechanism itself, the SAS mechanism, combinations of the symmetric input and SAS mechanisms, and MPNN—while keeping the other components unchanged. The results demonstrate that each component of our framework is effective, with all components contributing to performance improvements. Using flipped node features within the symmetric input mechanism further enhances model performance compared to simply repeating node features. In all datasets, combining the symmetric input mechanism with the structure-aware selective mechanism yields the best performance, demonstrating the effectiveness of our model framework.

Table 7.
Ablation Study on SAM Architecture.

CIFAR10 PATTERN MNIST Peptides-func

Model Accuracy $↑$ Accuracy $↑$ Accuracy $↑$ AP $↑$

SAM 0.7605 $\pm$ 0.0040 0.8697 $\pm$ 0.0023 0.9829 $\pm$ 0.0005 0.6796 $\pm$ 0.0073

w/o Reverse 0.7493 $\pm$ 0.0013 0.8615 $\pm$ 0.0064 0.9820 $\pm$ 0.0005 0.6757 $\pm$ 0.0056

w/o Sym. Input 0.7501 $\pm$ 0.0034 0.8677 $\pm$ 0.0020 0.9823 $\pm$ 0.0009 0.6747 $\pm$ 0.0078

w/o S.A.S 0.7374 $\pm$ 0.0040 0.8678 $\pm$ 0.0022 0.9781 $\pm$ 0.0006 0.6693 $\pm$ 0.0058

w/o S.A.S & Sym. Input 0.7394 $\pm$ 0.0015 0.8660 $\pm$ 0.0007 0.9776 $\pm$ 0.0004 0.6608 $\pm$ 0.0036

w/o MPNN 0.7116 $\pm$ 0.0048 0.8637 $\pm$ 0.0031 0.9719 $\pm$ 0.0005 0.6606 $\pm$ 0.0092

Highlighted are the top first results. S.A.S stands for structure-aware selection mechanism, and Sym. Input stands for symmetric inputmechanism.

	CIFAR10	PATTERN	MNIST	Peptides-func
SAM	0.7605 $\pm$ 0.0040	0.8697 $\pm$ 0.0023	0.9829 $\pm$ 0.0005	0.6796 $\pm$ 0.0073
w/o Reverse	0.7493 $\pm$ 0.0013	0.8615 $\pm$ 0.0064	0.9820 $\pm$ 0.0005	0.6757 $\pm$ 0.0056
w/o Sym. Input	0.7501 $\pm$ 0.0034	0.8677 $\pm$ 0.0020	0.9823 $\pm$ 0.0009	0.6747 $\pm$ 0.0078
w/o S.A.S	0.7374 $\pm$ 0.0040	0.8678 $\pm$ 0.0022	0.9781 $\pm$ 0.0006	0.6693 $\pm$ 0.0058
w/o S.A.S & Sym. Input	0.7394 $\pm$ 0.0015	0.8660 $\pm$ 0.0007	0.9776 $\pm$ 0.0004	0.6608 $\pm$ 0.0036
w/o MPNN	0.7116 $\pm$ 0.0048	0.8637 $\pm$ 0.0031	0.9719 $\pm$ 0.0005	0.6606 $\pm$ 0.0092

Figure 4.

Impact of the number of symmetric input layers (L) for SAM on CIFAR10 (left) and MNIST (right). An optimal balance is achieved at $L = 3$ , yielding higher accuracy with reduced computation. Over-parameterization degrades performance when exceeding the optimal number of layers.

Figure 5.

Impact of the number of symmetric input layers (L) on Peptides-func (left), Photo (middle), and Physics (right). An optimal balance is achieved at a moderate number of symmetric input layers, yielding higher accuracy with reduced computation. Performance degrades when the number of layers exceeds this optimal point due to over-parameterization.

4.6.2. Symmetric Input Layer Impact

We evaluated the impact of varying the number of symmetric input mechanism layers on the performance of SAM using the CIFAR10 and MNIST datasets. Specifically, we varied the number of symmetric input mechanism layers from 0 to 4 in both datasets, where $L = 0$ corresponds to a standard sequential processing and $L = 4$ indicates that all layers adopt the symmetric input mechanism. The results, shown in Figure 4 and Figure 5, indicate that incorporating the symmetric input mechanism significantly improves model accuracy compared to a fully standard sequential input ( $L = 0$ ). However, accuracy does not increase linearly with the number of symmetric input layers. An optimal balance is achieved at a moderate number of symmetric input layers, yielding higher accuracy with reduced computation.Performance degrades when the number of layers exceeds this optimal point due to over-parameterization. To further investigate the performance decrease observed when increasing the number of symmetric-input layers, we conducted additional experiments on the Photo dataset. Specifically, we measured the average gradient norm during training for $L = 1, 2,$ and $4$ symmetric-input layers. As shown in Figure 6, the gradient norms of all settings exhibit fluctuations between epochs 45 and 65. However, the magnitude of fluctuation increases with larger $L$ , indicating that deeper symmetric-input configurations lead to less stable optimization dynamics. This instability provides an explanation for the observed accuracy degradation when enabling the symmetric input mechanism beyond a moderate number of layers.

Figure 6.

Average gradient norm across training epochs on the Photo dataset with $L = 1, 2, 4$ symmetric-input layers. Larger $L$ leads to stronger fluctuations (notably between epochs 45–65), suggesting reduced optimization stability.

4.6.3. Role of Node Features in the Structure-Aware Selection Mechanism

To further investigate whether the selection mechanism benefits from direct access to node features, we evaluated two simplified variants: (i) using only positional/structural encodings (PE/SE) to compute the parameters of the structure-aware transformation module, and (ii) replacing node features with PE/SE in both the symmetric input mechanism and the Selection SSM. As shown in Table 8, both variants consistently underperform the original design across multiple datasets. This confirms that node features provide essential complementary information to PE/SE, and that their combination yields the strongest performance.

Table 8.
Ablation on Selection Mechanism Inputs: Node Features vs. Only PE.

PascalVOC-SP Peptides-func CIFAR10 PATTERN MNIST

Model F1 score $↑$ AP $↑$ Accuracy $↑$ Accuracy $↑$ Accuracy $↑$

Variant 1 0.3769 $\pm$ 0.0045 0.6783 $\pm$ 0.0054 0.7376 $\pm$ 0.0011 0.8670 $\pm$ 0.0008 0.9794 $\pm$ 0.0019

Variant 2 0.3808 $\pm$ 0.0073 0.6759 $\pm$ 0.0056 0.7348 $\pm$ 0.0043 0.8660 $\pm$ 0.0020 0.9805 $\pm$ 0.0012

SYM(ours) 0.4105 $\pm$ 0.0050 0.6796 $\pm$ 0.0073 0.7605 $\pm$ 0.0040 0.8697 $\pm$ 0.0023 0.9829 $\pm$ 0.0005

	PascalVOC-SP	Peptides-func	CIFAR10	PATTERN	MNIST
Variant 1	0.3769 $\pm$ 0.0045	0.6783 $\pm$ 0.0054	0.7376 $\pm$ 0.0011	0.8670 $\pm$ 0.0008	0.9794 $\pm$ 0.0019
Variant 2	0.3808 $\pm$ 0.0073	0.6759 $\pm$ 0.0056	0.7348 $\pm$ 0.0043	0.8660 $\pm$ 0.0020	0.9805 $\pm$ 0.0012
SYM(ours)	0.4105 $\pm$ 0.0050	0.6796 $\pm$ 0.0073	0.7605 $\pm$ 0.0040	0.8697 $\pm$ 0.0023	0.9829 $\pm$ 0.0005

Table 9.

Performance Comparison of Alternative Structure-Aware Encoders in SAM.

	PascalVOC-SP	Peptides-func	CIFAR10	PATTERN	MNIST	Computer	Photo	Physics
Model	F1 score $↑$	AP $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$	Accuracy $↑$
LapPE	0.3769 $\pm$ 0.0045	0.6783 $\pm$ 0.0054	0.7376 $\pm$ 0.0011	0.8670 $\pm$ 0.0008	0.9794 $\pm$ 0.0019	–	–	–
GCN	0.4215 $\pm$ 0.0139	0.6468 $\pm$ 0.0093	0.7356 $\pm$ 0.0012	0.8657 $\pm$ 0.0006	0.9788 $\pm$ 0.0005	0.9064 $\pm$ 0.0034	0.9484 $\pm$ 0.0028	0.9687 $\pm$ 0.0010
GIN	0.4037 $\pm$ 0.0187	0.6223 $\pm$ 0.0046	0.6579 $\pm$ 0.1132	0.8676 $\pm$ 0.0003	0.9788 $\pm$ 0.0003	0.7015 $\pm$ 0.1621	0.9377 $\pm$ 0.0022	0.9654 $\pm$ 0.0014
GatedGCN	0.4105 $\pm$ 0.0050	0.6796 $\pm$ 0.0073	0.7605 $\pm$ 0.0040	0.8697 $\pm$ 0.0023	0.9829 $\pm$ 0.0005	0.9161 $\pm$ 0.0051	0.9559 $\pm$ 0.0037	0.9722 $\pm$ 0.0019

4.6.4. Ablation on the Structure-Aware Encoder

Since the structure-aware encoder is a key component of SAM, we further investigated its design by comparing different alternatives. In our default design, we adopt GatedGCN, as it naturally integrates both node and edge features with a gating mechanism that strengthens message passing. To assess its effectiveness, we replaced it with three alternatives: GCN, GIN, and Laplacian positional encodings (LapPE).

As shown in Table 9, all encoders provide useful structural signals, but GatedGCN achieves the strongest or near-strongest performance across most datasets. This confirms our design choice. Moreover, GatedGCN better captures node roles (e.g., hubs, bridges) and topological patterns (e.g., motifs, substructures), which enrich the sequential modeling in Mamba. These results highlight the benefits of explicitly incorporating a powerful structural encoder, endowing SAM with a strong inductive bias toward structure-awareness.

4.6.5. Ablation on Graph-to-Sequence Conversion

Since Mamba requires sequential inputs, we investigated the impact of various graph-to-sequence conversion strategies. In our default design, we adopt a random node ordering, which avoids bias from deterministic heuristics and encourages the model to learn permutation-robust representations. To assess the effect of ordering, we further compared degree-based strategies (ascending and descending). As shown in Table 10, random ordering consistently achieves better performance across multiple datasets, while degree-based orderings result in noticeable degradation. This confirms that random ordering, combined with the symmetric input mechanism, provides effective robustness against node permutations.

Table 10.
Performance Comparison With Alternative Orderings.

Peptides-func CIFAR10 PATTERN Computer Photo

Model AP $↑$ Accuracy $↑$ Accuracy $↑$ Accuracy $↑$ Accuracy $↑$

Degree ascending 0.6484 $\pm$ 0.0101 0.7576 $\pm$ 0.0022 0.8653 $\pm$ 0.0010 0.9138 $\pm$ 0.0033 0.9431 $\pm$ 0.0047

Degree descending 0.6675 $\pm$ 0.0004 0.7574 $\pm$ 0.0039 0.8672 $\pm$ 0.0021 0.9146 $\pm$ 0.0020 0.9543 $\pm$ 0.0053

Random 0.6796 $\pm$ 0.0073 0.7605 $\pm$ 0.0040 0.8697 $\pm$ 0.0023 0.9161 $\pm$ 0.0051 0.9559 $\pm$ 0.0037

	Peptides-func	CIFAR10	PATTERN	Computer	Photo
Degree ascending	0.6484 $\pm$ 0.0101	0.7576 $\pm$ 0.0022	0.8653 $\pm$ 0.0010	0.9138 $\pm$ 0.0033	0.9431 $\pm$ 0.0047
Degree descending	0.6675 $\pm$ 0.0004	0.7574 $\pm$ 0.0039	0.8672 $\pm$ 0.0021	0.9146 $\pm$ 0.0020	0.9543 $\pm$ 0.0053
Random	0.6796 $\pm$ 0.0073	0.7605 $\pm$ 0.0040	0.8697 $\pm$ 0.0023	0.9161 $\pm$ 0.0051	0.9559 $\pm$ 0.0037

4.7. Comparison With Direct Fusion in Bidirectional Mamba

To further clarify the difference between our symmetric input mechanism and the direct fusion strategy used in bidirectional Mamba (Behrouz & Hashemi, 2024; Liu et al., 2024b; Zhu et al., 2024), we conducted a controlled ablation experiment. In bidirectional Mamba, the input sequence is independently processed in both forward and backward directions, and the resulting representations are directly summed to obtain the final output. This rigid summation treats both directions equally, regardless of their contextual relevance. For fairness, we replaced our mechanism with direct fusion while keeping all other components unchanged. As shown in Table 1, direct fusion consistently underperforms our method across all datasets, confirming that rigid summation degrades performance, while our mechanism achieves adaptive and more effective fusion.

5. Conclusion

This work proposes a framework for graph representation learning based on selective state-space models, representing an attempt to apply selective state-space models, exemplified by Mamba, to graph tasks. Our work demonstrates how selective state-space models can be adapted for graph tasks by processing input sequences and structural features. Experiments on two datasets from LRGB , two datasets from Benchmarking GNNs and large graph datesets show that our method achieves excellent performance, surpassing most MPNN and GT-based methods. Additionally, our method shows superior time efficiency compared to the quadratic complexity of GTs. The experimental results validate the modeling capabilities and high efficiency of our model, indicating that our framework has the potential to set a research trend for next-generation graph representation learning methods.

Footnotes

Acknowledgments

This research was supported by the National Natural Science Foundation of China (grant no. 62272399), the Key Program Foundation of Fujian Province, China (grant no. 2021J02006), and the Fundamental Research Funds for the Central Universities (grant nos. 20720220005 and 20720220006)

ORCID iD

Xiaoping Min

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China (grant no. 62272399), the Key Program Foundation of Fujian Province, China (grant no. 2021J02006), and the Fundamental Research Funds for the Central Universities (grant nos. 20720220005 and 20720220006).

Declaration of Conflicting Interest

The authors declared no potential conflicts of interest with respect to the research, authorship, and publication of this article.

Data Availability Statement

All datasets used during the current study are public, and obtained from the followings URLs: https://mal-net.org/, https://github.com/vijaydwivedi75/lrgb, https://github.com/graphdeeplearning/benchmarking-gnns and .

References

Abboud

Dimitrov

Ceylan

I. I.

(2022). Shortest path networks for graph property prediction. In Learning on graphs conference.

Azabou

Ganesh

Thakoor

Lin

C. H.

Sathidevi

Liu

Valko

Veličković

Dyer

E. L.

(2023). Half-hop: A graph upsampling approach for slowing down message passing. In International conference on machine learning.

Bai

Yao

Wang

(2020). Adaptive graph convolutional recurrent network for traffic forecasting. Advances in Neural Information Processing Systems, 33, 17804–17815.

Barbero

Velingker

Saberi

Bronstein

M. M.

Giovanni

F. D.

(2024). Locality-aware graph rewiring in GNNs. In The Twelfth international conference on learning representations.

Behrouz

Hashemi

(2024). Graph mamba: Towards learning on graphs with state space models. arXiv preprint arXiv:2402.08678.

Bresson

Laurent

(2017). Residual gated graph convnets. arXiv preprint arXiv:1711.07553.

Busch

Seidl

(2020). Pushnet: Efficient and adaptive neural message passing. In ECAI 2020. IOS Press.

Chen

Lin

Zhou

Sun

(2020). Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34).

Chen

Gao

(2023). NAGphormer: A tokenized graph transformer for node classification in large graphs. In The Eleventh international conference on learning representations.

10.

Deac

Lackenby

Veličković

(2022). Expander graph propagation. In Learning on graphs conference.

11.

Diao

Loynd

(2023). Relational attention: Generalizing transformers for graph-structured tasks. In The Eleventh international conference on learning representations.

12.

Di Giovanni

Giusti

Barbero

Luise

Lio

Bronstein

M. M.

(2023). On over-squashing in message passing neural networks: The impact of width, depth, and topology. In International conference on machine learning.

13.

Dwivedi

V. P.

Joshi

C. K.

Luu

A. T.

Laurent

Bengio

Bresson

(2023). Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43), 1–48.

14.

Dwivedi

V. P.

Luu

A. T.

Laurent

Bengio

Bresson

(2022a). Graph neural networks with learnable structural and positional representations. In International conference on learning representations.

15.

Dwivedi

V. P.

Rampášek

Galkin

Parviz

Wolf

Luu

A. T.

Beaini

(2022b). Long range graph benchmark. In Thirty-sixth Conference on neural information processing systems datasets and benchmarks track.

16.

Errica

Christiansen

Zaverkin

Maruyama

Niepert

Alesiani

(2023). Adaptive message passing: A general framework to mitigate oversmoothing, oversquashing, and underreaching. arXiv preprint arXiv:2312.16560.

17.

Faber

Wattenhofer

(2024). Gwac: Gnns with asynchronous communication. In Learning on graphs conference.

18.

Finkelshtein

Huang

Bronstein

M. M.

Ceylan

I. I.

(2024). Cooperative graph neural networks. In Forty-first international conference on machine learning.

19.

Gilmer

Schoenholz

S. S.

Riley

P. F.

Vinyals

Dahl

G. E.

(2017). Neural message passing for quantum chemistry. In International conference on machine learning.

20.

Dao

(2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

21.

Goel

Gupta

Ré

(2022a). On the parameterization and initialization of diagonal state space models. In A. H. Oh, A. Agarwal, D. Belgrave, & K. Cho (Eds.), Advances in neural information processing systems.

22.

Goel

(2022b). Efficiently modeling long sequences with structured state spaces. In International conference on learning representations.

23.

Johnson

Goel

Saab

K. K.

Dao

Rudra

(2021). Combining recurrent, convolutional, and continuous-time models with linear state space layers. In A. Beygelzimer, Y. Dauphin, P. Liang, & J. W. Vaughan (Eds.), Advances in neural information processing systems.

24.

Gutteridge

Dong

Bronstein

M. M.

Di Giovanni

(2023). Drew: Dynamically rewired message passing with delay. In International conference on machine learning.

25.

Hooi

Laurent

Perold

LeCun

Bresson

(2023). A generalization of vit/mlp-mixer to graphs. In International conference on machine learning.

26.

Karhadkar

Banerjee

P. K.

Montufar

(2023). FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. In The Eleventh international conference on learning representations.

27.

Kipf

T. N.

Welling

(2017). Semi-supervised classification with graph convolutional networks. In International conference on learning representations.

28.

Kong

Chen

Kirchenbauer

Bruss

C. B.

Goldstein

(2023). Goat: A global transformer on large-scale graphs. In International conference on machine learning.

29.

Kreuzer

Beaini

Hamilton

Létourneau

Tossou

(2021). Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34, 21618–21629.

30.

Wang

Qiao

(2024a). Videomamba: State space model for efficient video understanding. arXiv preprint arXiv:2403.06977.

31.

Wang

Zhang

Coster

(2024b). Stg-mamba: Spatial-temporal graph learning via selective state space model. arXiv preprint arXiv:2403.12418.

32.

Liang

Zhou

Wang

Zhu

Zou

Bai

(2024). Pointmamba: A simple state space model for point cloud analysis. arXiv preprint arXiv:2402.10739.

33.

Liu

Yao

Zhan

Pan

(2024a). Gradformer: Graph transformer with exponential decay. arXiv preprint arXiv:2404.15729.

34.

Liu

Tian

Zhao

Xie

Wang

Liu

(2024b). Vmamba: Visual state space model. arXiv preprint arXiv:2401.10166.

35.

Lin

Lim

Romero-Soriano

Dokania

P. K.

Coates

Torr

Lim

S. N.

(2023). Graph inductive biases in transformers without message passing. In International conference on machine learning.

36.

Min

Chen

Bian

Zhao

Huang

Zhao

Huang

Ananiadou

Rong

(2022). Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455.

37.

Park

Ryu

Kim

Woo

Yun

S. Y.

Ahn

(2023). Non-backtracking graph neural networks. In NeurIPS 2023 Workshop: New frontiers in graph learning.

38.

Qian

Manolache

Ahmed

Zeng

den Broeck

G. V.

Niepert

Morris

(2024). Probabilistically rewired message-passing neural networks. In the twelfth international conference on learning representations.

39.

Rampášek

Galkin

Dwivedi

V. P.

Luu

A. T.

Wolf

Beaini

(2022). Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35, 14501–14515.

40.

Shchur

Mumme

Bojchevski

Günnemann

(2018). Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868.

41.

Shirzad

Velingker

Venkatachalam

Sutherland

D. J.

Sinop

A. K.

(2023). Exphormer: Sparse transformers for graphs. In International conference on machine learning.

42.

Sonthalia

Gilbert

Durham

(2023). Relwire: Metric based graph rewiring. In NeurIPS 2023 Workshop on symmetry and geometry in neural representations.

43.

Topping

Giovanni

F. D.

Chamberlain

B. P.

Dong

Bronstein

M. M.

(2022). Understanding over-squashing and bottlenecks on graphs via curvature. In International conference on learning representations.

44.

Vaswani

Shazeer

Parmar

Uszkoreit

Jones

Gomez

A. N.

Kaiser

Ł.

Polosukhin

(2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008.

45.

Wang

Tsepa

Wang

(2024). Graph-mamba: Towards long-range graph sequence modeling with selective state spaces. arXiv preprint arXiv:2402.00789.

46.

Wang

Liu

Kurtin

(2023). Learning hierarchical protein representations via complete 3D graph networks. In the eleventh international conference on learning representations.

47.

Zhao

Wipf

D. P.

Yan

(2022). Nodeformer: A scalable graph structure learning transformer for node classification. Advances in Neural Information Processing Systems, 35, 27387–27401.

48.

Xing

Wang

Huang

Shi

(2024). Less is more: On the over-globalizing problem in graph transformers. arXiv preprint arXiv:2405.01102.

49.

Leskovec

Jegelka

(2019). How powerful are graph neural networks? In International conference on learning representations.

50.

Yang

Sun

Zhao

W. X.

Liu

Chang

E. Y.

(2017). A neural network approach to jointly modeling social networks and mobile trajectories. ACM Transactions on Information Systems, 35(4), 1–28.

51.

Yun

Jeong

Kim

Kang

Kim

H. J.

(2019). Graph transformer networks. Advances in Neural Information Processing Systems, 32, 11983–11993.

52.

Zaheer

Guruganesh

Dubey

K. A.

Ainslie

Alberti

Ontanon

Pham

Ravula

Wang

Yang

Ahmed

(2020). Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 17283–17297.

53.

Zhu

Liao

Zhang

Wang

Liu

Wang

(2024). Vision mamba: Efficient visual representation learning with bidirectional state space model. In Forty-first international conference on machine learning.

Structure-Aware Mamba: A Linear-Complexity Model for Graph Representation Learning With Global Receptive Fields

Abstract

Keywords

1. Introduction

2.1. Message Passing Neural Networks

2.2. Graph Transformers

2.2.1. Transformers Costs

2.3. SSMs

3. Method

3.1. Preliminaries

4. Experiment

4.1. Experiment Details

4.1.1. Dataset Description

4.1.3. Hyperparameters

4.6.1. Component Ablation

4.6.5. Ablation on Graph-to-Sequence Conversion

5. Conclusion

Footnotes

Acknowledgments

ORCID iD

Funding

Declaration of Conflicting Interest

Data Availability Statement

References