Abstract
Underwater sensor networks have recently emerged as a promising networking technique for various underwater applications. However, acoustic routing in underwater sensor networks presents challenges in terms of dynamic network structure, high energy consumption, long propagation delay, and narrow bandwidth. It is therefore difficult to adapt traditional routing protocols, which are known to be reliable in terrestrial wireless networks. In this study, we focus on the development of novel routing algorithms to tackle acoustic transmission problems in underwater sensor networks. The proposed scheme is based on reinforcement learning and game theory and is designed as a routing game model that provides an effective packet-forwarding mechanism. In particular, our Q-learning game paradigm captures the dynamics of the underwater sensor network system in a decentralized, distributed manner. The results of a performance simulation analysis show that the proposed scheme outperforms existing schemes while displaying balanced system performance in terms of energy efficiency and network throughput.
Introduction
Over 70% of the earth's surface is covered with water in the form of rivers, canals, seas, and oceans. Therefore, underwater acoustic transmission has recently attracted much attention because of its significant abilities in distributed tactical surveillance, disaster prevention, mine reconnaissance, and environmental monitoring. Recent advances in communication technologies have led to the development of underwater sensor networks (USNs). Based on technologies that enable underwater exploration, USNs consist of sink nodes and autonomous micromechanical sensors. In USNs, spatially distributed sensors record water-related information; the sensors are connected wirelessly through acoustic signals in the underwater environment. Sink nodes on the surface collect data from the underwater sensors and transmit this information to a monitoring center via satellite for further analysis.1–3
Routing is a fundamental issue in any network. In wireless sensor networks, routing protocols play an important role in the efficient transmission of data from the source to destination nodes. In general, network performance is strongly related to the routing algorithm; therefore, considerable efforts have been made to find adaptive network routing protocols. Ground-based terrestrial sensor networks have been comprehensively investigated, and numerous communication protocols have been proposed for such networks.4–6 Although there are broad similarities between USNs and terrestrial sensor networks, certain characteristics of USNs mean that traditional routing concepts cannot be applied directly to the design of USN routing algorithms. To develop an effective USN routing algorithm, the high absorption factor, long propagation delay, limited bandwidth, and high bit-error ratio of the aquatic environment must be taken into account. 2
Existing routing algorithms can be classified according to their design objectives and routing scenarios. For instance, an algorithm may operate in either a static offline or dynamic online manner. In addition, operational methods can be classed as centralized or distributed paradigms. In general, it is not practical to use a centralized controller in large-scale sensor networks. Moreover, static algorithms cannot significantly improve USN performance under dynamically changing acoustic environments. Control decisions must be made in real time without any knowledge of future information. Based on these considerations, distributed and dynamic routing methodologies are best suited to large-scale USN operations.6,7
The aim of this article is to propose a new routing scheme for USNs by addressing various issues concerning the harsh underwater environment, most notably the energy efficiency, long propagation delay, dynamic topology changes, and node localization of aquatic operations. To overcome these issues, the proposed algorithm must balance the overall USN performance. In addition, for real-world USN applications, we focus on design principles such as feasibility, self-adaptability, and effectiveness in providing a desirable routing solution. Although several USN routing schemes have been proposed, no systematic routing study based on an integrated approach has been conducted. Existing research indicates that this is an extremely challenging issue.
In this study, we adopt a game-theoretic learning algorithm to handle point-to-point routing decisions. Game theory typically considers independent decision-making players who attempt to reach a joint decision that is acceptable to all players under a given conflict/cooperation environment. In the proposed scheme, each sensor node is assumed to be a game player who makes routing decisions in a distributed online manner to adapt to the highly dynamic USN changes.
The main concept behind the proposed game is the extension of the well-known Markov decision problem to the multiplayer routing game model. To select the best strategy, each player in our game learns the routing environment according to a reinforcement learning algorithm. In particular, we demonstrate that insights from game theory can be used to derive a distributed Q-learning algorithm. In general, learning can be defined as the ability to make intelligent decisions by taking into account the past and present system states. 7
This article makes the following contributions: (1) we develop a novel routing scheme for USNs; (2) we integrate game theory and the Q-learning algorithm to handle routing decisions; (3) we adopt a distributed online approach to implement self-adaptability and real-time effectiveness; (4) we design a routing algorithm that balances contradictory requirements; and (5) the probabilities for routing decisions are initially determined based on the current USN condition. The significance of our proposed scheme lies in its responsiveness to current USN system conditions while considering the dynamic underwater environment. To the best of our knowledge, there has been very little research on game-based distributed learning algorithms for USN routing problems.
The remainder of this article is organized as follows: In the next section, we review some related routing schemes and their problems. Then, we define the routing game model and Q-learning mechanism implemented in this study before providing a detailed description of the proposed USN routing algorithms. In particular, this section provides fresh insights into the benefits and design of a learning game-based routing approach. For convenience, the main steps of the proposed scheme are then listed before we validate the performance of the proposed scheme by means of a comparison with some existing methods. Finally, we present our conclusion and discuss ideas for future work.
Related work
There has been considerable research into the design of USN routing protocols. The energy-efficient depth-based routing (EEDBR) scheme 3 performs energy balancing and reduces the number of sensor node transmissions in order to prolong the network lifetime. In this scheme, packets are transmitted between a sensor node and a sink by selected nodes, which are chosen on the basis of depth and residual energy. The EEDBR scheme consists of two phases: knowledge acquisition and forwarding. During the knowledge acquisition phase, each sensor node shares depth and residual energy information with its neighbors. During the forwarding phase, the sender includes a list of forwarding nodes based on depth and residual energy. Unfortunately, some unnecessary forwarding means that the stability period of this scheme is short-lived and there is a high load on low-depth nodes.2,3
Depth-based routing (DBR) is a novel protocol for transmitting packets toward the sink. 8 Based on the depth information of each sensor, the DBR scheme forwards data packets greedily toward the water surface. Each data packet contains a field that records the depth information of its most recent forwarder, which is updated at every hop. On receiving a packet, the node checks whether its depth is less than that of the previous forwarder; if so, it forwards the packet; otherwise, it discards the packet. If there are multiple data sinks deployed at the water surface, DBR can naturally take advantage of them. The main advantages of the DBR scheme are as follows: (1) it does not require full-dimensional location information, (2) it can handle dynamic networks with high energy efficiency, and (3) it can take advantage of the multiple-sink network architecture without incurring additional overheads. 8
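The greedy depth-comparison rule that DBR applies at every hop can be sketched as follows. This is only an illustrative sketch of the rule described above, not the protocol's actual implementation; the packet field name is hypothetical.

```python
# Illustrative sketch of the DBR forwarding rule: a node forwards a packet
# only if it is shallower than the packet's most recent forwarder.

def dbr_forward(node_depth, packet):
    """Return True (and update the packet's depth field) if this node
    should forward the packet toward the surface; False to discard it."""
    if node_depth < packet["last_forwarder_depth"]:
        packet["last_forwarder_depth"] = node_depth  # updated at every hop
        return True   # forward toward the water surface
    return False      # deeper than the previous forwarder: discard

packet = {"last_forwarder_depth": 450.0}
print(dbr_forward(300.0, packet))  # shallower node: forwards
print(dbr_forward(500.0, packet))  # deeper node: discards
```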
The cooperative depth-based routing (CoDBR) scheme 9 is a unique routing protocol for mission-critical applications. In this scheme, all nodes exchange depth information when the network is initialized. The source node registers its neighbors in a list. Relay nodes are selected based on the minimum-depth neighbors in the list, and data are transmitted through the established path to the sink. Cooperation is employed at the network layer, thus increasing the reliability and throughput of the network. In CoDBR, this high reliability and improved throughput are achieved at the cost of high end-to-end delay.2,9
The data gathering–based node coordination (DGNC) scheme 10 focuses on sparsely distributed USNs and a path-controllable autonomous underwater vehicle (AUV), with no time synchronization between the two. The DGNC scheme prolongs the lifetime of a sensor network using the AUV’s mobility to achieve the shortest wake-up time in the sensor nodes. In this scheme, the concept of received signal strength power control is used to modify the transmission power, which increases the communication reliability and decreases energy consumption. The DGNC scheme operates in two phases: connection and data transmission. Furthermore, the signal strength and signal-to-noise ratio (SNR) information provided by the acoustic modem can be used to apply adaptive power control during the transmission of data packets, thus conserving energy by obtaining a preferred communication position between the AUV and the sensor nodes.2,10
In the cooperative best relay assessment (COBRA) scheme, 11 a new best relay selection criterion is employed to minimize the one-way packet transmission (OPT) time, which is defined as the packet transmission time plus propagation delays. The COBRA criterion can take both the propagation delay and the channel state of every potential relay into account, enabling the overall OPT time to be minimized when the total power and outage probability limitations are satisfied. A best relay selection algorithm has been proposed based on the COBRA criterion. This algorithm requires only statistical information about the channel, rather than the instantaneous channel state. The COBRA scheme achieves improved results in terms of network throughput and delivery ratio but incurs high energy consumption.2,11
The void-aware pressure routing (VAPR) scheme 12 for USNs is an efficient anycast routing protocol for underwater data collection. In particular, the VAPR scheme uses surface reachability information to establish each node’s next-hop direction while taking advantage of geo-opportunistic forwarding. This approach is very robust to network dynamics such as node mobility and failure. The VAPR scheme does not require any additional recovery path maintenance and is not affected by the hop stretch caused by recovery fallbacks in existing solutions. Instead, the VAPR scheme exploits periodic beaconing to build directional trails toward the surface and features greedy, opportunistic directional forwarding for packet delivery. 12
The geographic and opportunistic routing with depth adjustment–based topology control for communication recovery over void regions (GDAR) scheme 13 is an anycast, geographic, and opportunistic routing protocol that routes data packets from sensor nodes to multiple sinks at the sea surface. The GDAR scheme uses location information of neighbor nodes and certain known sinks to select a next-hop forwarder set of neighbors to continue forwarding the packet toward the destination. To avoid unnecessary transmissions, low-priority nodes suppress their transmissions whenever they detect that the same packet has been sent by a high-priority node. When a node is in a communication void, the GDAR scheme switches to the recovery mode. This mode employs topological control by adjusting the depth of the void nodes, unlike traditional approaches that use control messages to discover and maintain routing paths along void regions. Therefore, the most significant aspect of the GDAR is its novel void node recovery methodology. 13
The energy-efficient and lifetime-aware routing (ELAR) scheme 14 is an innovative distributed routing protocol for USNs. In this scheme, the nodes learn the environment in order to take optimal actions and gradually improve the performance of the entire network. The ELAR scheme can easily be tuned to trade latency or energy efficiency for an extended lifetime and can thus be used in various applications. The major contributions of the ELAR scheme are (1) applying the Q-learning technique in a distributed routing protocol for USNs, (2) balancing the workload among sensor nodes for a longer network lifetime, (3) reducing the networking overhead for higher energy efficiency, and (4) learning the environment effectively to improve adaptability in dynamic networks. 14
The work by Sandeep and Kumar 15 provides a broad view of existing clustering, coverage, and connectivity algorithms based on acoustic communication. Their article also offers useful guidance to USN researchers from the perspective of various other communication techniques. Albarakati et al. 16 developed a set of underwater embedded system architectures that can handle different network configurations. The idea is to have a dynamic architecture that is configured according to network parameters to achieve the best performance in terms of end-to-end delay and power consumption. Finally, an architecture is selected to match a given set of requirements, including data rate, processing node capabilities, gathering node capabilities, and water depth. In Azam et al., 17 a new balanced load distribution scheme is proposed to avoid the energy holes created by unbalanced energy consumption in USNs. This scheme is specifically designed to solve the energy hole problem that arises when a node cannot find a forwarder node in the next corona to reach the sink.
Some earlier studies12–14 have attracted considerable attention for addressing the unique challenges of USN routing problems. In this article, we demonstrate that our proposed scheme significantly outperforms these existing schemes. Table 1 lists the advantages and disadvantages of the existing approaches.
EEDBR: energy-efficient depth-based routing; DBR: depth-based routing; CoDBR: cooperative depth-based routing; DGNC: data gathering–based node coordination; COBRA: cooperative best relay assessment; VAPR: void-aware pressure routing; GDAR: geographic and opportunistic routing with depth adjustment–based topology control for communication recovery over void regions; ELAR: energy-efficient and lifetime-aware routing.
Routing game model and Q-learning mechanism
In this section, we provide a brief introduction to the game theory model and general reinforcement learning mechanism, which form the theoretical basis of the proposed USN routing scheme. By adopting the Q-learning-based routing game, our scheme designs an anycast, opportunistic routing protocol to deliver packets from a sensor node to the sink node.
Routing game model
During the operation of a USN system, the sensor nodes make individual routing decisions by considering the end-to-end delay, hop count, packet error rate, and energy efficiency. Under dynamic acoustic conditions, the sensor nodes attempt to maximize their own profit in a distributed online manner based on the learning approach. In this study, we assume that the sensor nodes are game players that make rational routing decisions. Based on this assumption, we develop a new game model, known as the routing game, for the USN system. Formally, we define the routing game model
The array
Reinforcement learning mechanism
Reinforcement learning concerns how system agents should take actions to maximize their reward. Reinforcement learning algorithms establish a balance between the exploration of uncharted territory and the exploitation of current knowledge based on online performance. In game theory, reinforcement learning is used to explain how an effective solution may arise under bounded rationality. Consequently, game-theoretic learning algorithms have become an increasingly important area of research in recent years.7,14,18
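The exploration/exploitation balance mentioned above is commonly realized with an ε-greedy action rule: with a small probability the agent explores a random action, and otherwise exploits the action with the highest learned value. A minimal sketch (the ε value is an assumed parameter, not one specified by the article):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick an action index: explore uniformly at random with
    probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))          # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation
```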
The environment is typically formulated as a Markov decision process (MDP), which is the basic framework underlying reinforcement learning. The MDP is a mathematical framework for modeling decision-making processes and is useful for optimization problems that are solved via learning algorithms.14,18 In the current scenario, the MDP consists of the set of states (
where
where
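For reference, the Bellman optimality equation that underlies MDP formulations of this kind can be written in its canonical form as follows; the symbols here are the standard ones and may differ from the article's own notation:

```latex
V^{*}(s) = \max_{a \in A(s)} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, V^{*}(s') \Big]
```

where $R(s,a)$ is the immediate reward for taking action $a$ in state $s$, $P(s' \mid s,a)$ is the state transition probability, and $\gamma \in [0,1)$ is the discount factor.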
To solve equation (3), reinforcement learning algorithms can be used. In this study, we adopt a well-known reinforcement learning technique called Q-learning. Q-learning works by successively improving its evaluations of the quality of particular actions at particular states. In addition, Q-learning provides agents with the ability to learn to act optimally in Markovian domains by experiencing the consequences of their actions but does not require the agents to build maps of the domains.14,18 In the Q-learning algorithm, the value of state–action pairs is given by
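The canonical Q-learning update that this paragraph refers to can be sketched as follows. The learning rate α and discount factor γ are assumed parameters, and the article's own equation (4) may differ in detail:

```python
def q_update(Q, s, a, reward, s_next, actions, alpha=0.5, gamma=0.9):
    """One canonical Q-learning step:
    Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).
    Q is a dict keyed by (state, action) pairs, defaulting to 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = ((1 - alpha) * Q.get((s, a), 0.0)
                 + alpha * (reward + gamma * best_next))
    return Q[(s, a)]
```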
According to equations (3) and (4), we can compute
In equation (5),
where
Proposed USN routing scheme
In this section, the proposed USN routing scheme is explained in detail. The scheme consists of configuration and routing algorithms. The configuration algorithm effectively forms the network topology based on the current network condition by considering the distance and energy status. The routing algorithm establishes the one-hop behavior for transferring routing packets.
Configuration algorithm for underwater network topology
The key concept of the configuration algorithm is to configure the sensor network topology. Each sensor node maintains routing information in a
where
For the adaptive
During network topology formation, each individual node estimates the
where
When the proposed configuration algorithm has terminated, the virtual topology of the USN can be considered as a spanning tree structure that is rooted at the sink node while minimizing the
Routing algorithm based on the Markov routing game
In designing the USN routing algorithm, we consider the multiplayer MDP. Each sensor node is assumed to be a player, and the objective of player
By employing Q-learning, the main goal of the USN routing algorithm is to deliver the packet to the sink node with the maximum payoff. If the
where
where
Main steps of the proposed USN routing algorithm
USNs have gained popularity for their application to environmental monitoring, military surveillance, and disaster prevention. Although some USN routing work has been conducted, existing schemes are strongly specialized for specific control issues. Therefore, it is challenging to obtain balanced system performance. In this article, we discuss a new perspective on USN routing problems. Using the Q-learning-based routing game model, we design a novel USN routing scheme through a step-by-step interactive feedback process. This allows each sensor node to learn the current situation and determine the best routing path. After a sequence of routing actions has been performed, the current USN condition dynamically changes. During the routing operation, each node periodically updates the routing information, reevaluates the current strategy and selects one of its neighbor nodes for packet forwarding. In terms of practical operations, we can transfer the computational burden from a central system to distributed nodes. Therefore, our distributed learning-based routing scheme is implemented as a dynamic repeated game for opportunistic routing. While traditional optimal solutions generally exhibit exponential time complexity, our proposed scheme operates in polynomial time. The main steps of the proposed USN routing algorithm are described as follows (see Figures 1 and 2):

Figure 1. Flowchart of the proposed USN routing algorithm.

Figure 2. Pseudo code of the proposed USN routing algorithm.
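As a concrete illustration of the per-node decision loop described above, a minimal runnable sketch is given below. Neighbor identifiers, the reward signal, and the parameter values are all hypothetical, and the per-step logic is the canonical distributed Q-learning pattern rather than the article's exact pseudo code:

```python
import random

def select_next_hop(Q, node, neighbors, epsilon=0.1, rng=random):
    """Each node periodically picks a forwarding neighbor: explore
    occasionally, otherwise exploit the best learned Q-value."""
    if rng.random() < epsilon:
        return rng.choice(neighbors)
    return max(neighbors, key=lambda n: Q.get((node, n), 0.0))

def update_route(Q, node, next_hop, reward, downstream_neighbors,
                 alpha=0.5, gamma=0.9):
    """Feedback step: fold the observed reward (e.g. delivery success or
    energy cost) and the next hop's best estimate into Q(node, next_hop)."""
    best = max((Q.get((next_hop, n), 0.0) for n in downstream_neighbors),
               default=0.0)
    Q[(node, next_hop)] = ((1 - alpha) * Q.get((node, next_hop), 0.0)
                           + alpha * (reward + gamma * best))
```

Because each node stores only Q-values for its own neighbors and updates them from local feedback, the computational burden stays distributed, as the scheme intends.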
Table 2. System parameters used in the simulation experiments.
Performance evaluation
In this section, we evaluate the performance of our proposed protocol and compare it with that of the VAPR, 12 GDAR, 13 and ELAR 14 schemes. The simulation environment was set up as follows:
A 500-node USN was used in which the nodes were distributed randomly in a three-dimensional (3D) region of size 1000 m × 1000 m × 1000 m.
No assumptions were made about the node dispersion or density in the acoustic field.
The maximum wireless coverage range of each node was set to 100 m.
The meandering current mobility model 19 was adopted to model the mobility of each sensor node.
Rayleigh fading was used to model small-scale fading.
We used the carrier-sense multiple access (CSMA) media access control (MAC) protocol. In CSMA, when the channel is busy, a node waits for a predefined back-off period before attempting to sense the carrier again.
For reliability, we implemented automatic repeat-requests (ARQ) at the routing layer. After packet reception, the receiver sends back a short acknowledgment (ACK) packet. If the sender fails to hear the ACK packet, the data packet is retransmitted; the packet will be dropped after three retransmissions.
We used five sink nodes, each positioned at random on the surface.
Data packets were generated at the source node at a rate of
At the start of the simulation, all nodes had an initial energy (
For simplicity, we assumed that the nodes were not affected by noise or physical obstacles.
The energy dissipation in the transmitter node (
where
The energy dissipation for the receiver node (
Network performance measures obtained on the basis of 100 simulation runs are plotted as functions of the packet generation rate (packets/s).
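The ARQ behavior assumed above, one initial transmission plus up to three retransmissions before the packet is dropped, can be sketched as follows; the `transmit` callback is a hypothetical stand-in for the physical send-and-wait-for-ACK step:

```python
def send_with_arq(transmit, packet, max_retries=3):
    """Stop-and-wait ARQ: retransmit until an ACK is heard,
    dropping the packet after max_retries retransmissions."""
    for attempt in range(1 + max_retries):  # 1 initial try + 3 retries
        if transmit(packet):                # True means ACK was received
            return True                     # delivered
    return False                            # packet dropped
```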
Rayleigh fading is a statistical model for the effect of a propagation environment on a radio signal, such as that used by wireless devices. It is most applicable when there is no dominant line-of-sight propagation between the transmitter and receiver; if there is a dominant line of sight, Rician fading may be more applicable. Rayleigh fading is a special case of two-wave with diffuse power (TWDP) fading. One limitation of Rayleigh fading is the simplicity of its assumptions about dynamic wireless environments. In our simulation, we adopted the Rayleigh fading model for simplicity.
To demonstrate the validity of our proposed method, we measured the normalized energy consumption, network throughput, end-to-end delay, and packet loss ratio. To emulate the USN system and ensure a fair comparison, all schemes used the system parameters given in Table 2.
Figure 3 compares the packet loss ratio of each scheme. Packet loss is measured as the percentage of packets lost relative to the total number sent and is a key factor in USN network operation. As the packet generation rate increases, the resulting network congestion increases packet loss. All schemes exhibit a similar trend; however, the proposed scheme outperforms the existing methods, particularly under heavy traffic loads. In our scheme, packets are forwarded under an interactive environmental feedback mechanism, which guarantees a lower packet loss ratio than in other schemes.

Figure 3. Packet loss ratio.
Figure 4 presents the normalized energy consumption per node for each scheme. Energy consumption increases with the packet generation rate, which is intuitively correct. In USNs, the main factor contributing to energy consumption is the communication distance. Therefore, it is important to determine the multiple-hop shortest path. Based on the online learning scheme, each node in our method is able to select the most energy-efficient routing path. Therefore, the proposed scheme attains superior energy efficiency to other schemes, from low to high traffic load intensities.

Figure 4. Normalized energy consumption.
Figure 5 compares the network throughput. In this study, network throughput is defined as the ratio of data packets received at the sink node to the total number of data packets generated. The gain in network throughput achieved by the proposed scheme is a result of the effective feedback paradigm of employing an iterative learning model and the algorithm design for obtaining synergistic and complementary features. Therefore, the proposed scheme achieves superior network throughput performance to the existing schemes, which were designed as one-sided protocols and do not respond to current USN conditions.

Figure 5. Network throughput.
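The two ratio metrics defined in this section, packet loss as a percentage of packets sent and throughput as the fraction of generated packets that reach the sink, reduce to simple ratios. A sketch, with argument names chosen here for illustration:

```python
def packet_loss_ratio(sent, received):
    """Percentage of packets lost relative to the total number sent."""
    return 100.0 * (sent - received) / sent

def network_throughput(generated, delivered):
    """Ratio of data packets received at the sink node to the total
    number of data packets generated."""
    return delivered / generated
```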
The curves in Figure 6 indicate the normalized end-to-end packet delay. In general, the packet delay increases with the packet generation rate. In each period, each node in our scheme makes routing decisions in a distributed learning manner, reflecting changes in the network environment. Therefore, our approach reduces the packet delay more effectively than the other schemes.

Figure 6. Normalized end-to-end delay.
The simulation results shown in Figures 3–6 demonstrate that the proposed scheme, which uses a Q-learning-based routing game model, can monitor the current USN conditions and adapt to highly dynamic environments. In particular, the sensors in our approach acquire information from the environment, gain knowledge, and make intelligent decisions in a self-adapting manner. The simulation results indicate that the proposed scheme generally exhibits superior performance to the existing schemes. Our approach also attains an appropriate balance of performance, something that the VAPR, 12 GDAR, 13 and ELAR 14 schemes cannot offer.
Summary and conclusion
USNs have recently attracted considerable attention because of their significant capabilities for acoustic monitoring and resource discovery. USNs encompass a wide range of applications, from long-term environmental measurements to the detection of imminent threats. In this article, we have presented a learning game approach to USN routing problems. Our major goal is to develop an efficient USN routing algorithm that is suitable for harsh underwater communication conditions. Considering the unique features of USNs, we have integrated an interactive feedback mechanism and designed a routing game model based on the well-known Q-learning concept. In the proposed model, individual sensor nodes learn better routing strategies and make routing decisions dynamically for adaptive, opportunistic routing. Using online self-monitoring and distributed learning techniques, the proposed scheme dynamically adapts to the current USN situation and effectively maximizes the expected benefits. To demonstrate the validity of our scheme, we compared our model with existing schemes and demonstrated that our approach outperforms them in a simulation environment.
Although we have achieved our goals of energy efficiency and increased throughput in USN management, we believe that there is further scope for improving the efficiency of the USN system. Issues for further research include the design and validation of new USN control schemes in the fields of big data mining, cognitive radio, and network security. In addition, our work could be extended to error tolerance by investigating the effects of MAC layer activities such as packet loss and retransmission.
Footnotes
Handling Editor: Hassan Mathkour
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Ministry of Science and ICT (MSIT), Korea, under the Information Technology Research Center (ITRC) support program (IITP-2017-2014-0-00636) supervised by the Institute for Information & Communications Technology Promotion (IITP) and was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01060835).
