Abstract
In wireless sensor networks, optimizing the network lifetime is an important issue. Most existing works define network lifetime as the time at which the first sensor node exhausts all of its energy. However, this moment is not necessarily important, because when a single sensor node dies, the rest of the network is likely to keep working properly. In this article, we first consider the demands of applications as a whole and define the network lifetime in three aspects. Then, we construct a performance evaluation framework for routing protocols. To optimize the network lifetime in all defined aspects, we propose a reinforcement-learning-based routing protocol. The reinforcement-learning-based routing protocol uses reinforcement learning to search for the optimal routing path for data transmission. In the definition of the reward function, factors such as link distance, residual energy, and hop count to the sink are taken into account to cut down the total energy consumption, balance the energy consumption, and improve the packet delivery. Simulation results demonstrate that, compared with energy-aware routing, BEER, Q-Routing, and MRL-SCSO, the reinforcement-learning-based routing protocol optimizes the network lifetime in all three aspects and improves the energy efficiency.
Keywords
Introduction
In wireless sensor networks (WSNs), each sensor node has a limited energy supply and constrained computation and communication abilities. Therefore, network lifetime becomes the major concern in WSNs.1–3 Most current research defines network lifetime as the time at which the first sensor node exhausts all of its energy. However, this moment is not necessarily important, because when a single sensor node dies, the rest of the network is likely to keep working properly. Most applications of WSNs are concerned with whether the network can provide an acceptable service, which may depend on the percentage of alive nodes, the connectivity to the sink, or the status of packet delivery. Thus, we define network lifetime in three aspects that are related to these factors.
To prolong the network lifetime of WSNs, researchers have proposed methods such as mobile sink,4–6 cross-layer design,7–9 MAC protocol,10–12 and routing protocol.13–19 In this article, we focus on routing protocols. According to the network structure, the routing protocols proposed for WSNs can be categorized into flat routing and hierarchical routing. Our work aims at designing a flat routing protocol. Flat routing is suitable for smaller networks. In addition, for large-scale networks, hierarchical routing also demands a flat routing algorithm for intra-cluster communication. Among the flat routing protocols, energy-aware routing (EAR) 14 is one of the most typical, and surveys20,21 have shown it to be considerably more energy efficient than other flat routing protocols. To avoid always using the minimum energy path, EAR maintains multiple paths between the source node and the destination node and selects one of them for each transmission with a certain probability. EAR has been compared with directed diffusion (DD), and the experimental results show that EAR provides an overall energy saving of 21.5% and a 44% increase in network lifetime, where lifetime is defined as the time of the first node death. EAR has its inherent advantages, but it only considers the energy consumption of communication when determining the path selection probability. Balanced energy efficient routing (BEER) 15 is an improved version of EAR. Unlike EAR, BEER considers not only the energy consumption of communication but also the residual energy of nodes and the number of paths that include the forwarding node when choosing the routing path. Simulation results show that BEER can further postpone the death of the first node. However, as in EAR, the flooding process in the setup and route maintenance phases introduces considerable additional overhead.
In addition, in these two protocols, the data are transmitted in accordance with the established routing table. The routing table which has been built in advance cannot fully reflect the current network status.
In this article, we propose a reinforcement-learning-based routing (RLBR) protocol to solve the problems mentioned above and optimize the lifetime of WSNs. Reinforcement learning (RL) is a sub-area of machine learning that deals with how an agent should take actions in an environment to maximize the long-term reward.22,23 The RL algorithm has its inherent advantages and is well suited to distributed problems.24,25 In this algorithm, each possible action is assigned a
We consider the demands of applications as a whole and define the network lifetime of WSNs in three aspects: the condition of alive nodes, the connectivity to the sink, and the status of packet delivery. Based on this, we further construct a performance evaluation framework for routing protocols of WSNs.
We propose an RL-based routing protocol to optimize the network lifetime of WSNs in all defined aspects. In the proposed protocol, the next forwarder is selected according to the historical learning information and the current estimation information, and factors such as residual energy, link distance, and hop count are taken into account to learn the best paths. This helps sensor nodes keep better connectivity to the sink, balances the energy consumption among sensor nodes, decreases the total energy consumption, and increases the packet delivery.
We adopt schemes such as data packet carrying feedback and transmit power adjusting to further decrease the total energy consumption and improve the energy efficiency.
The rest of the article is organized as follows. In section “Related works,” we introduce the related works. In section “Performance evaluation framework for routing protocols in WSNs,” we discuss the definition of network lifetime and construct a performance evaluation framework for routing protocols. Then, in section “Proposed protocol: RLBR,” we detail the proposed protocol. In section “Performance evaluation,” we conduct simulation experiments to validate the performance of the proposed protocol. Finally, we conclude the article in section “Conclusion.”
Related works
In recent years, the machine learning technique has gained much attention. RL is a sub-area of machine learning, and it attempts to use computer programs to generate patterns or rules from large data sets. In the RL algorithm, the agent selects one action according to the patterns or rules and receives a reward from the environment. Then, the reward is used to update the patterns or rules. By such a learning process, the optimal results can be achieved. Due to the characteristics of RL, it is very suitable to deal with the distributed problems. Accordingly, some researchers use RL algorithm to solve the routing problem of WSNs. In this section, we will introduce the RL-based routing protocols in WSNs.
JA Boyan and ML Littman 28 proposed a basic Q-learning protocol “Q-Routing.” This protocol aims at increasing the rate of packet delivery and takes the minimal delivery time into account to learn the best paths. For each node, its each neighbor is assigned a
In addition, there are some RL-based routing protocols for specific scenarios. A Forster and AL Murphy 31 considered the multi-sink scenario and designed an energy-aware multicast routing protocol “Feedback routing for optimizing multiple sinks (FROMS).” FROMS attempts to minimize the energy dissipation while simultaneously delivering packets to multiple sinks. In FROMS, each node working as an agent learns the best hop costs to any combination of sinks. In the initialization phase, each sink broadcasts an announcement, and the hop counts of nodes to each sink are known. According to the information of hop counts, the initial
In recent years, some researchers still use RL to solve the routing problems of WSNs. MA Razzaque et al. 33 provided a distributed adaptive cooperative routing (DACR) protocol. In the DACR protocol, a lightweight RL method is used to update the routing strategy. In the learning process, the knowledge on reliability and delay is taken into account to determine the reward value. FTIEE, a hierarchical RL-based routing protocol, is proposed by F Kiani et al. 34 to prolong the network lifetime of WSNs. In the first step of the protocol, a new clustering method is applied to the network. The size of the clusters increases with increasing distance to the sink, and the RL technique is used to choose cluster heads. Then, the
Our work differs from the previous works. Table 1 lists the main differences. In our work, factors such as residual energy, link distance, and hop count are considered to learn the best paths, and schemes such as data packet carrying feedback and transmit power adjusting are taken. Our goal is to optimize the network lifetime in all defined aspects and improve the energy efficiency.
Differences between our work and the previous works.
AdaR: adaptive routing; ATP: adaptive tree protocol; FROMS: feedback routing for optimizing multiple sinks; RLBR: reinforcement-learning-based routing protocol.
Performance evaluation framework for routing protocols in WSNs
For WSNs, network lifetime is an important performance metric, and the pivotal problem is what exactly network lifetime means. In addition, for energy-constrained networks such as WSNs, energy efficiency reflects how efficiently the network works. Figure 1 illustrates our performance evaluation framework for routing protocols in WSNs. We evaluate the performance of routing protocols in terms of network lifetime and energy efficiency. We define network lifetime in three aspects: the condition of alive nodes, the connectivity to the sink, and the status of packet delivery. Energy efficiency is related to two factors: the number of packets delivered and the total energy consumption.

Performance evaluation framework for routing protocols in WSNs.
Network lifetime
Numerous publications have studied the lifetime of WSNs. They define network lifetime as:
The time until the first sensor is drained of its energy;36–39
The time until the first cluster head is drained of its energy; 40
The time there is a certain fraction of surviving nodes in the network;41–43
The time until all nodes have been drained of their energy; 44
The time each target is covered by at least one node; 45
The time the whole area is covered by at least one node; 46
The number of total transmitted messages; 49
The time until connectivity or coverage is lost; 50
The time period during which the network continuously satisfies the application requirement. 51
Although there are various definitions of network lifetime, each is based on only one of the following factors: number of alive nodes, connectivity, coverage, or quality of service. For most routing protocols proposed for WSNs, the evaluation of network lifetime is based on one particular definition, and the most common one is the time until the first node is drained of its energy. However, this moment is not necessarily important, because when a single sensor node dies, the rest of the network is likely to keep working properly. Most applications of WSNs care about whether the network can provide an acceptable service, which is related to the condition of alive nodes, the connectivity to the sink, and the status of packet delivery. Thus, we consider the demands of applications as a whole and define the network lifetime in three aspects.
Definition 1
Network lifetime: It contains three aspects: (1) the time until the first dead node appears; (2) the time until the first isolated node appears; and (3) the time until the network cannot accomplish any packet delivery.
In the definition, an isolated node is a node that has energy but has no path to the sink. This means that all the neighbor nodes of the isolated node have died. The first and second aspects show the moments at which the condition of alive nodes and the connectivity to the sink are changed. The third aspect denotes the time when the whole network cannot work any more. When evaluating the performance of network lifetime, these three moments rather than just a single one need to be evaluated.
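As a minimal sketch of how the three moments in Definition 1 could be read off a simulation trace, the following assumes a hypothetical trace format (the `Snapshot` fields and the function name are illustrative, not part of the protocol) in which the simulator periodically records the number of dead nodes, the number of isolated nodes, and the cumulative number of packets received by the sink.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Snapshot:
    """One sample of network state (hypothetical trace format)."""
    time_s: float
    dead_nodes: int        # nodes with no energy left
    isolated_nodes: int    # alive nodes with no path to the sink
    delivered_total: int   # cumulative packets received by the sink


def lifetime_aspects(
    trace: List[Snapshot],
) -> Tuple[Optional[float], Optional[float], Optional[float]]:
    """Return the three lifetime moments of Definition 1 (None if not reached)."""
    if not trace:
        return None, None, None
    first_dead = next((s.time_s for s in trace if s.dead_nodes > 0), None)
    first_isolated = next((s.time_s for s in trace if s.isolated_nodes > 0), None)
    # Third aspect: earliest sample from which the delivery count never grows again.
    # Assumes delivered_total is non-decreasing. If the trace ends while packets
    # are still arriving, the returned time is only a lower bound.
    final_total = trace[-1].delivered_total
    no_more_delivery = next(s.time_s for s in trace if s.delivered_total == final_total)
    return first_dead, first_isolated, no_more_delivery


trace = [
    Snapshot(0, 0, 0, 0),
    Snapshot(10, 0, 0, 5),
    Snapshot(20, 1, 0, 9),
    Snapshot(30, 2, 1, 12),
    Snapshot(40, 3, 2, 12),
]
moments = lifetime_aspects(trace)
```

Reporting all three values together, rather than only the first, matches the requirement that the three moments be evaluated jointly.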
Energy efficiency
The performance of energy efficiency shows the work efficiency of WSNs. We define it as follows:
Definition 2
Energy efficiency: The number of packet delivery by consuming unit energy, which can be calculated by equation (1)
where
At a certain point, the energy efficiency of the network is determined by the number of packet delivery and the total energy consumption at that moment.
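Read literally, Definition 2 is a ratio of delivered packets to consumed energy. The sketch below illustrates that reading only; the function name, the units (joules), and the sample values are assumptions made for the example and are not taken from the simulations.

```python
def energy_efficiency(packets_delivered: int, energy_consumed_j: float) -> float:
    """Packets delivered to the sink per joule consumed by the whole network."""
    if energy_consumed_j <= 0:
        raise ValueError("energy_consumed_j must be positive")
    return packets_delivered / energy_consumed_j


# Hypothetical cumulative samples: (packets delivered, joules consumed so far).
samples = [(40, 2.0), (150, 5.0), (420, 12.0)]
efficiency_over_time = [energy_efficiency(p, e) for p, e in samples]
```

Evaluating the ratio on cumulative counters gives the efficiency at a given moment, as described above.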
Proposed protocol: RLBR
In WSNs, when a sensor node generates or receives a packet, it needs to send the packet to the sink node. If the sensor node cannot reach the sink node directly, it must select one of its neighbors to forward the packet. How to select the neighbor is a routing problem, and this routing problem can be modeled as a Markov decision process (MDP), which can be solved by RL. An RL task is described as an MDP (S, A, P, R),52–54 in which S denotes the set of possible states, A indicates the set of possible actions, P represents the probability of state transition, and R symbolizes the environmental reward. The RL algorithm consists of two main parts: agent and environment. An agent perceives the current state of the environment and selects an action based on the current policy. After taking an action, the agent receives a reward from the environment and updates its policy according to this reward. In our protocol, RL is used and the following measures are taken to achieve the desired routing performance:
In the definition of reward function, residual energy of sensor node and link distance between nodes are taken into account to balance the energy consumption and decrease the total energy consumption.
The hop count to the sink is also considered to define the reward function, which can reduce the delay and indirectly improve the packet delivery.
In the RL-based routing, each node does not need global network information but can still approximate to the global optimization without additional cost.
Once a packet is generated, the correlative nodes find a routing path to deliver the packet to the sink. Each node searches for the next forwarder according to the up-to-date status rather than depending on the built routing table. Thus, the routing process is in accordance with the current condition of the network.
When a node cannot find a neighbor to forward the packet, RLBR adopts one of two schemes. If the node has enough energy to reach the sink, it adjusts its transmission power and sends the packet directly to the sink. This addresses an issue analogous to the void problem in geographic routing. Otherwise, the packet is dropped and the node is regarded as an isolated node, that is, a node that has energy but no path to the sink. From then on, the isolated node is not considered in the choice of next forwarder, which improves the efficiency of path selection.
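The handling of a node without any usable neighbor can be sketched as a small decision function. This is an illustration under stated assumptions: the `Action` names and the parameters `residual_energy_j` and `energy_to_reach_sink_j` are hypothetical, and how the energy needed to reach the sink is estimated is left outside the sketch.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"                    # hand the packet to a candidate neighbor
    DIRECT_TO_SINK = "direct_to_sink"      # raise transmit power, send to the sink
    DROP_AND_ISOLATE = "drop_and_isolate"  # drop the packet, mark node as isolated


def choose_action(
    num_candidates: int,
    residual_energy_j: float,
    energy_to_reach_sink_j: float,
) -> Action:
    """Decide what a node does with a packet it has to forward."""
    if num_candidates > 0:
        return Action.FORWARD
    # No neighbor can forward: try a direct transmission if energy allows it.
    if residual_energy_j >= energy_to_reach_sink_j:
        return Action.DIRECT_TO_SINK
    return Action.DROP_AND_ISOLATE
```

For example, `choose_action(0, 0.02, 0.05)` yields `Action.DROP_AND_ISOLATE`, after which neighbors would stop considering this node as a next forwarder.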
Packet structure
In our model, there are two types of packets: control packet and data packet.
In network initialization phase, control packets are flooded from the sink. The structure of control packet is shown in Figure 2. The fields of node id, location coordinate, residual energy, and hop count indicate the information of the previous forwarder.

Structure of control packet.
In data communication phase, each sensor node sends a data packet to the sink every interval. The structure of data packet is defined in Figure 3. When a node hears a data packet, it first extracts the information of the previous forwarder including node id, location coordinate, residual energy, hop count to the sink, and

Structure of data packet.
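The two packet types and the neighbor table they feed can be sketched as plain data structures. The field names are illustrative; `feedback_value` and `payload` in particular are hypothetical names for the learned value carried as feedback and for the sensed data, and the exact field layout of Figures 2 and 3 is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class SenderInfo:
    """Information about the previous forwarder carried in every packet."""
    node_id: int
    location: Tuple[float, float]
    residual_energy_j: float
    hop_count: int  # hop count of the sender to the sink


@dataclass(frozen=True)
class ControlPacket:
    sender: SenderInfo


@dataclass(frozen=True)
class DataPacket:
    sender: SenderInfo
    next_forwarder: int      # id of the node expected to relay the packet
    feedback_value: float    # hypothetical name for the piggybacked learned value
    payload: bytes = b""     # hypothetical field for the sensed data


def update_neighbor_table(
    table: Dict[int, SenderInfo],
    packet: Union[ControlPacket, DataPacket],
) -> None:
    """Record the latest state of the sender, whether received or overheard."""
    table[packet.sender.node_id] = packet.sender


table: Dict[int, SenderInfo] = {}
update_neighbor_table(table, ControlPacket(SenderInfo(7, (10.0, 5.0), 2.0, 1)))
update_neighbor_table(table, DataPacket(SenderInfo(7, (10.0, 5.0), 1.8, 1), 3, 0.4))
```

Because every overheard packet refreshes the sender's entry, the table tracks the neighbor's current residual energy rather than the value seen at initialization.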
Energy model
The first-order radio model, 16 a generally accepted energy model for WSNs, is used in RLBR. When a sensor node sends or receives a packet, its energy is lessened according to equation (2)
where
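For reference, the first-order radio model in its commonly cited form can be sketched as follows. The constants are values frequently used in the WSN literature and are assumptions here, not the parameters of this article's simulations; the function names are likewise illustrative.

```python
import math

# Commonly used constants (assumed, not taken from this article).
E_ELEC = 50e-9       # J/bit, electronics energy for transmitting or receiving
EPS_FS = 10e-12      # J/bit/m^2, free-space amplifier coefficient
EPS_MP = 0.0013e-12  # J/bit/m^4, multipath amplifier coefficient
D0 = math.sqrt(EPS_FS / EPS_MP)  # crossover distance, about 87.7 m


def tx_energy(bits: int, distance_m: float) -> float:
    """Energy in joules to transmit `bits` over `distance_m`."""
    if distance_m > D0:
        # Beyond the threshold the amplifier cost grows with distance^4.
        return bits * (E_ELEC + EPS_MP * distance_m ** 4)
    # Within the threshold the amplifier cost grows with distance^2.
    return bits * (E_ELEC + EPS_FS * distance_m ** 2)


def rx_energy(bits: int) -> float:
    """Energy in joules to receive `bits`."""
    return bits * E_ELEC
```

With these constants the two branches coincide at `D0`, so the transmit cost is continuous in the distance.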
Protocol operation
In RLBR, each sensor node is an agent. For any sensor node

Flowchart of RLBR.
First, the network is initialized. During this phase, starting from the sink node, each node sends a control packet to its neighbor nodes. As shown in Figure 2, the sender’s location, residual energy, and hop count to sink node are included in the control packet. Once receiving a control packet, the node
where
where
Then, data packets are transmitted in the network. The structure of the data packet is shown in Figure 3. When a sensor node receives or overhears a packet, it extracts the sender’s information and updates its neighbor table. If the current node is not the sensor node indicated by the field of Next Forwarder, it drops the packet. Otherwise, the current node undertakes the task of forwarding the packet. To forward the packet, the current node first looks up its neighbor table. If there is a record of the sink node in the neighbor table, the current node sends the packet directly to the sink. If not, the current node seeks candidate nodes that can be used as forwarders from its neighbor table. A candidate neighbor node must meet the following conditions: (1) it is not an isolated node. That is, its
If there is no neighbor node meeting the above conditions, the current node becomes an isolated node and marks its
If there are multiple candidate neighbor nodes, the current node calculates the relevant
where
where
where (
where
As equation (6) shows, the reward
By the calculation of equation (5), the current node selects the candidate neighbor node with the maximal
where
where
After that, the current node updates the packet header with its own information including node id, location coordinate, residual energy, hop count to the sink, and
Protocol operation sample
To explain the protocol operation more clearly, we give an example as follows. Figure 5 shows the network topology and the initial conditions of the example. For each node, the network topology in Figure 5 only shows its neighbor nodes that can be used as candidate nodes. After network initialization, node

Network topology and initial conditions of the example.
In RLBR, the main steps of data transmission are illustrated in Figures 6–8. In the first round of data transmission, Table 2 is the neighbor table of node

The first round of data transmission in RLBR.

The second round of data transmission in RLBR.

The third round of data transmission in RLBR.
Tables 2–7. Neighbor tables of the nodes in the RLBR example.
RLBR: reinforcement-learning-based routing protocol.
To show the differences between RLBR and other RL-based routing protocols, we select the most classical RL-based routing protocol “Q-Routing” as an example. Q-Routing considers the minimal delivery time to learn the best paths. In Q-Routing, the
After transformation, equation (11) is equivalent to equation (12)
where
where
The current node selects the candidate node with the minimal
where
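For comparison, the Q-Routing update in the form usually attributed to Boyan and Littman can be sketched as below. The notation of equations (11) and (12) is not assumed; the parameter names and the learning rate are illustrative.

```python
from typing import Dict


def q_routing_update(
    q_old: float,
    queue_delay: float,
    transmit_delay: float,
    neighbor_best_estimate: float,
    learning_rate: float = 0.5,
) -> float:
    """Move the delivery-time estimate toward the newly observed target."""
    # Target: time spent in the queue, time on the link, and the neighbor's
    # own best estimate of the remaining delivery time.
    target = queue_delay + transmit_delay + neighbor_best_estimate
    return q_old + learning_rate * (target - q_old)


def pick_next_hop(estimates: Dict[int, float]) -> int:
    """Q-Routing forwards to the neighbor with the minimal estimated time."""
    return min(estimates, key=estimates.get)


updated = q_routing_update(10.0, 1.0, 2.0, 5.0, learning_rate=0.5)
next_hop = pick_next_hop({1: 4.0, 2: 3.0})
```

Note that no energy-related quantity enters either function, which is the property contrasted with RLBR in this section.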
Based on the network topology and initial conditions in Figure 5, the data transmission processes in Q-Routing are shown in Figures 9–11. In the first round of data transmission, the neighbor table of node

The first round of data transmission in Q-Routing.

The second round of data transmission in Q-Routing.

The third round of data transmission in Q-Routing.
Tables 8–12. Neighbor tables of the nodes in the Q-Routing example.
To sum up, Figures 12 and 13 illustrate the path selection results in RLBR and Q-Routing. In Q-Routing, the

Path selection results in RLBR: (a) the first round, (b) the second round, and (c) the third round.

Path selection results in Q-Routing: (a) the first round, (b) the second round, and (c) the third round.
Protocol analysis
In this section, we list the optimization measures in RLBR and analyze the protocol performance, which reveals the contributions of RLBR. RLBR takes the following measures to optimize the performance:
Once a packet is generated, the nodes involved find a routing path to deliver the packet to the sink. Each node searches for the next forwarder according to the up-to-date status rather than depending entirely on the routing table. That is, the routing process follows the current condition of the network, which helps nodes keep better connectivity to the sink.
In the process of data transmission, the current node chooses the next forwarder from the candidate nodes. A candidate node must be closer to the sink than the current node. For example, suppose node A chooses node B as the next forwarder, and node B chooses node C as the next forwarder. Then, the distance from node A to the sink is greater than that from node B to the sink, and the distance from node B to the sink is greater than that from node C to the sink. Node C must in turn select a node closer to the sink as the next forwarder, so it will not select node A or node B to forward data. Thus, there is no routing loop in RLBR.
For each candidate node, the current node calculates its corresponding
The reward is also determined by the link distance. If the link distance is greater than the threshold, the reward is inversely proportional to the fourth power of the distance. Otherwise, the reward is inversely proportional to the square of the distance. That is, the shorter the link distance between the current node and the candidate node, the more likely this candidate node is to be selected as the next forwarder. According to the energy model of WSNs, the shorter the transmission distance, the less energy is consumed for transmitting data. Therefore, this scheme can decrease the total energy consumption.
In addition to residual energy and link distance, the reward is also affected by the hop count to the sink. The smaller the hop count from a candidate node to the sink, the greater the probability that this candidate node is selected as the next forwarder. This lessens the probability of packet loss and speeds up packet delivery.
The scheme of data packet carrying feedback is taken in RLBR. When the current node sends the packet to the next forwarder, the previous node can also overhear a feedback. This feedback can be used to update the
When a node cannot find a neighbor to forward the packet, RLBR adopts one of two schemes. If the node has enough energy to reach the sink, it adjusts its transmission power and sends the packet directly to the sink. Otherwise, the packet is dropped and the node is regarded as an isolated node, that is, a node that has energy but no path to the sink. From then on, the isolated node is not considered in the choice of next forwarder, which improves the efficiency of path selection. These schemes can increase the packet delivery.
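The measures above can be tied together in a short sketch of next-forwarder selection. It is an illustration, not the protocol's actual formulas. The product form of `reward`, the normalization by the threshold, the learning rate `alpha`, the discount `gamma`, and all field names are assumptions. Only the qualitative behavior follows the text: candidates are non-isolated and closer to the sink, and the reward grows with residual energy, falls with the square or fourth power of the link distance, and falls with the hop count.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Neighbor:
    node_id: int
    dist_to_sink_m: float
    link_dist_m: float
    residual_energy_j: float
    hop_count: int
    isolated: bool = False
    advertised_value: float = 0.0  # hypothetical: value fed back by the neighbor
    q_value: float = 0.0


def candidates(neighbors: List[Neighbor], own_dist_to_sink_m: float) -> List[Neighbor]:
    """Keep non-isolated neighbors strictly closer to the sink (prevents loops)."""
    return [
        n for n in neighbors
        if not n.isolated and n.dist_to_sink_m < own_dist_to_sink_m
    ]


def reward(n: Neighbor, threshold_m: float = 87.7) -> float:
    """Assumed reward: more energy, shorter link, fewer hops give a larger value."""
    exponent = 4 if n.link_dist_m > threshold_m else 2
    distance_term = (threshold_m / n.link_dist_m) ** exponent
    return n.residual_energy_j * distance_term / max(n.hop_count, 1)


def q_update(q_old: float, r: float, next_value: float,
             alpha: float = 0.5, gamma: float = 0.9) -> float:
    """Standard Q-learning style update blending history with the new estimate."""
    return (1 - alpha) * q_old + alpha * (r + gamma * next_value)


def select_next_forwarder(neighbors: List[Neighbor],
                          own_dist_to_sink_m: float) -> Optional[Neighbor]:
    cands = candidates(neighbors, own_dist_to_sink_m)
    if not cands:
        return None  # caller falls back to direct transmission or isolation
    for n in cands:
        n.q_value = q_update(n.q_value, reward(n), n.advertised_value)
    return max(cands, key=lambda n: n.q_value)


neighbors = [
    Neighbor(1, 50.0, 20.0, 1.0, 2),
    Neighbor(2, 120.0, 20.0, 1.0, 2),               # farther from the sink: excluded
    Neighbor(3, 60.0, 20.0, 0.5, 2),
    Neighbor(4, 40.0, 20.0, 1.0, 1, isolated=True),  # isolated: excluded
]
best = select_next_forwarder(neighbors, own_dist_to_sink_m=100.0)
```

Because every selected forwarder is strictly closer to the sink than the node that chose it, the distance to the sink decreases at each hop and a packet cannot return to a node it has already visited.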
In a sensor node, the modules that consume energy include the sensor module, the processor module, and the wireless communication module. In practice, however, most of the energy is consumed by the wireless communication module, and sending data consumes the most. Compared with the energy cost of sending data, the computation cost is negligible. For the whole network, if the total energy consumed by transmitting data is reduced and the energy consumption is distributed more evenly among nodes, the appearance of the first dead node is postponed. Thus, measures 3, 4, and 6 above can optimize the network lifetime in the first aspect. In addition, an isolated node is a node that is alive but whose neighbor nodes are dead. If the protocol makes the nodes in the network consume energy more evenly or reduces the total energy consumption of data transmission, the appearance of the first isolated node is also delayed. Therefore, owing to measures 1, 3, 4, and 6, the second aspect of the network lifetime can be enhanced. For the third aspect of network lifetime, measures 5 and 7 can optimize the performance. Finally, since measures 4 and 6 decrease the total energy consumption and measures 5 and 7 improve the packet delivery, the energy efficiency is optimized accordingly.
In summary, RLBR is different from the previous works. The main differences among RLBR, Q-Routing, and MRL-SCSO are shown in Table 13. For each routing protocol, Table 13 lists its main characteristics and the corresponding effects, with regard to the detailed contributions of RLBR, as mentioned above. In addition, the performances of RLBR, Q-Routing, and MRL-SCSO will be further compared and analyzed in section “Performance evaluation.”
The main differences among RLBR, Q-Routing, and MRL-SCSO.
RLBR: reinforcement-learning-based routing protocol.
Performance evaluation
In this section, we evaluate the performance of our proposed protocol RLBR in terms of the network lifetime, which is defined in three aspects, and the energy efficiency. We have implemented EAR, BEER, Q-Routing, MRL-SCSO, and RLBR in NS2. EAR is used in our comparison due to its inherent advantages. It is a typical flat routing protocol and has been shown to be considerably more energy efficient than other flat routing protocols. BEER is an improved version of EAR and has been validated to postpone the death of the first node. In addition, we also compare our protocol with other RL-based routing protocols. Among these, Q-Routing is an early classical representative and MRL-SCSO is a recent representative. Thus, we select these four protocols for comparison.
The simulation parameters are specified in Table 14. According to these parameters, 10 network topologies are randomly generated, and all five protocols are tested under each topology. The comparison results are shown below. Note that the reported results are averages over these topologies.
Simulation parameters.
Network lifetime in the first aspect
First of all, we test the performance of network lifetime in the first aspect, which is defined as the time until the first dead node appears. Figure 14 illustrates the percentage of alive nodes over time. In EAR, BEER, Q-Routing, MRL-SCSO, and RLBR, the first node dies at about 50, 80, 30, 140, and 150 s, respectively. Clearly, compared to the other four protocols, RLBR extends the network lifetime. On average, RLBR yields 200%, 88%, 400%, and 7% longer lifetime than EAR, BEER, Q-Routing, and MRL-SCSO, respectively.

Percent of alive nodes over time.
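The percentage gains quoted above follow directly from the approximate first-node-death times. The short check below recomputes them; the dictionary and function names are illustrative, and the input times are the rounded values read from the text.

```python
# Approximate first-node-death times in seconds, as reported in the text.
first_death_s = {"EAR": 50, "BEER": 80, "Q-Routing": 30, "MRL-SCSO": 140, "RLBR": 150}


def relative_gain_percent(baseline_s: float, improved_s: float) -> float:
    """Percentage by which `improved_s` exceeds `baseline_s`."""
    return 100.0 * (improved_s - baseline_s) / baseline_s


gains = {
    name: relative_gain_percent(t, first_death_s["RLBR"])
    for name, t in first_death_s.items()
    if name != "RLBR"
}
```

The computed gains are 200% over EAR, 87.5% over BEER, 400% over Q-Routing, and about 7.1% over MRL-SCSO, consistent with the rounded figures of 200%, 88%, 400%, and 7%.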
As can be seen from Figure 14, Q-Routing has the shortest network lifetime. This is because it does not consider any energy-related factor to choose the routing path. In RLBR, once receiving or overhearing a packet, the sensor node selects one node as the next forwarder according to the
In MRL-SCSO, the sleep scheduling scheme can decrease the energy consumption. In RLBR, however, the scheme of data packet carrying feedback can also save energy. Moreover, RLBR takes link distance into account in the reward function, which further cuts down the energy consumption. In RLBR, if the link distance is greater than the threshold, the reward is inversely proportional to the fourth power of the link distance. Otherwise, the reward is inversely proportional to the square of the link distance. As a result, the shorter the link distance, the larger the reward for sending a packet to this neighbor node, and the more likely this neighbor node is to be chosen as the next forwarder. Accordingly, the current node consumes less energy to send a packet to the next forwarder. From a global perspective, this scheme reduces the total energy consumption of data transmission. In addition, RLBR also considers the residual energy of nodes in the reward, which balances the energy consumption. Thus, RLBR can prolong the network lifetime in the first aspect.
Network lifetime in the second aspect
Then, we pay attention to the network lifetime in the second aspect, which is defined as the time until the first isolated node appears. Figure 15 reveals the comparison of connectivity to the sink. The experiment was implemented for 10 rounds. The vertical coordinate indicates the time of the first isolated node appearing in each round. An isolated node has energy but has no path to the sink. Such a node cannot send any packet to the sink. Thus, it is equivalent to an invalid node. It can be seen that RLBR gains a significant improvement of network lifetime in terms of the second definition over EAR, BEER, Q-Routing, and MRL-SCSO.

Time of the first isolated node appearing in each round.
EAR and BEER select the next forwarder depending entirely on the routing table, MRL-SCSO selects the next forwarder according to the current estimation information, and RLBR chooses the next forwarder according to both the historical learning information and the current estimation information. The routing process in MRL-SCSO and RLBR follows the current condition of the network. Therefore, compared to EAR and BEER, MRL-SCSO and RLBR help nodes keep better connectivity to the sink. Q-Routing also considers the historical learning information and the current estimation information to select the next forwarder. From this point of view, the connectivity performance of Q-Routing should be better than that of EAR and BEER. However, the first dead node appears earliest in Q-Routing, which weakens its connectivity performance. Thus, the performance of Q-Routing in the second aspect of network lifetime is slightly worse than that of EAR.
As for MRL-SCSO and RLBR, RLBR has the following advantages. First, in MRL-SCSO, the neighbor with the maximum reward value is selected as the next forwarder. That is to say, MRL-SCSO only considers the current estimation information to choose routing path, while RLBR considers the historical learning information and the current estimation information. Such a way in RLBR can provide better network connectivity. Second, an isolated node means that the node is alive but the neighbor nodes are dead. If the protocol can make the nodes in the network consume energy more evenly or reduce the total energy consumption of data transmission, the time of the isolated node appearing can be delayed. As discussed in section “Network lifetime in the first aspect,” the total energy consumption in RLBR is less than that in MRL-SCSO. Therefore, in the second aspect of network lifetime, the performance of RLBR is better than that of MRL-SCSO.
Network lifetime in the third aspect
For the third aspect of network lifetime, it is defined as the time until the network cannot accomplish any packet delivery. Figure 16 reflects the number of packet delivery over time. After 230, 270, 180, 400, and 500 s, the total number of packet delivery in EAR, BEER, Q-Routing, MRL-SCSO, and RLBR no longer changes. That is to say, at these times, the network can no longer complete any packet delivery. It is apparent that RLBR achieves longer lifetime over EAR, BEER, Q-Routing, and MRL-SCSO in the third aspect of network lifetime. In addition, as can be seen from Figure 16, although RLBR hands over fewer packets to the sink at the beginning, the superiority of RLBR is obvious at most of the time. After 110 s, RLBR delivers more packets than the other four protocols, and the gap is getting bigger and bigger. This is because RLBR needs to gradually learn from the environment. Initially, it evaluates the approximate goodness of each action. Afterward, it increasingly learns the real situation, which can be used to select the most appropriate path to transmit the packet.

Number of packets delivered over time.
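The third aspect of network lifetime can be read off a cumulative delivery curve as the first sample time from which the count never changes again. The sketch below illustrates this reading. The function name `delivery_lifetime` and the sample values are hypothetical and are not taken from the simulation traces. It assumes the cumulative count is non-decreasing; if the count is still rising at the last sample, the function simply returns the last sample time.

```python
def delivery_lifetime(times, cumulative):
    """Return the earliest sample time from which the cumulative number of
    delivered packets no longer changes.

    times:      sample instants in seconds, in increasing order
    cumulative: cumulative number of delivered packets at each instant
                (assumed non-decreasing)
    """
    if not times or len(times) != len(cumulative):
        raise ValueError("times and cumulative must be non-empty and equally long")
    final_count = cumulative[-1]
    # Because the curve is non-decreasing, the first time it reaches its
    # final value is the time after which no further packet is delivered.
    for t, count in zip(times, cumulative):
        if count == final_count:
            return t


# Hypothetical trace: deliveries stop at 20 s.
times = [0, 10, 20, 30, 40]
cumulative = [0, 5, 9, 9, 9]
print(delivery_lifetime(times, cumulative))
```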
In Q-Routing, the minimal delivery time is taken into account to learn the best paths, which speeds up packet delivery. Therefore, Q-Routing delivers the largest number of packets at the beginning. However, after 110 s, more than half of the nodes in Q-Routing have exhausted their energy. Accordingly, the number of newly delivered packets in Q-Routing decreases. In RLBR, when a node becomes isolated, the scheme of adjusting transmit power lets the node make a last effort to send the packet to the sink, which improves packet delivery to a certain extent. Besides, unlike EAR, BEER, Q-Routing, and MRL-SCSO, RLBR also considers the hop count to the sink when searching for the routing path. When the energy-related factors are similar, it encourages nodes to select a next forwarder nearer to the sink, which reduces packet loss and speeds up packet delivery. Therefore, RLBR can improve packet delivery.
Summary of network lifetime
Figure 17 summarizes the comparative results of network lifetime in the three aspects. In EAR, the first node dies at 50 s, the first isolated node appears at 90 s, and the network can no longer accomplish any packet delivery at 230 s. These three moments in EAR can be written as (50 s, 90 s, 230 s). In BEER, Q-Routing, MRL-SCSO, and RLBR, the three moments are (80 s, 190 s, 270 s), (30 s, 80 s, 180 s), (140 s, 310 s, 400 s), and (150 s, 350 s, 500 s), respectively. RLBR thus has the latest moment in all three aspects of network lifetime.

Comparison of network lifetime in three aspects.
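The three moments reported above can be compared mechanically. The sketch below uses only the values stated in the text. The helper names `dominates` and `relative_gain` are hypothetical conveniences for the comparison.

```python
# (first node death, first isolated node, last packet delivery), in seconds,
# as reported in the text for each protocol.
LIFETIMES = {
    "EAR": (50, 90, 230),
    "BEER": (80, 190, 270),
    "Q-Routing": (30, 80, 180),
    "MRL-SCSO": (140, 310, 400),
    "RLBR": (150, 350, 500),
}


def dominates(a, b):
    """True if lifetime tuple a is strictly larger than b in every aspect."""
    return all(x > y for x, y in zip(a, b))


def relative_gain(a, b):
    """Per-aspect relative improvement of a over b, e.g. 0.25 means 25%."""
    return tuple((x - y) / y for x, y in zip(a, b))


rlbr = LIFETIMES["RLBR"]
for name, moments in LIFETIMES.items():
    if name != "RLBR":
        gains = ", ".join(f"{g:.1%}" for g in relative_gain(rlbr, moments))
        print(f"RLBR vs {name}: dominates={dominates(rlbr, moments)}, gains=({gains})")
```

Against the closest competitor, MRL-SCSO, the reported moments correspond to gains of roughly 7%, 13%, and 25% in the first, second, and third aspects, respectively.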
Energy efficiency
Finally, we test the energy efficiency, which is defined as the number of packets delivered per unit of energy consumed. Figure 18 shows the number of delivered packets over energy consumption. At first, for the same energy consumption, MRL-SCSO and RLBR deliver fewer packets than EAR, BEER, and Q-Routing, which is due to the initial learning in MRL-SCSO and RLBR. Through continuous learning, the most appropriate path can be selected to transmit the packet, which improves the packet delivery. Thus, the energy efficiency of MRL-SCSO and RLBR gradually becomes higher than that of EAR, BEER, and Q-Routing. Between MRL-SCSO and RLBR, the energy efficiency of the latter is higher. Moreover, as the energy consumption increases, the difference in packet delivery between RLBR and the other four protocols becomes more obvious.

Number of packets delivered over energy consumption.
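Energy efficiency, as defined in this section, is the ratio of delivered packets to consumed energy. The sketch below computes that ratio along a curve. The function name `efficiency_curve`, the use of joules as the energy unit, and the sample values are assumptions made for illustration and are not results from the simulation.

```python
def efficiency_curve(consumed_energy, delivered_packets):
    """Return packets delivered per unit of energy at each sample point.

    consumed_energy:   cumulative energy consumed (assumed here to be in joules)
    delivered_packets: cumulative number of delivered packets at the same points
    Samples with zero consumed energy are skipped, since the ratio is undefined.
    """
    if len(consumed_energy) != len(delivered_packets):
        raise ValueError("both sequences must have the same length")
    return [
        packets / energy
        for energy, packets in zip(consumed_energy, delivered_packets)
        if energy > 0
    ]


# Hypothetical samples for a learning-based protocol: efficiency starts low
# during initial learning and rises afterward.
energy = [0.0, 2.0, 4.0, 8.0]
packets = [0, 20, 60, 160]
print(efficiency_curve(energy, packets))
```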
The advantage of RLBR stems from its design. Regarding energy consumption, first, when there is data to be transmitted, RLBR uses RL to compute and select the optimum routing path at nearly no additional cost, whereas EAR and BEER incur extra energy overhead for building and maintaining routing tables. Second, the reward in RLBR is influenced by the distance between the current node and the neighbor node. If the distance is greater than the threshold, the reward is inversely proportional to the fourth power of the distance. Otherwise, the reward is inversely proportional to the square of the distance. That is to say, the shorter the distance between the current node and the neighbor node, the more likely this neighbor node is to be chosen as the forwarding node. Consequently, the current node consumes less energy to send a packet to the next forwarder. This scheme reduces the total energy consumption of data transmission in the network. Finally, in RLBR, the scheme of data packets carrying feedback further saves energy. Regarding packet delivery, RLBR includes the hop count to the sink in the reward function to encourage nodes to select a next forwarder nearer to the sink. This speeds up packet delivery, decreases packet loss, and ultimately increases the number of delivered packets. In addition, RLBR adjusts the transmit power to let an isolated node make a last effort to send the packet to the sink, which improves packet delivery to a certain extent. Therefore, RLBR enhances the energy efficiency. For the applications of WSNs, RLBR can offer a better service at a lower cost.
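The distance dependence of the reward described above can be sketched as follows. This covers only the distance term and uses a proportionality constant of 1. The complete reward function, which also involves residual energy and hop count, is not reproduced here. The function name `distance_factor` and the threshold value of 50 m are hypothetical.

```python
def distance_factor(distance, threshold):
    """Distance-dependent term of the reward, up to a proportionality constant.

    Inversely proportional to the square of the distance when the distance is
    at most the threshold, and to the fourth power when it exceeds the threshold.
    """
    if distance <= 0 or threshold <= 0:
        raise ValueError("distance and threshold must be positive")
    exponent = 4 if distance > threshold else 2
    return 1.0 / distance ** exponent


# With a hypothetical 50 m threshold, nearer neighbors get a larger factor.
for d in (10, 20, 60, 100):
    print(d, distance_factor(d, threshold=50))
```

Because only proportionality is stated, the two branches in this sketch are not scaled to meet at the threshold. A real reward function would need constants chosen for that purpose.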
Conclusion
Network lifetime is an important performance metric for WSNs. In this article, we have first defined the network lifetime of WSNs in three aspects and constructed a performance evaluation framework for routing protocols. Then, we have proposed an RL-based routing protocol for WSNs. RLBR makes use of the strengths of RL to achieve global optimization without additional cost. Moreover, it uses link distance, residual energy, and hop count to define the reward function and adopts schemes such as data packets carrying feedback and adjusting transmit power to decrease the total energy consumption, balance the energy consumption, and improve the packet delivery. This protocol aims at enhancing the network lifetime of WSNs in all defined aspects and at meeting the demands of applications that are concerned about whether the network can provide an acceptable service. Although RLBR is a flat routing protocol, it can also be applied to large-scale WSNs. In such networks, RLBR is able to handle the routing issue inside each cluster or among cluster heads. We have validated the performance of RLBR in NS2. RLBR shows superior performance over EAR, BEER, Q-Routing, and MRL-SCSO in terms of the percent of alive nodes, the connectivity to the sink, the number of delivered packets, and the energy efficiency. In the future, we intend to test this protocol in real WSN environments such as testbeds or deployments.
Footnotes
Handling Editor: Seokcheon Lee
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 2232015D3-29), the National Natural Science Foundation of China (Grant No. 61772128), and the Shanghai Municipal Natural Science Foundation (Grant No. 14ZR1400900).
