Abstract

Cooperative communication has emerged as a new wireless network communication concept, in which parameter optimization such as cross-layer cooperation plays an important role. A heuristic evaluation postdecision state (HE-PDS) learning algorithm is proposed for cross-layer cooperation. The proposed algorithm exploits the determinate state information and jointly considers the transmit power and channel state at the physical layer and the buffer congestion control at the media access control layer. Experimental results show that, under the maximum delay and throughput constraints, the cumulative average total cost of the HE-PDS algorithm is about ten times lower than that of the traditional Q learning algorithm and about 8% lower than that of the PDS learning algorithm. Its power cost is about 50% and 28% lower than that of the Q and PDS algorithms, respectively, under various delay limits, and about 100% and 56% lower under different throughput constraints. These results indicate that the proposed algorithm is more energy efficient and converges faster than both the traditional Q learning algorithm and the PDS learning algorithm.

1. Introduction

Recently, the merits of cooperative communication in the physical layer have been explored. However, the impact of cooperative communication on the design of the higher layers has not been well understood yet. As wireless devices often rely on battery power sources in wireless communication, how to minimize the energy consumption under the constraints on both delay and throughput has posed a great challenge and attracted lots of research attention in recent years [1]. Besides, affected by the fading channel state, time-varying buffer state, and dynamic traffic characteristics, this problem becomes more sophisticated [2]. Since the unknown environment can be modeled as a Markov decision process (MDP), it is reasonable to build the cross-layer transmission strategy based on this property [3, 4]. The state of the art of research on the energy-efficient problem in wireless communication can be mainly divided into two categories: the cross-layer design and approximate algorithm design. Related research is as follows.

From the view of energy-efficient design, the authors in [5] analysed the throughput performance. However, the limited buffer was not taken into account in the performance analysis. References [6–8] considered energy-efficient packet transmission under a packet delay constraint. In [9, 10], the authors investigated the balance between throughput and energy consumption. Although all these works obtained good energy-efficient performance, the trade-off among delay, throughput, and energy consumption was not fully considered. Exploiting the characteristics of the MDP model, [6, 9–11] formulated optimal packet transmission as a control policy, which was solved by a reinforcement learning (RL) algorithm. However, most of these works performed the computation offline, which restricts their application. In [7], the authors introduced postdecision state (PDS) learning to raise the convergence rate. Unfortunately, the transmission power state of the model was not taken into account. Although [11] took the power state into consideration, the trade-off between exploration and exploitation was not fully considered. Therefore, the convergence performance of the algorithm needs to be further improved.

To address the aforementioned challenge, this paper extends our prior work [8] by considering the constraints of both delay and throughput simultaneously. We propose a heuristic evaluation postdecision state (HE-PDS) algorithm for packet transmission, which has not only low computation complexity but also faster convergence speed. The specific contributions of this paper include the following.

(i) A literature survey of existing energy-efficient policies, analyzing their advantages and disadvantages.

(ii) An effective energy-efficient optimization model for decreasing the energy consumption in wireless communication.

(iii) A unified framework that realizes a scheduling mechanism by jointly considering the transmit power and channel state at the physical layer and the buffer congestion control at the media access control layer.

(iv) An online RL algorithm that fully exploits the known state information about the system's dynamics to improve learning performance.

(v) A performance analysis of the proposed algorithm and an evaluation against other existing algorithms.

The rest of this paper is organized as follows. Section 2 presents the cross-layer cooperation model. Section 3 formulates the problem as a constrained Markov decision process (CMDP). Section 4 presents the online HE-PDS algorithm for cross-layer optimization. Section 5 reports the experiments. Finally, Section 6 summarizes the results and discusses some future research directions.

2. Cross-Layer Cooperation Model

Cooperative communication improves network performance in certain circumstances. However, when cooperation is not necessary, it makes the system more complex, increases the transmission delay, and reduces the efficiency of the system. As illustrated in Figure 1, we consider a point-to-point system in which a single user (a transmitter and receiver pair) transmits data from a finite buffer queue over a time-varying channel. We divide the transmission time into equal slots of length $Δ t$, and time slot n denotes the discrete time interval $[n Δ t, (n + 1) Δ t]$. We assume that the transmission and power management decisions are made once per time slot and that the system state remains unchanged within each slot. The transmitter adaptively adjusts its transmission rate and transmission power according to the delay, throughput, and channel state information fed back by the receiver.

Figure 1

Wireless transmission system.

2.1. Physical Layer Model

We consider a discrete-time block Rayleigh fading channel with additive white Gaussian noise (AWGN) [12, 13], where the noise power spectral density is $N_{0} / 2$ and the wireless channel bandwidth is W. We assume that the channel power gain is constant during each time slot and that the channel state can only transition to an adjacent state. In this paper, we use a finite state Markov channel (FSMC) to describe the wireless channel [14, 15]. As shown in Figure 2, there are K channel states, each of which can transition to its adjacent states with the corresponding probabilities.

Figure 2

FSMC model.

In a Rayleigh fading channel, the received instantaneous signal-to-noise ratio (SNR) φ is exponentially distributed with probability density function

\begin{matrix} f_{φ} (φ) = \frac{1}{φ_{0}} \exp (- \frac{φ}{φ_{0}}), \end{matrix}

(1)

where $φ_{0} = E [φ]$ is the average received SNR. The channel is said to be in state $h_{k}$ if the received SNR lies in the interval $[φ_{k}, φ_{k + 1})$. Let $N (φ)$ be the level crossing rate (LCR), which is given by

\begin{matrix} N (φ) = \sqrt{\frac{2 π φ}{φ_{0}}} \cdot f_{d} \cdot \exp (- \frac{φ}{φ_{0}}), \end{matrix}

(2)

where $f_{d}$ is the maximum Doppler frequency. Therefore, the state transition probability can be obtained by the following formula:

\begin{matrix} p_{h} (k, k + 1) = N (φ_{k + 1}) \cdot \frac{Δ t}{π_{k}}, 1 \leq k \leq K - 1, \\ p_{h} (k, k - 1) = N (φ_{k}) \cdot \frac{Δ t}{π_{k}}, 2 \leq k \leq K, \\ p_{h} (k, k) = 1 - p_{h} (k, k + 1), k = 1, \\ p_{h} (k, k) = 1 - p_{h} (k, k - 1), k = K, \\ p_{h} (k, k) = 1 - p_{h} (k, k + 1) - p_{h} (k, k - 1), k \neq 1, K, \end{matrix}

(3)

where the steady state probability (SSP) $π_{k}$ is given by

\begin{matrix} π_{k} = \int_{φ_{k}}^{φ_{k + 1}} f_{φ} (φ) d φ = \exp (- \frac{φ_{k}}{φ_{0}}) - \exp (- \frac{φ_{k + 1}}{φ_{0}}) . \end{matrix}

(4)
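As an illustration of (2)–(4), the following Python sketch builds the FSMC transition matrix from a list of SNR thresholds. The function names, the example thresholds, and the values chosen for $φ_{0}$, $f_{d}$, and $Δ t$ are assumptions made for this sketch only; they are not the settings used in the experiments.

```python
import math

def level_crossing_rate(phi, phi0, f_d):
    """LCR N(phi) of (2) for a Rayleigh channel with mean SNR phi0."""
    return math.sqrt(2.0 * math.pi * phi / phi0) * f_d * math.exp(-phi / phi0)

def fsmc(bounds, phi0, f_d, dt):
    """Return (steady-state probabilities, transition matrix) of a K-state FSMC.

    bounds holds K + 1 increasing SNR thresholds; state k (0-based) covers
    [bounds[k], bounds[k + 1]). The last threshold may be math.inf.
    """
    K = len(bounds) - 1
    # Steady-state probability of each state, as in (4).
    pi = [math.exp(-bounds[k] / phi0) - math.exp(-bounds[k + 1] / phi0)
          for k in range(K)]
    P = [[0.0] * K for _ in range(K)]
    for k in range(K):
        # Transitions to adjacent states only, as in (3).
        up = 0.0
        down = 0.0
        if k < K - 1:
            up = level_crossing_rate(bounds[k + 1], phi0, f_d) * dt / pi[k]
            P[k][k + 1] = up
        if k > 0:
            down = level_crossing_rate(bounds[k], phi0, f_d) * dt / pi[k]
            P[k][k - 1] = down
        P[k][k] = 1.0 - up - down
    return pi, P

# Hypothetical example: four states, mean SNR 1, Doppler 10 Hz, 1 ms slots.
pi, P = fsmc([0.0, 0.5, 1.0, 2.0, math.inf], phi0=1.0, f_d=10.0, dt=1e-3)
```

The slot length must be short enough that the two off-diagonal entries of each row sum to at most one; otherwise the diagonal entry becomes negative and the thresholds or $Δ t$ need to be revised.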

2.2. MAC Layer Model

As shown in Figure 3, the transmission buffer is a first-in first-out (FIFO) queue. In the nth time slot, the transmitter receives $l_{n}$ packets, stores them in the finite buffer, and sends some packets from the buffer. The packet arrivals are assumed to be independent and identically distributed (i.i.d.) across slots. For simplicity, we assume that the packet arrivals follow a Poisson process with rate λ. Therefore, the probability that l packets arrive in a slot is

\begin{matrix} p^{l} (l) = \frac{λ^{l} \cdot \exp (- λ)}{l!} . \end{matrix}

(5)

Figure 3

Buffer timing diagram.

The backlog in the transmitter buffer is denoted by $b \in [0, B]$, where B is the capacity of the finite buffer and each packet contains L bits. Arriving packets are dropped if the buffer is full. We also assume that packet arrivals occur at the end of each time slot.

Let $z_{n}$ packets be sent by the transmitter in slot n, where $z_{n} \in \{0, 1, \dots, B\}$. Because of bit errors, the number of packets correctly received, $f_{n} ({BER}_{n}, z_{n})$, may be smaller than $z_{n}$; that is, $f_{n} ({BER}_{n}, z_{n}) \leq z_{n}$, where BER is the bit error ratio. Assuming independent packet losses, $f_{n}$ follows a binomial distribution:

\begin{matrix} p^{f} (f_{n} ∣ {BER}_{n}, z_{n}) = bin (z_{n}, 1 - {PER}_{n}), \end{matrix}

(6)

where $bin (z, q)$ denotes the binomial distribution with z trials and success probability q, and PER is the packet error ratio, which satisfies ${PER}_{n} = 1 - {(1 - {BER}_{n})}^{L}$ for a packet of L bits. We also define $b_{int}$ as the initial buffer state and $b_{n}$ as the buffer state in the nth slot. The buffer state at the transmitter therefore evolves recursively as follows:

\begin{matrix} b_{0} = b_{int}, \\ b_{n + 1} = \min (b_{n} - z_{n} + l_{n}, B) . \end{matrix}

(7)
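The MAC layer quantities above can be sketched in a few lines of Python. The function names are hypothetical, and the packet error ratio is computed under the assumption of independent bit errors within a packet of L bits; the buffer recursion follows (7) as written.

```python
import math

def poisson_pmf(l, lam):
    """Probability of l packet arrivals in one slot, as in (5)."""
    return lam ** l * math.exp(-lam) / math.factorial(l)

def packet_error_ratio(ber, bits_per_packet):
    """PER of a packet with independent bit errors: 1 - (1 - BER)^L."""
    # expm1 and log1p keep precision when the BER is very small.
    return -math.expm1(bits_per_packet * math.log1p(-ber))

def next_buffer(b, z, l, B):
    """Buffer recursion of (7): arrivals beyond the capacity B are dropped."""
    return min(b - z + l, B)
```

For instance, with the smallest BER used later in the simulations ($2 \times 10^{- 8}$) and 5000-bit packets, `packet_error_ratio(2e-8, 5000)` is close to $10^{- 4}$, so nearly every transmitted packet is received correctly.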

2.3. Dynamic Power Management Model

To reduce power consumption, we assume that the wireless card can switch to a low-power state, similar to [8, 11, 16]. Specifically, the card can be in one of two power management states; that is, $x \in X = \{on, idle\}$. The power state can be switched to on or idle by the corresponding actions in the set $Y = \{s_{on}, s_{idle}\}$. We define $P_{on}$ and $P_{idle}$ as the power consumed by the wireless card in the on and idle states, respectively. Let $P_{tr}$ be the power consumed when the state transitions from on to idle or vice versa. In the nth slot, if $z_{n}$ packets are transmitted, then the required power is

\begin{array}{l} ρ ([h_{n}, x_{n}], {BER}_{n}, y_{n}, z_{n}) \\ = \begin{cases} P_{idle}, & \text{if } x_{n} = idle, y_{n} = s_{idle}, \\ P_{on} + P_{t x} (h_{n}, {BER}_{n}, z_{n}), & \text{if } x_{n} = on, y_{n} = s_{on}, \\ P_{tr}, & \text{otherwise}, \end{cases} \end{array}

(8)

where $h_{n}$ is the channel state; $x_{n}$ is the power management state; and $y_{n}$ and $P_{t x}$ are the power management action and the transmission power, respectively. Define u as the number of symbols per slot. Following the discussion in [10], the power required for the transmission is given by

\begin{matrix} P_{t x} \geq \frac{W \cdot N_{0}}{h_{n}} \cdot \frac{- \log (5 \cdot {BER}_{n}) \cdot (2^{z_{n} \cdot L / u} - 1)}{1.5} . \end{matrix}

(9)
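A minimal Python sketch of the power model in (8) and (9) is given below. The function and argument names are hypothetical, the bound in (9) is taken with equality, and the logarithm is assumed to be the natural logarithm, which matches the usual exponential approximation of the M-QAM bit error ratio.

```python
import math

def tx_power(h, ber, z, L, u, W, N0):
    """Transmission power of (9), taken with equality.

    z packets of L bits are sent over u symbols; ber must be below 0.2 so
    that -log(5 * ber) is positive.
    """
    bits_per_symbol = z * L / u
    return (W * N0 / h) * (-math.log(5.0 * ber)) * (2.0 ** bits_per_symbol - 1.0) / 1.5

def power_cost(x, y, h, ber, z, L, u, W, N0, P_on, P_idle, P_tr):
    """Per-slot power cost of (8) for power state x and power management action y."""
    if x == "idle" and y == "s_idle":
        return P_idle
    if x == "on" and y == "s_on":
        return P_on + tx_power(h, ber, z, L, u, W, N0)
    return P_tr  # the card switches between the on and idle states
```

Because the required power grows exponentially with the number of bits per symbol, sending a few packets in every slot is cheaper than sending many packets in a single slot, which is the source of the delay-energy trade-off studied later.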

In implementing the power management action, we assume that the delay of switching from one power state to another is negligibly small. Let $P_{x} (y) = {[p (x^{'} ∣ x, y)]}_{x, x^{'}}$ denote the transition probability matrix, whose entries give the probability that the power state switches from x to $x^{'}$ when the power management action is y. As shown in Figure 4, the sequence of power management states can be modeled as a controlled Markov chain with these transition probabilities.

Figure 4

State diagram of the power model.

3. Problem Formulation

As discussed in Section 2, for a given channel state, the energy consumption is a convex function of the number of transmitted packets z, so an optimal solution to this problem must exist [17]. Given the buffer state b, channel state h, and power management state x, we define the joint state vector $s_{n} = (b_{n}, h_{n}, x_{n}) \in S$, where n denotes the nth time slot. We formulate this joint state process as a constrained Markov decision process (CMDP). We use $a_{n} = ({BER}_{n}, y_{n}, z_{n}) \in A$ to represent the joint action, where BER is the bit error ratio, y is the power management action, and z is the number of packets to be transmitted. For simplicity, we use the buffer overhead in place of the queueing delay at the transmitter [8]. Consequently, the holding cost and overflow cost of the buffer can be obtained as follows:

\begin{array}{l} g_{holding} ([b, x], BER, y, z) = \sum_{f = 0}^{z} p^{f} (f ∣ BER, z) [b - f], \\ g_{overflow} ([b, x], BER, y, z) \\ = \sum_{l = 0}^{\infty} \sum_{f = 0}^{z} p^{l} (l) p^{f} (f ∣ BER, z) \max ([b - f] + l - B, 0), \end{array}

(10)

where the holding cost stands for the expected number of packets that remain in the buffer. We use the parameter η to weight the overflow cost relative to the holding cost. Thus, the buffer cost can be evaluated as

\begin{matrix} g ([b, x], BER, y, z) = g_{holding} + η \cdot g_{overflow} . \end{matrix}

(11)
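The buffer cost of (10) and (11) can be evaluated numerically as in the following Python sketch. The function names and the truncation point `l_max` of the infinite sum over arrivals are assumptions of this sketch; the truncation is adequate when λ is much smaller than `l_max`.

```python
import math

def binom_pmf(f, z, q):
    """Probability that f of z transmitted packets are received (success probability q)."""
    return math.comb(z, f) * q ** f * (1.0 - q) ** (z - f)

def poisson_pmf(l, lam):
    """Probability of l packet arrivals in one slot."""
    return lam ** l * math.exp(-lam) / math.factorial(l)

def buffer_cost(b, z, per, lam, B, eta, l_max=100):
    """Buffer cost g = g_holding + eta * g_overflow of (10) and (11)."""
    holding = 0.0
    overflow = 0.0
    for f in range(z + 1):
        pf = binom_pmf(f, z, 1.0 - per)
        # Holding cost: packets left in the buffer after f successful deliveries.
        holding += pf * (b - f)
        # Overflow cost: arrivals that do not fit into the remaining space.
        for l in range(l_max + 1):
            overflow += poisson_pmf(l, lam) * pf * max(b - f + l - B, 0)
    return holding + eta * overflow
```

With an error-free link (`per=0.0`) and a buffer far from full, the cost reduces to the backlog left after transmission, `b - z`; the overflow term only contributes when the backlog approaches the capacity B.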

For the average packet arrival rate λ, the system throughput can be calculated by

\begin{matrix} Throughput = λ (1 - p_{drop}) (1 - PER), \end{matrix}

(12)

where PER is the packet error ratio and $p_{drop} = (1 / λ) \lim_{N \to \infty} \sum_{n = 1}^{N} g_{overflow}$ is the long-term average buffer overflow probability. Therefore, maximizing the throughput is equivalent to minimizing the total number of lost packets, which includes both the packets dropped due to buffer overflow and the packets lost due to bit errors. As in [10], this loss can be modeled as

\begin{matrix} δ = λ \cdot p_{drop} + z \cdot PER . \end{matrix}

(13)

In summary, we can reformulate the cross-layer energy-efficient transmission optimization as a problem of minimizing the long-term average power consumption under transmission delay and throughput constraints. Therefore, the optimization problem can be expressed as

\begin{array}{r} \underset{π}{\arg \min} \underset{N \to \infty}{\limsup} \frac{1}{N} E \left\{ \sum_{n = 1}^{N} γ^{n} ρ (s_{n}, π (s_{n})) ∣ s_{0} = s \right\}, \\ \text{subject to: } \underset{N \to \infty}{\limsup} \frac{1}{N} E \left\{ \sum_{n = 1}^{N} γ^{n} δ (s_{n}, π (s_{n})) ∣ s_{0} = s \right\} \leq T, \\ \underset{N \to \infty}{\limsup} \frac{1}{N} E \left\{ \sum_{n = 1}^{N} γ^{n} g (s_{n}, π (s_{n})) ∣ s_{0} = s \right\} \leq D, \end{array}

(14)

where $γ (0 \leq γ \leq 1)$ is the discount factor and $π : S \to A$ is a stationary policy that maps each system state to an action in each time slot. T and D denote the throughput and delay constraints, respectively. Similar to [18], by introducing the Lagrange multipliers $μ_{1}$ and $μ_{2}$, this problem can be reformulated as an unconstrained MDP. Specifically, we define the system Lagrangian cost function as

\begin{matrix} c (s, a) = ρ (s, a) + μ_{1} \cdot δ (s, a) + μ_{2} \cdot g (s, a) . \end{matrix}

(15)

4. Heuristic Evaluation PDS Learning Algorithm

4.1. Algorithm Description

The traditional Q learning algorithm treats the state information of the environment as uncertain when learning the values of the state-action pairs. The known state information therefore cannot be fully utilized in the learning process, which inevitably results in poor convergence performance. In most communication systems, however, part of this information is known. Table 1 gives an example of what is known and what is unknown. In Figure 4, when $p_{on, idle}$ and $p_{idle, on}$ are known and determined, the $P_{x} (y)$ in Table 1 can be classified as known and determined. Besides, if the transmission power $P_{t x}$ in (8) is known, then the power consumption can be classified as known. Similarly, the packet arrival probability and holding cost can be classified as known and stochastic when the BER in (6) is known.

Table 1

Classification of the dynamic environment.

Known	Determined	Power management state transition $P_{x} (y)$; energy consumption (8)
Known	Stochastic	Channel state transition probability $p (k + 1, k)$; packet arrival probability $p^{l} (l)$; holding cost (10)
Unknown	Determined	N/A
Unknown	Stochastic	Overflow cost (10); packet drop cost (13)

In order to use the known state information, we introduce postdecision state (PDS) and PDS value function as in [7]. In PDS learning algorithm, the search of optimal strategy is mainly performed by PDS. Specifically, as shown in Figure 5, the PDS is a virtual state of the system after performing a selected action. In addition, we further assume that the buffer state changes from the current state to the PDS, and, afterwards, the channel state and power management state change from the PDS to the next state.

Figure 5

PDS model.

Let ${\tilde{s}}_{n} = ({\tilde{b}}_{n}, {\tilde{h}}_{n}, {\tilde{x}}_{n}) \in \tilde{S}$ denote the PDS. The state transition probability function can then be written as

\begin{matrix} p (s^{'} ∣ s, a) = {\tilde{p}}_{a} (p (s, a), s^{'}), \end{matrix}

(16)

where $p : S \times A \to \tilde{S}$ is the transition from the current state to the PDS, which captures the known impact of the performed action a. The transition probability from the PDS to the next state is defined as ${\tilde{p}}_{a} : \tilde{S} \times S \to [0,1]$, which captures the stochastic impact of the action a. The design objective of PDS learning is to obtain an optimal action $(a^{*})$ that minimizes the long-term cost, whose optimal value is denoted by $Q^{*} (s_{n}, a_{n})$. Define the PDS value function for the PDS learning algorithm as

\begin{matrix} {\tilde{Q}}^{*} (\tilde{s}) = \tilde{r} (\tilde{s}) + γ \sum_{s^{'}} {\tilde{p}}_{a} (s^{'} ∣ \tilde{s}, a) Q^{*} (s^{'}), \end{matrix}

(17)

\begin{matrix} Q^{*} (s) = \min_{a \in A} \left\{ r (s, a) + \sum_{\tilde{s}} p (\tilde{s} ∣ s, a) {\tilde{Q}}^{*} (\tilde{s}) \right\}, \end{matrix}

(18)

where $\tilde{r} (\tilde{s})$ is the immediate reward obtained from PDS to the next state. Meanwhile, the immediate reward obtained from the current state to the PDS is denoted by $r (s, a)$ . The discount factor $γ (0 \leq γ \leq 1)$ is the level of “foresight” in making decisions.

The optimal scheme can be calculated by the following formula in traditional Q learning [19]:

\begin{array}{l} π_{Q}^{*} (s) = \arg \min_{a \in A} Q^{*} (s, a) \\ = \arg \min_{a \in A} \left\{ r (s, a) + γ \sum_{s^{'}} p (s^{'} ∣ s, a) Q^{*} (s^{'}) \right\}, \end{array}

(19)

where $r (s, a)$ is the reward obtained by taking action a in state s. From the proof described in Appendix A, the optimal strategy for the PDS algorithm can be calculated by the following formula:

\begin{matrix} π_{PDS}^{*} (s) = \arg \min_{a \in A} \left\{ r (s, a) + \sum_{\tilde{s}} p (\tilde{s} ∣ s, a) {\tilde{Q}}^{*} (\tilde{s}) \right\} . \end{matrix}

(20)

Although PDS learning can reduce action exploration by using the determined information, its action selection does not balance the trade-off between exploration and exploitation. To overcome this problem, we propose an HE-PDS learning algorithm that uses a heuristic function and an evaluation function to improve the algorithm performance. Specifically, the heuristic function indicates the importance of executing an action and the evaluation function indicates its feasibility. Thus, the optimal scheme can be written as follows:

\begin{matrix} π (s_{n}) = \begin{cases} \arg \min_{a_{n}} [Q (s_{n}, a_{n}) + ε H (s_{n}, a_{n}) + ω E (s_{n}, a_{n})], & \text{if } q \leq p, \\ a_{random}, & \text{otherwise}, \end{cases} \end{matrix}

(21)

where $a_{random}$ is an action chosen at random from the available action set A; that is, a nonoptimal action is intentionally selected to obtain information about unknown states. The weights ε and ω control the influence of the heuristic function and the evaluation function, respectively, and q is a random value in the interval (0, 1). The trade-off between exploration and exploitation is controlled by $p (0 \leq p \leq 1)$: the larger p is, the smaller the probability of random selection. The heuristic function $H_{n} (s_{n}, a_{n})$ is used to influence the choice of actions. However, since the majority of the actions cannot meet the optimal requirements, we use the evaluation function $E_{n} (s_{n}, a_{n})$ to reduce the number of candidate actions. In order to minimize the error of the heuristic function and evaluation function, they are defined as

\begin{array}{l} H_{n} (s_{n}, a_{n}) \\ = \begin{cases} \min_{a} Q (s_{n}, a) - Q (s_{n}, a_{n}) + σ, & \text{if } a_{n} = π^{H} (s_{n}), \\ 0, & \text{otherwise}, \end{cases} \end{array}

(22)

\begin{array}{l} E_{n} (s_{n}, a_{n}) \\ = \begin{cases} \hat{Q} (s_{n}, a_{n}) + ε {\hat{H}}_{n} (s_{n}, a_{n}) - \min_{a} (\hat{Q} (s_{n}, a), - ε {\hat{H}}_{n} (s_{n}, a)), & \text{if fail}, \\ 0, & \text{otherwise}, \end{cases} \end{array}

(23)

where σ is a small real value and $π^{H} (s_{n})$ is the action suggested by the heuristic policy. In order to ensure the validity of the exploration process for all state-action pairs, simulated annealing algorithm is used similar to [20]. Thus, the probability that the action a is executed in the current state is given by

\begin{matrix} p (a_{n} = a_{n + 1} ∣ s) = \frac{\exp (Q (s_{n}, a_{n + 1}) / τ_{n})}{\sum_{a \in A} \exp (Q (s_{n}, a) / τ_{n})}, \end{matrix}

(24)

where $τ_{n}$ is the temperature parameter, which controls randomness of the action selection.
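The two selection rules can be sketched in Python as follows. The function names and the list-based representation of the Q, H, and E values of one state are assumptions of this sketch; the first function follows (21) and the second evaluates (24) exactly as written.

```python
import math
import random

def he_pds_action(q_row, h_row, e_row, eps, omega, p, rng=random):
    """Action rule of (21): greedy on Q + eps*H + omega*E if q <= p, else random."""
    if rng.random() <= p:
        scores = [q + eps * h + omega * e
                  for q, h, e in zip(q_row, h_row, e_row)]
        # Index of the action with the smallest combined score.
        return min(range(len(scores)), key=scores.__getitem__)
    return rng.randrange(len(q_row))

def boltzmann_probs(q_row, tau):
    """Action probabilities of (24) for temperature tau."""
    m = max(q_row)  # shifting by the maximum avoids overflow and cancels in the ratio
    weights = [math.exp((q - m) / tau) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]
```

With a large temperature such as $τ = 5000$, the probabilities returned by `boltzmann_probs` are nearly uniform, so all state-action pairs keep being visited; lowering the temperature concentrates the probability mass on fewer actions.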

In summary, the energy-efficient problem is solved as follows. In the nth slot, the HE-PDS algorithm first observes the current state $s_{n}$ and then, based on the observation, selects and executes an action $a_{n}$. Finally, the algorithm obtains the immediate rewards $r (s, a)$ and $\tilde{r} (\tilde{s})$ and enters the next learning cycle. During the learning process, the value ${\tilde{Q}}_{n + 1} ({\tilde{s}}_{n})$ is updated by the following formula:

\begin{matrix} {\tilde{Q}}_{n + 1} ({\tilde{s}}_{n}) ⟵ (1 - α_{n}) \tilde{Q} ({\tilde{s}}_{n}) + α_{n} [\tilde{r} (\tilde{s}) + γ \min_{a \in A} Q_{n} (s_{n + 1})], \end{matrix}

(25)

where $α_{n}$ is the learning rate. $Q_{n + 1} (s, a)$ can converge to the optimal $Q^{*} (s, a)$ when the sequence of learning rates $α_{n}$ meets $\sum_{n = 0}^{\infty} α_{n} = \infty$ , $\sum_{n = 0}^{\infty} {(α_{n})}^{2} < \infty$ and the maximum errors of $Q_{n} (s, a)$ and $Q^{*} (s, a)$ are bounded. The proof can be found in Appendix B.
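A single update step of (25) can be sketched as below. The dictionary-based value table and the function names are assumptions of this sketch, and $α_{n} = 1 / (n + 1)$ is only one example of a learning-rate sequence that satisfies the two summability conditions above.

```python
def learning_rate(n):
    """alpha_n = 1 / (n + 1): the sum diverges while the sum of squares converges."""
    return 1.0 / (n + 1)

def pds_update(q_tilde, s_tilde, r_tilde, next_state_value, gamma, alpha):
    """One step of (25) on a dict q_tilde that maps each PDS to its value estimate.

    next_state_value is the minimum over actions of the current estimate at
    the next state s_{n+1}.
    """
    old = q_tilde.get(s_tilde, 0.0)  # unseen PDS start from zero
    q_tilde[s_tilde] = (1.0 - alpha) * old + alpha * (r_tilde + gamma * next_state_value)
    return q_tilde[s_tilde]
```

Because the table is indexed by the PDS only, one update improves the estimate for every state-action pair that leads to the same PDS, which is what allows PDS-based learning to need fewer samples than learning a separate value per state-action pair.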

4.2. The Procedure of the HE-PDS Learning Algorithm

According to the analysis stated above, the working procedure of HE-PDS learning algorithm is summarized in Algorithm 1.

Algorithm 1: HE-PDS learning algorithm.

(1) Initialization: number of time slots N, $n = 0$, PDS value function ${\tilde{Q}}_{0}$, heuristic function $H (s_{0}, a_{0})$, evaluation function $E (s_{0}, a_{0})$.

(2) While ($n \leq N$) do

(3) Observe the current state $s_{n}$.

(4) Select and execute the action $a_{n}$ obtained from (21).

(5) Observe the immediate reward $\tilde{r} (\tilde{s})$ and the next state $s_{n + 1}$.

(6) Update $H_{n} (s_{n}, a_{n})$, $E_{n} (s_{n}, a_{n})$, and the PDS value ${\tilde{Q}}_{n + 1} ({\tilde{s}}_{n})$ by (22), (23), and (25), respectively.

(7) Set $n \leftarrow n + 1$.

(8) End while

5. Numerical Results and Discussion

In this section, we compare the performance of the proposed algorithm with that of the traditional Q learning and PDS learning algorithms. In the numerical computation, we assume that bits are mapped into QAM symbols by Gray coding at the physical layer, similar to [8, 11]. The buffer length is $B = 25$ packets and the packet length is $L = 5000$ bits. We assume that the channel transition distribution is known. In particular, the channel states and their transition probabilities are listed in Table 2, similar to [10]. The noise power spectral density $N_{0} / 2$ is set to $10^{- 11}$ Watt/Hz. We let the channel bandwidth W be equal to the symbol rate $(W = 1 / T_{s})$, where $T_{s}$ is the duration of one M-QAM symbol and $1 / T_{s} = 500 \times 10^{3}$ symbols/second.

Table 2

Channel states and transition probabilities.

Channel states	$φ_{k}$	$p (k, k - 1)$	$p (k, k)$	$p (k, k + 1)$
1	0.1068	0	0.8387	0.1613
2	0.2301	0.1613	0.6669	0.1718
3	0.3760	0.1718	0.6612	0.1670
4	0.5545	0.1670	0.6841	0.1489
5	0.7847	0.1489	0.7330	0.1181
6	1.1090	0.1181	0.8096	0.0723
7	1.6636	0.0723	0.9277	0

Following typical 802.11a/b/g applications, we let $P_{idle}$ be 0.05 Watt and let both $P_{on}$ and $P_{tr}$ be 0.31 Watt. In addition, the BER takes values in $\{2, 4, 8, 16, 32\} \times 10^{- 8}$ and the number of packets transmitted per time slot ranges from 0 to 10. Thus, the number of state-action pairs in our simulation is $7 \times 26 \times 2 \times 5 \times 11 \times 2$. The other parameters are set as follows: $Δ t = 1$ ms, $γ = 0.98$, $ε = ω = 1$, $p = 0.78$, $σ = 0.005$, and $τ = 5000$.
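The size of the state-action space can be checked with a short Python calculation. The assignment of each factor to a model quantity is our reading of the product and of the settings above; in particular, the factor 11 is taken to be the number of possible packet counts per slot, and the dictionary keys are labels chosen for this sketch.

```python
from math import prod

# Cardinality of each component of the joint state and joint action.
sizes = {
    "channel states": 7,
    "buffer states (0 to B = 25)": 26,
    "power management states": 2,
    "BER values": 5,
    "packets per slot": 11,
    "power management actions": 2,
}
num_pairs = prod(sizes.values())
print(num_pairs)  # 40040
```

A table with roughly forty thousand entries is small enough to be stored explicitly, but large enough that the number of visits needed per entry dominates the learning time.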

5.1. Performance Comparison under the Fixed Delay and Throughput Constraints

Figure 6 compares the cumulative average costs over 80000 time slots under the maximum delay ($4 / B$ packets) and throughput ($0.1 / B$ packets) constraints. In each subgraph, the horizontal coordinate represents the simulation time slot, and the vertical coordinate denotes the cumulative average total cost, delay overhead, throughput cost, or energy consumption in the corresponding slot. From (a), (b), and (c), we observe that the HE-PDS and PDS algorithms reduce the cumulative average total cost, delay overhead, and throughput cost by a factor of about ten compared to the Q algorithm. In addition, these three metrics are about 8%, 10%, and 9% lower, respectively, for the HE-PDS algorithm than for the PDS algorithm. Furthermore, as shown in (d), the Q algorithm has lower energy consumption at the beginning of the simulation, but its cost increases and stabilizes after about the 40000th time slot. Since they have no experience of the environment initially, the PDS and HE-PDS strategies have larger power costs at the start of the simulation; however, these costs decrease as the simulation proceeds. In addition, PDS incurs higher power costs than HE-PDS, since HE-PDS balances the trade-off between exploration and exploitation, which results in a sharper decline in consumption in (d).

Figure 6

Performance comparison under the constraint of fixed delay and throughput. (a) Cumulative average total cost. (b) Cumulative average delay overhead. (c) Cumulative average throughput cost. (d) Cumulative average energy consumption.

5.2. Performance Comparison under Various Delay Limits

To validate the performance of the HE-PDS algorithm for different delay limits, taking values of $[3, 4, 5, 6, 7, 8, 9, 10, 11] / B$ packets, Figure 7 shows the delay-energy trade-off obtained by the three algorithms. From Figure 7, we observe that the power costs of the HE-PDS algorithm are about 50% and 28% lower than those of the traditional Q algorithm and the PDS algorithm, respectively. We also observe that the power costs of all three algorithms decrease as the delay constraint increases. Besides, the Q algorithm reaches its steady state at $9 / B$ packet/slot and the PDS algorithm at $8 / B$ packet/slot, whereas the HE-PDS algorithm already reaches its steady state at $5 / B$ packet/slot. This suggests that HE-PDS has an obvious advantage in energy efficiency under various delay constraints.

Figure 7

Delay-energy trade-off for different algorithms.

5.3. Performance Comparison under Various Throughput Limits

To verify the performance under various throughput limits, Figure 8 compares the three algorithms under the throughput limits $[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] / B$ packets over 80000 simulated time slots. From Figure 8, the proposed HE-PDS algorithm finds the optimal policy at $0.5 / B$ packet/slot, which is about $0.2 / B$ packet/slot lower than for the other two algorithms. In addition, Figure 8 further confirms that the proposed algorithm significantly outperforms the Q and PDS algorithms, since HE-PDS decreases the power costs by about 100% and 56% compared to the Q algorithm and the PDS algorithm, respectively, under the different throughput constraints. This observation is in accordance with formula (14).

Figure 8

Throughput-energy trade-off for different algorithms.

5.4. Algorithm Convergence Analysis

In this section, we evaluate how the discount factor γ affects the convergence of the HE-PDS energy-efficient algorithm, with γ set to 0.98 and 0.85. We also set D to $4 / B$ packets and T to $0.1 / B$ packets. Figure 9(a) shows the energy consumption under the fixed delay and throughput constraints for the two values of γ, and Figures 9(b) and 9(c) show the convergence fluctuations of the algorithms. For simplicity of illustration, we define the relative fluctuation function at time slot n as $ψ (n) = \log_{M} (∥ ρ (n + 1) - ρ (n) ∥ / ∥ ρ (n + 1) ∥)$, where ρ is the energy consumption and M is a real value. A smaller ψ value therefore reflects a smaller energy fluctuation, which means a faster convergence speed. From Figure 9(a), we observe that the proposed algorithm converges to a lower energy consumption than the other two algorithms. For example, when γ is 0.98, the HE-PDS algorithm converges to an energy consumption of approximately 170 mJ, while the Q and PDS algorithms converge to about 300 mJ and 290 mJ, respectively. In addition, when γ is reduced to 0.85, the proposed algorithm converges to an energy consumption about 70 mJ lower than that of the other two algorithms, both of which converge to about 220 mJ. Furthermore, as shown in (b) and (c), the proposed algorithm obtains the lowest relative fluctuation values, which means that HE-PDS has the fastest convergence rate. Specifically, when γ is 0.98, the ψ value of the HE-PDS algorithm is about 21% and 23% smaller than those of the PDS and Q algorithms, respectively. When γ is 0.85, it is about 15% and 17% smaller. This improvement is due to the fact that the proposed algorithm explicitly uses the heuristic function and the evaluation function to effectively reduce the number of actions to be chosen. Consequently, the relative fluctuation results confirm that the HE-PDS algorithm achieves a clear convergence improvement.
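The relative fluctuation function can be computed as in the following Python sketch. The function name is hypothetical, the energy series is assumed to be scalar so that the norms reduce to absolute values, and the base $M = 10$ is an arbitrary choice made for illustration.

```python
import math

def relative_fluctuation(rho, n, M=10.0):
    """psi(n) = log_M(|rho[n+1] - rho[n]| / |rho[n+1]|) for a scalar energy series.

    Undefined when rho[n+1] equals rho[n] or when rho[n+1] is zero.
    """
    return math.log(abs(rho[n + 1] - rho[n]) / abs(rho[n + 1]), M)
```

For example, a change from 90 to 100 between two consecutive slots gives a relative change of 0.1 and therefore a ψ value of about -1 in base 10; each further tenfold reduction of the relative change lowers ψ by one.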

Figure 9

Algorithm convergence with different discount factor γ. (a) Energy consumption convergence speed of the three strategies with $γ = 0.98$ and $γ = 0.85$ . (b) Algorithm relative fluctuation function with $γ = 0.98$ . (c) Algorithm relative fluctuation function with $γ = 0.85$ .

6. Conclusion

In this paper, we investigated the impact of cooperative communications and designed a cooperative cross-layer algorithm for energy-efficient transmission in wireless networks subject to both transmission delay and throughput constraints. Given the dynamic buffer, time-varying channel states, and system-level power consumption in a point-to-point transmission environment, the problem is formulated as a CMDP and further converted into an unconstrained MDP (UMDP) by Lagrange multipliers. We propose an HE-PDS learning algorithm that exploits the determinate state information and uses a heuristic function and an evaluation function to achieve an optimal energy-efficient strategy. Furthermore, the performance of different energy-efficient strategies is compared and the proposed scheme is verified through simulations. The results highlight that the proposed algorithm has much better energy-efficient performance and faster convergence speed than the other typical state-of-the-art schemes.

Appendices

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (no. 61072138 and no. 61379005) and Southwest University of Science and Technology (12zx7127).
