Wireless sensor networks have been widely used in many fields, such as structural health monitoring and artificial intelligence technology. Routing planning, an important part of a wireless sensor network, can be formalized as an optimization problem to be solved. In this article, a reinforcement learning algorithm is proposed to solve the optimal routing problem in wireless sensor networks, namely an adaptive TD(λ) learning algorithm, referred to as ADTD(λ), under Markovian noise, which is more practical in reinforcement learning than i.i.d. (independently and identically distributed) noise. Moreover, we present a non-asymptotic analysis of ADTD(λ) with both constant and diminishing step-sizes, and derive explicit convergence rates in terms of the number of iterations in both cases. In addition, the performance of the algorithm is verified by simulation.
The application of artificial intelligence (AI) technology makes networks more intelligent, for example, through intelligent management of network resources,1 intelligent firewalls,2 intrusion defense,3 congestion control,4,5 and privacy protection.6 In addition, AI technology has gradually been applied to wireless sensor networks.7–9 Wireless sensor networks collect, process, and transmit data through sensors, and the routing strategy is a key problem in data transmission, so we can model it as an optimization problem to be solved by reinforcement learning. Reinforcement learning is the process of interaction between an agent and its environment: the agent observes the state of the environment and takes a corresponding action; the environment then feeds back a reward to the agent and transitions to the next state. The goal of reinforcement learning is to find a policy that maximizes the cumulative reward, and this process can generally be formalized as a Markov decision process (MDP). To achieve this goal, we need to design reinforcement learning algorithms.
Under a given policy, how to evaluate the cumulative reward is crucial for designing a reinforcement learning algorithm; this is also known as the policy evaluation problem. Temporal-difference (TD) learning, proposed by Sutton,10 can solve this problem effectively. TD learning is generally combined with function approximation, and a function approximator is used to update the parameters of the value function. Many scholars have analyzed the performance of TD learning in previous studies.11–14 Recently, asymptotic analyses of TD learning were presented.15–17 Although asymptotic analysis is useful, finite-time bounds are needed to quantify convergence rates, and finite-time analyses of TD learning have been established in previous studies.18–21 However, they all studied the non-asymptotic convergence of TD(0) and its variants. In fact, TD(0) is an extreme case of TD(λ); the difference is that TD(λ) introduces the eligibility trace vector, which is a weighted sum of the previous gradients. Eligibility traces have clear computational advantages and can noticeably improve the algorithm's performance in practice. Therefore, TD(λ) is known to perform better than TD(0), whereas its finite-time analysis is more intricate.
Recently, there have been several theoretical analyses of TD(λ).22,23 In addition, Srikant and Ying24 proved a finite-time bound for TD(λ) with constant step-size, and Bhandari et al.25 proved a non-asymptotic analysis of TD(λ) using projection. Nevertheless, both used the SGD algorithm to update parameters and adopted a constant step-size, which makes the algorithm sensitive to the choice of step-size. In recent years, Adam-type algorithms have been successfully applied in reinforcement learning, for example, by Gupta et al.,26 Stooke and Abbeel,27 and Papini et al.28 Lately, Xiong et al.20 performed a finite-time analysis of Adam-type TD(0) under Markovian sampling and used an information-theoretic technique to control the gradient bias, but that work was not extended to TD(λ). In addition, Sun et al.29 proved the asymptotic convergence of TD(λ) with adaptive gradient descent. Although they used the Adam algorithm, they assumed that the sampling process was i.i.d., which is rare in practice, where samples are typically governed by a Markov process. Therefore, how to design an Adam-type TD(λ) algorithm under Markovian sampling and analyze its convergence rate is a problem that remains to be solved.
To fill this gap, we propose an adaptive TD(λ) algorithm under Markovian noise, referred to as ADTD(λ). We adopt an Adam-type update to accelerate convergence and reduce sensitivity to step-size selection. Moreover, we present a non-asymptotic analysis of ADTD(λ) with constant and decreasing step-sizes, respectively. In addition, we employ projection operators and information-theoretic techniques to control the gradient bias.
Our main contributions are elaborated below:
To solve the routing planning problem in wireless sensor networks, we present an adaptive TD(λ) algorithm with linear function approximation under Markovian sampling, referred to as ADTD(λ). ADTD(λ) combines Adam-type methods, projection operators, and information-theoretic techniques.
Through rigorous convergence analysis, we obtain upper bounds on the error between the estimated parameter and the optimal parameter under different step-size conditions; explicit convergence rates in terms of the number of iterations are derived for both constant and diminishing step-sizes.
The rest of the article is organized as follows: In the "Related work" section, we review the literature. In the "Preliminaries" section, we introduce some basics of the TD algorithm. In the "Algorithm design and assumptions" section, we present our main algorithm ADTD(λ) and some necessary assumptions. In the "Main results" section, we state the convergence results of ADTD(λ). In the "Convergence analysis" section, which is the core of the article, we prove the convergence results in detail. In the "Simulations" section, we verify the performance of the algorithm through simulation experiments. Finally, we summarize the work of this article in the "Conclusion" section.
Notations: The notation $\|x\|$ denotes the norm of a vector $x$ (the Euclidean norm unless otherwise specified). The symbol $x \odot y$ denotes the Hadamard (element-wise) product of the vectors $x$ and $y$, and $\Pi$ denotes a projection operator. The superscript $\top$ denotes the transpose of a matrix or vector. The notation $\mathcal{O}(\cdot)$ indicates that constant factors are omitted when we perform the complexity analysis of the algorithm.
Related work
In this section, we review the literature on TD learning and adaptive algorithms, divided into two parts.
TD learning
As one of the most popular research directions in recent years, reinforcement learning has advanced considerably. Its theoretical foundations in particular have been continuously enriched, which has also made reinforcement learning more widely used in practice than ever before. The TD(λ) algorithm was first proposed by Sutton.10 After this, Dayan30 proved the mean convergence of TD(λ), and Dayan and Sejnowski11 proved that TD(λ) converges with probability 1. Tsitsiklis and Van Roy13 proved asymptotic error bounds for linear TD(λ). In the last decade, GQ(λ),31 HTD(λ),32 emphatic TD(λ),33 and other variants have been put forward. Recently, Bhandari et al.25 used projection to prove a finite-time analysis of centralized TD(λ) under Markovian sampling. Similarly, Srikant and Ying24 did not use the projection method but used a Lyapunov function to prove a finite-time bound with constant step-size; however, neither of them considered adaptive step-sizes. Sun et al.29 proved the convergence of adaptive gradient descent TD(λ), but they assumed that the samples are under i.i.d. noise, which is less practical than Markovian noise. Therefore, we propose a TD(λ) variant, that is, TD(λ) under Markovian sampling, and combine it with Adam to accelerate convergence while implementing an adaptive step-size.
Adaptive algorithms
Kingma and Ba34 first proposed the Adam algorithm, which performs well in practical applications. However, Reddi et al.35 showed that the Adam algorithm in Kingma and Ba34 might not converge. Therefore, they proposed AMSGrad, which retains Adam's structure with only minor modifications but comes with a convergence guarantee. In addition, many Adam-related analyses have been proposed, such as Zou and Shen36 and Li and Orabona.37 In this article, we provide the convergence analysis for adaptive TD(λ) under Markovian noise.
Preliminaries
In wireless sensor networks, a sensor node can be regarded as an agent, the routing table saved by the node represents its environment, and the node determines its action, that is, the next-hop address, through the routing table and the received return. General routing strategies have three optimization objectives: the first is to find the fastest path to the target node, the second is to ensure that the information can still be transmitted to the target node when some nodes fail during transmission, and the third is to ensure load balancing. We can choose different reward functions for different optimization objectives; for example, to ensure load balancing, the goal is to equalize the workload of each node, so the corresponding reward at each instant can be set as a function inversely proportional to the number of forwarding times. This process corresponds directly to the reinforcement learning model; therefore, we can use a reinforcement learning algorithm to solve the routing planning problem in wireless sensor networks.
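To make the load-balancing objective concrete, the following minimal Python sketch shows one way such a reward could be defined; the function name, the `forward_count` table, and the scaling constant are illustrative assumptions rather than the authors' actual reward.

```python
# Minimal sketch (assumed names): a reward that decreases as a node's
# forwarding count grows, which encourages load balancing across nodes.
def load_balance_reward(node_id, forward_count, scale=1.0):
    """Reward inversely proportional to how often `node_id` has already
    forwarded packets (hypothetical reward shaping for load balancing)."""
    return scale / (1.0 + forward_count[node_id])

# Toy usage: a rarely used node receives a larger reward than a busy one.
forward_count = {"A": 0, "B": 5, "C": 2}
print(load_balance_reward("A", forward_count))  # 1.0
print(load_balance_reward("B", forward_count))  # ~0.167
```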
To prepare for the following content, we will introduce some preliminaries about reinforcement learning and TD algorithm.
TD algorithm
An MDP describes the main elements of reinforcement learning as a 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the finite state space of the environment, $\mathcal{A}$ represents the set of actions that the agent can take, $\mathcal{P}$ is the probability of transition from one state to the next state, $\mathcal{R}$ is the reward received for moving to the next state after taking a certain action, and $\gamma \in (0, 1)$ is a discounting factor, indicating that the further away a moment is, the less influence it has on the present. The goal of reinforcement learning is to maximize the cumulative discounted reward, which is called the value function, expressed by $V^{\pi}(s)$, and takes the following form
where $s'$ is the next state of $s$ and $r$ is the reward obtained in the transition from state $s$ to state $s'$.
The above equation can be written as the Bellman equation
where $P(s' \mid s)$ is the probability of transition from state $s$ to state $s'$.
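For reference, the two displayed equations referred to above can be written in their standard forms; the notation below is the conventional one and may differ slightly from the symbols used in the original.

```latex
% Discounted value function of a policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{t+1} \,\Big|\, s_0 = s\right],
% and its Bellman (fixed-point) form
V^{\pi}(s) = \sum_{s' \in \mathcal{S}} P(s' \mid s)\,\big[ r(s, s') + \gamma\, V^{\pi}(s') \big].
```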
In general, we can parameterize the value function as $V_{\theta}(s)$, and this approximation can be used to approximate the real value function $V^{\pi}(s)$. Here we use linear function approximation, which is more conducive to the subsequent proofs. $V_{\theta}(s)$ takes the form
where $\phi(s)$ is the basis-function (feature) vector of state $s$, and $\theta$ is the parameter vector. To make the notation more concise, we collect the feature vectors into a compact matrix $\Phi$
where $|\mathcal{S}|$ is the size of the state space and $d$ is the feature dimension. Then, we get
Thus, the approximate value function can be written compactly as $V_{\theta} = \Phi\theta$. The parameters of the TD algorithm are updated according to the following rules:
The quantity $\delta_t$ is the estimation error at the current time, which is called the TD error. It can be written as
The parameters are updated as follows
We can view $\delta_t\phi(s_t)$ as the "gradient" in the traditional stochastic gradient descent method, although it is not a true gradient. Then, we define $g_t(\theta_t)$ as
where $O_t$ is the information observed from the environment at the $t$-th state transition, such as $(s_t, r_{t+1}, s_{t+1})$.
The parameter update rule can also be written as
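For concreteness, the TD(0) recursion with linear function approximation summarized above is conventionally written as follows (standard notation, stated here only for reference):

```latex
% TD error at time t
\delta_t = r_{t+1} + \gamma\,\phi(s_{t+1})^{\top}\theta_t - \phi(s_t)^{\top}\theta_t,
% semi-gradient direction built from the observation O_t = (s_t, r_{t+1}, s_{t+1})
g_t(\theta_t) = \delta_t\,\phi(s_t),
% parameter update with step-size \alpha_t
\theta_{t+1} = \theta_t + \alpha_t\, g_t(\theta_t).
```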
TD(λ) algorithm
The TD(λ) algorithm elegantly captures the relationship between the Monte Carlo algorithm and the TD algorithm through the parameter $\lambda \in [0, 1]$. When $\lambda = 0$, it is the TD(0) algorithm, and when $\lambda = 1$, it is the Monte Carlo algorithm. Generally speaking, when $\lambda$ takes an intermediate value, it performs better than the two extreme algorithms.
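One common way to make this interpolation explicit is through the λ-return, which averages the n-step returns with geometrically decaying weights; this is the textbook formulation and is stated here only for intuition.

```latex
% n-step return and the λ-return (forward view of TD(λ))
G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V(s_{t+n}),
\qquad
G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}.
```

Setting $\lambda = 0$ recovers the one-step TD target, while letting $\lambda \to 1$ recovers the Monte Carlo return.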
Unlike TD(0), TD(λ) has an extra eligibility trace vector $z_t$. In TD(0), we used the direction of the feature vector to update the parameter; in TD(λ), the parameter update instead uses the direction of the eligibility trace $z_t$, which follows the update rule
The eligibility trace $z_t$ is used to capture the influence of each feature on the TD error, and by simplifying it, we can write it as
The parameters are updated as follows
As in TD(0), we can also define $g_t(\theta_t)$, which plays the same role as the "gradient" in the gradient descent method. The definition of $g_t(\theta_t)$ is as follows
Parameter updates can also be represented by the following equation
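In conventional notation, the eligibility-trace recursion and the resulting TD(λ) update take the following form (restated here for completeness; the symbols may differ from those in the original):

```latex
% eligibility trace: geometrically discounted sum of past feature vectors
z_t = \gamma\lambda\, z_{t-1} + \phi(s_t), \qquad z_{-1} = 0,
% TD error as in TD(0)
\delta_t = r_{t+1} + \gamma\,\phi(s_{t+1})^{\top}\theta_t - \phi(s_t)^{\top}\theta_t,
% semi-gradient direction and parameter update
g_t(\theta_t) = \delta_t\, z_t, \qquad \theta_{t+1} = \theta_t + \alpha_t\, g_t(\theta_t).
```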
These are the basics of the original TD(λ) algorithm; in this article, however, we consider adding an Adam-type update on top of the original TD(λ). Therefore, in the next section, we further develop the algorithm.
Algorithm design and assumptions
In this section, we present the main algorithm in detail and propose some basic assumptions for subsequent convergence analysis.
Algorithm design
Our algorithm combines TD(λ) and AMSGrad. $\beta_1$ and $\beta_2$ are hyper-parameters with $\beta_1, \beta_2 \in [0, 1)$. $m_t$ is the first-order momentum and $v_t$ is the second-order momentum. $\hat{V}_t$ is a diagonal matrix composed of $\hat{v}_t$. Moreover, $\Pi$ is a projection operator onto a norm ball of bounded radius. The details are in Algorithm 1:
Algorithm 1. ADTD(λ): Adaptive TD(λ) under Markov Sampling.
for t = 0, 1, …, T − 1 do
  update the eligibility trace and compute the TD error and the semi-gradient direction
  update the first- and second-order momenta and the AMSGrad maximum
  apply the adaptive step and project the parameters onto the norm ball
end for
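The following minimal Python sketch illustrates the overall structure of such an update loop, combining the TD(λ) semi-gradient with an AMSGrad-style adaptive step and a projection onto a norm ball. It is a schematic reconstruction under standard AMSGrad conventions, not the authors' exact pseudocode; names such as `env.step`, `features`, and `radius` are assumptions.

```python
import numpy as np

def adtd_lambda(env, features, num_steps, alpha=0.01, lam=0.5, gamma=0.95,
                beta1=0.9, beta2=0.999, eps=1e-8, radius=10.0):
    """Schematic ADTD(lambda): TD(lambda) direction + AMSGrad step + projection.
    Assumed interface: env.reset() -> state, env.step(s) -> (reward, next_state),
    features(s) -> feature vector phi(s)."""
    s = env.reset()
    phi = features(s)
    d = len(phi)
    theta = np.zeros(d)      # value-function parameters
    z = np.zeros(d)          # eligibility trace
    m = np.zeros(d)          # first-order momentum
    v = np.zeros(d)          # second-order momentum
    v_hat = np.zeros(d)      # AMSGrad element-wise maximum of v
    for _ in range(num_steps):
        r, s_next = env.step(s)
        phi_next = features(s_next)
        z = gamma * lam * z + phi                            # eligibility trace
        delta = r + gamma * phi_next @ theta - phi @ theta   # TD error
        g = delta * z                                        # TD(lambda) semi-gradient
        m = beta1 * m + (1 - beta1) * g                      # first-order momentum
        v = beta2 * v + (1 - beta2) * g * g                  # second-order momentum
        v_hat = np.maximum(v_hat, v)                         # AMSGrad "max" correction
        theta = theta + alpha * m / (np.sqrt(v_hat) + eps)   # adaptive step
        norm = np.linalg.norm(theta)
        if norm > radius:                                    # projection onto norm ball
            theta = theta * (radius / norm)
        s, phi = s_next, phi_next
    return theta
```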
Assumptions
We specify some fundamental assumptions, which are also standard in previous studies,24,25,29 and these assumptions are taken to hold throughout the rest of this article.
Assumption 1
The Markov chain associated with the policy is ergodic, which also means the Markov chain mixes uniformly, and the following inequality holds
where $\mu$ is the stationary distribution of the chain, and the remaining quantities are constants.
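A standard way to state this uniform mixing condition (as used, e.g., in Bhandari et al.25 and Srikant and Ying24) is the following; the constants $m > 0$ and $\rho \in (0, 1)$ are the conventional names and stand in for the constants referred to above.

```latex
% uniform (geometric) mixing of the underlying Markov chain
\sup_{s \in \mathcal{S}} \; d_{TV}\!\big( \mathbb{P}(s_t \in \cdot \mid s_0 = s),\; \mu \big) \;\le\; m\,\rho^{\,t}, \qquad \forall\, t \ge 0,
```

where $d_{TV}$ denotes the total-variation distance.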
Assumption 2
We assume that the columns of the feature matrix $\Phi$ are linearly independent. Moreover, we assume that the feature vectors are uniformly bounded, that is, $\|\phi(s)\| \le 1$ for all $s \in \mathcal{S}$.
Assumption 3
We assume that the rewards are uniformly bounded; that is, $|r_t| \le r_{\max}$ for all $t$, where $r_{\max}$ is a constant.
Assumption 4
We assume that the optimal parameter $\theta^{*}$ is uniformly bounded; that is, $\|\theta^{*}\|$ is bounded by a constant.
Main results
The following content presents the main theoretical results. Specifically, we obtain upper bounds on the error between the estimated parameter and the optimal parameter under different step-size conditions. In the next section, we prove them in detail.
Theorem 1
Under Assumptions 1–4 above, with the hyper-parameters chosen as stated and a constant step-size, and with $T$ representing the number of iterations, Algorithm 1 satisfies the following bounds.
(a) when
where
(b) when
where
Remark 1
Under the constant step-size and the stated hyper-parameters, the results of Theorem 1 show that ADTD(λ) can converge to a neighborhood of the optimal parameter $\theta^{*}$ at the stated rate, where the quantities appearing in the bound are constants. Moreover, the error upper bound between the estimated parameter and the optimal parameter depends on the dimension of $\theta$, and the size of the neighborhood scales with the constant step-size.
Theorem 2
Under Assumptions 1–4 above, with the hyper-parameters and initialization chosen as stated and a diminishing step-size, and with $T$ denoting the number of iterations, we have the following bounds.
(a) when
where
(b) when
where
Remark 2
Similar to Theorem 1, the quantities appearing in the bound are constants independent of time; we can adjust the step-size to trade off convergence speed against convergence accuracy. With the diminishing step-size, ADTD(λ) converges to a neighborhood of $\theta^{*}$ at the stated rate. Moreover, the bound again depends on the dimension of $\theta$.
Convergence analysis
In this section, we analyze the convergence performance of ADTD(λ). In order to obtain Theorems 1 and 2, we first need several lemmas to set the stage.
Lemma 1
To bound the norm of the update direction, for any parameter $\theta$, we have (Lemma 3)25
where and .
Since the gradient estimate is biased under Markovian sampling, we also use information-theoretic techniques to control this bias in this article.
Lemma 2
Under Assumption 1, let $A$ and $B$ be two random variables such that the transition from $A$ to $B$ can be described by a Markov process. Let $A'$ and $B'$ be independent random variables drawn from the marginal distributions of $A$ and $B$, respectively. Then, for any bounded function $f$, we have (information-theoretic technique, Lemma 3)25
where and .
Furthermore, in order to obtain the final conclusion, we need the following result. To keep the article self-contained, we repeat the proof of Lemma 3.
Lemma 3
Under Assumptions 2–4, for , we assume that , . Then, for , we have (Lemma 17)25
Proof
According to the definition of , we have
and then take the bounds of and , respectively
where the first equation holds because of the definition of , the first inequality is based on the norm inequality, and the last inequality uses Assumption 1 that , , and .
Next, we bound the second term
where the first inequality holds since . Plugging (21) and (22) into (20), we can obtain
We note that . Similarly, we can get for . Let ; we can get . As , so , we omit this proof here.
Next, we need to get some useful bounds that will prepare for the final theorem proof.
Lemma 4
Under Assumptions 2–4, for , we assume that , . By Lemma 3, we have . For any , we can obtain the bound as follows
Proof
We use the inductive hypothesis to bound and . We first assume that and ; by the definition of , we obtain
Therefore, we obtain , for .
Next, we bound . We also first assume that , and . By the definition of , we obtain
where the second inequality holds due to Lemma 3
We have . Therefore, we obtain that , for all .
Next, we will discuss the function , which represents the error in the updating direction at instant . We define it as follows
where , and .
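For reference, one common choice for this error function in TD analyses of this kind (e.g., Bhandari et al.25) is the inner product between the parameter error and the gap between the Markovian semi-gradient and its steady-state mean; we state it here in conventional notation as an assumed form, which may differ from the original definition.

```latex
% bias of the update direction at time t (conventional form)
\zeta(\theta, O_t) = \big\langle \theta - \theta^{*},\; g(\theta, O_t) - \bar{g}(\theta) \big\rangle,
\qquad
\bar{g}(\theta) = \mathbb{E}_{O \sim \mu}\big[\, g(\theta, O) \,\big].
```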
To obtain , we first need to obtain some properties of .
Lemma 5
Under Assumptions 2–4, for , we assume that , . Besides, for any . Combining the results of Lemma 3 and Lemma 4, for any , we can obtain the properties as follows (Lemma 19)25
Proof
We only prove one part of Lemma 5 here, namely that the error function satisfies a Lipschitz property. The other parts of Lemma 5 are similar to Bhandari et al.25 and we omit them here
where the first inequality holds by the elementary inequality for four vectors, and the last inequality is established by the boundedness conditions. In the penultimate inequality, the two intermediate bounds can easily be proved as follows
Similarly, we can obtain .
Using Lemma 5, we obtain some properties of , and then we can solve for .
Lemma 6
Under Assumptions 1–4 and Lemma 5, considering the step-size sequence is non-increasing, , we can obtain
(1) For
(2) For all
(3) For
Proof
Case (1): When , we obtain
To obtain this bound, we need to bound the three terms separately.
Item I: According to the Lipschitz property of in Lemma 5, we can obtain
Item III: We use the information-theoretic technique to bound this term. We consider the corresponding state sequence, which is also a Markov chain
We define a function
where random variables and were drawn independently from the marginal distributions of and , respectively. So
According to Lemma 2, we get
Using Lemma 5, we obtain
Plugging (42) and (43) into (41), we can obtain
Now, let , merging (35), (38), and (44), we can obtain
where the second inequality holds because step-size sequence is non-increasing, , and , . The last inequality follows from .
Case (2): For all , we obtain
Combining (35) and (38), we can obtain
For the last term of (46)
This last equation follows from .
Therefore, combining (47) with (48), we can obtain
Case (3): For , similar to case (2), we can easily obtain
where the second and the last inequalities hold by the stated conditions
With the assumptions and lemmas in mind, we start to prove Theorem 1; detailed proofs are shown below.
Proof
First, according to the definition of , we have
where the second inequality is established due to Young’s inequality and the last inequality holds because .
Next, we take the expectation of (51)
where the second inequality holds because of Lemma 1 and the third inequality holds since . The last inequality is obtained by Lemma 5 and . is the dimension of , and in Assumption 4.
By rearranging terms of (52) and summing terms from to , we obtain
where the last inequality holds due to , and .
Next, we will discuss the case of
where the second inequality holds since , , , and , and the last inequality follows from
We assume that
When , we obtain
the second inequality holds because of Lemma 6, and the last inequality holds since . Therefore, we can obtain
where
When , we have
In the same way, we can obtain
where
This completes the proof of Theorem 1.
Now, we will consider decreasing step-size and the detailed proof is as follows.
Proof
The proof is similar to that of Theorem 1 up to (53). With the diminishing step-size, we can proceed as follows
where the second inequality holds because of the step-size choice and the last inequality is established by the stated conditions
When , we have
Hence, we can simply obtain
where
When
In the same way, we can obtain
where
Simulations
In this section, we verify the theoretical results through a small-ball experiment similar to the experiments in Lin and Ling38 and Lowe et al.39 The agent interacts with the environment, and the goal is to find the target node through continuous learning. The reward obtained by the agent after each action is inversely proportional to the distance to the target node; that is, the closer it is to the target node, the greater the reward. The experimental parameters are set as follows: the discount factor $\gamma$ and the hyper-parameters $\beta_1$ and $\beta_2$ are fixed across all runs. We conducted four groups of comparative experiments, and the results are shown in Figures 1–4. Next, we introduce these figures separately.
Figure 1. Comparison at the same initial step-size and different λ. (a) Decreasing step-size and (b) constant step-size.
Figure 2. Comparison at the same initial step-size and different λ. (a) Decreasing step-size and (b) constant step-size.
Figure 3. Comparison at the same λ and different initial step-sizes. (a) Decreasing step-size and (b) constant step-size.
Figure 4. Comparison at the same λ and different initial step-sizes. (a) Decreasing step-size and (b) constant step-size.
The initial step-sizes used in Figures 1 and 2 differ, and we compared the ordinary TD(λ) algorithm with the ADTD(λ) algorithm for decreasing step-sizes (Figures 1(a) and 2(a)) and constant step-sizes (Figures 1(b) and 2(b)). The decreasing step-size depends on the iteration index, so the step-size becomes smaller as time increases. Given the quantity plotted on the ordinate, the parameter-update changes of the ADTD(λ) algorithm with a decreasing step-size are smaller than those with a constant step-size, and because ADTD(λ) incorporates the AMSGrad algorithm, it converges faster and more stably. The experimental results show that our ADTD(λ) algorithm converges faster and is more stable in either case.
In Figure 3, we fix λ and run the experiments with a decreasing step-size (Figure 3(a)) and a constant step-size (Figure 3(b)). Overall, the ADTD(λ) algorithm always converges first at different initial step-sizes. In Figure 3(b), the parameter update adopts a constant step-size, so the update amplitude increases as the step-size setting increases; it can also be seen that the fluctuation of the TD(λ) algorithm grows as the step-size increases, and the ADTD(λ) curve rises correspondingly. Therefore, for a constant step-size, the smaller the step-size, the more stable the curve. Although the convergence speeds are sometimes similar in Figure 3(b), the ADTD(λ) algorithm is more stable than the TD(λ) algorithm because it uses an adaptive step-size.
In Figure 4, we again fix λ and run the experiments with a decreasing step-size (Figure 4(a)) and a constant step-size (Figure 4(b)). It should be noted that in Figure 4(b), for the largest step-size setting, the algorithm cannot converge. The reason is that the larger λ and the step-size are, the worse the performance, because the admissible range of the step-size becomes stricter.
Conclusion
We use reinforcement learning to solve the optimal routing problem in wireless sensor networks. To this end, we propose an adaptive TD(λ) algorithm, referred to as ADTD(λ), and establish a non-asymptotic analysis of it under Markovian sampling. Furthermore, for ADTD(λ) with linear function approximation, we show that the algorithm converges to a neighborhood of the global optimum for both constant and decreasing step-sizes. However, under some parameter settings, the algorithm oscillates noticeably and its performance does not differ much from that of the original algorithm; combining it with variance reduction techniques could further improve its performance.
Footnotes
Handling Editor: Francesc Pozo
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (61871430, 61976243); the Key Technologies R&D Program of Henan Province (222102210049); and the Scientific and Technological Innovation Team of Colleges and Universities in Henan Province of China (20IRTSTHN018, 21IRTSTHN015).
ORCID iDs
Muhua Liu
Mingchuan Zhang
Qingtao Wu
Data availability statement
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
References
1.
Song F, Ai Z, Zhou Y, et al. Smart collaborative automation for receive buffer control in multipath industrial networks. IEEE Trans Industr Inform 2020; 16(2): 1385–1394.
2.
Hu D, Hu X, Jiang W, et al. Intelligent digital image firewall system for filtering privacy or sensitive images. Cogn Syst Res 2019; 53: 85–97.
3.
Song F, Ai Z, Zhang H, et al. Smart collaborative balancing for dependable network components in cyber-physical systems. IEEE Trans Industr Inform 2021; 17(10): 6916–6924.
4.
Ma H, Xu D, Dai Y, et al. An intelligent scheme for congestion control: When active queue management meets deep reinforcement learning. Comput Netw 2021; 200: 108515.
5.
Khan S, Hussain A, Nazir S, et al. Efficient and reliable hybrid deep learning-enabled model for congestion control in 5G/6G networks. Comput Commun 2022; 182: 31–40.
6.
Song F, Zhou Y, Wang Y, et al. Smart collaborative distribution for privacy enhancement in moving target defense. Inf Sci 2019; 479: 593–606.
7.
Lee K, Lee S, Kim Y, et al. Deep learning-based real-time query processing for wireless sensor network. Int J Distrib Sens Netw 2017; 13(5): 1550147717707896.
8.
Guo W, Yan C, Lu T. Optimizing the lifetime of wireless sensor networks via reinforcement-learning-based routing. Int J Distrib Sens Netw 2019; 15(2): 1550147719833541.
9.
Najjar-Ghabel S, Farzinvash L, Razavi SN. Mobile sink-based data gathering in wireless sensor networks with obstacles using artificial intelligence algorithms. Ad Hoc Netw 2020; 106: 102243.
10.
Sutton RS. Learning to predict by the methods of temporal differences. Mach Learn 1988; 3: 9–44.
11.
Dayan P, Sejnowski TJ. TD(lambda) converges with probability 1. Mach Learn 1994; 14(1): 295–301.
12.
Schapire RE, Warmuth MK. On the worst-case analysis of temporal-difference learning algorithms. Mach Learn 1996; 22(1–3): 95–121.
13.
Tsitsiklis JN, Roy BV. An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Control 1997; 42(5): 674–690.
14.
Tsitsiklis JN, Van Roy B. Average cost temporal-difference learning. Automatica 1999; 35(11): 1799–1808.
15.
Hu B, Syed UA. Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. In: Proceedings of the 32nd annual conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019, pp.8477–8488. San Diego, CA: Neural Information Processing Systems.
16.
Tsitsiklis JN, Roy BV. Analysis of temporal-difference learning with function approximation. In: Proceedings of the 9th conference on Neural Information Processing Systems, Denver, CO, 2–5 December 1996, pp.1075–1081. San Diego, CA: Neural Information Processing Systems.
17.
Devraj AM, Meyn SP. Zap Q-learning. In: Proceedings of the 30th conference on Neural Information Processing Systems, Long Beach, CA, 4–9 December 2017, pp.2235–2244. San Diego, CA: Neural Information Processing Systems.
18.
Korda N, Prashanth L. On TD(0) with function approximation: concentration bounds and a centered variant with exponential convergence. PMLR 2015; 37: 626–634.
19.
Dalal G, Szörényi B, Thoppe G, et al. Finite sample analyses for TD(0) with function approximation. In: Proceedings of the 32nd AAAI conference on artificial intelligence, New Orleans, LA, 2 February 2018, pp.6144–6160. Menlo Park, CA: Association for the Advancement of Artificial Intelligence.
20.
Xiong H, Xu T, Liang Y, et al. Non-asymptotic convergence of Adam-type reinforcement learning algorithms under Markovian sampling. In: Proceedings of the 34th AAAI conference on artificial intelligence, Vancouver, BC, Canada, 2–9 February 2021, pp.10460–10468. Menlo Park, CA: Association for the Advancement of Artificial Intelligence.
21.
Wang G, Lu S, Giannakis G, et al. Decentralized TD tracking with linear function approximation and its finite-time analysis. In: Proceedings of the 34th international conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. Red Hook, NY: Curran Associates Inc.
22.
Altahhan A. True online TD(λ)-replay: an efficient model-free planning with full replay. In: Proceedings of the international joint conference on neural networks (IJCNN), Glasgow, 19–24 July 2020. New York: IEEE.
23.
Chen Z, Maguluri ST, Shakkottai S, et al. A Lyapunov theory for finite-sample guarantees of asynchronous Q-learning and TD-learning variants. CoRR 2021; abs/2102.01567, https://arxiv.org/abs/2102.01567
24.
Srikant R, Ying L. Finite-time error bounds for linear stochastic approximation and TD learning. PMLR 2019; 99: 2803–2830.
25.
Bhandari J, Russo D, Singal R. A finite time analysis of temporal difference learning with linear function approximation. PMLR 2018; 75: 1691–1692.
26.
Gupta H, Srikant R, Ying L. Finite-time performance bounds and adaptive learning rate selection for two time-scale reinforcement learning. In: Proceedings of the 33rd annual conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019, pp.4706–4715. San Diego, CA: Neural Information Processing Systems.
28.
Papini M, Pirotta M, Restelli M. Adaptive batch size for safe policy gradients. In: Proceedings of the 30th international conference on Neural Information Processing Systems, Long Beach, CA, 4–9 December 2017, pp.3591–3600. San Diego, CA: Neural Information Processing Systems.
29.
Sun T, Shen H, Chen T, et al. Adaptive temporal difference learning with linear function approximation. CoRR 2020; abs/2002.08537, https://arxiv.org/abs/2002.08537
30.
Dayan P. The convergence of TD(λ) for general λ. Mach Learn 1992; 8: 341–362.
31.
Maei HR, Sutton RS. GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In: Proceedings of the 3rd conference on Artificial General Intelligence, vol. 1, Lugano, 5–8 March 2010, pp.91–96. Amsterdam: Atlantis Press.
32.
White AM, White M. Investigating practical linear temporal difference learning. In: Proceedings of the 15th international conference on Autonomous Agents and Multiagent Systems, Singapore, 9–13 May 2016, pp.494–502. New York: ACM.
33.
Sutton RS, Mahmood AR, White M. An emphatic approach to the problem of off-policy temporal-difference learning. J Mach Learn Res 2016; 17: 1–29.
34.
Kingma DP, Ba J. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, 7–9 May 2015.
36.
Zou F, Shen L. On the convergence of AdaGrad with momentum for training deep neural networks. CoRR 2018; abs/1808.03408. http://arxiv.org/abs/1808.03408
37.
Li X, Orabona F. On the convergence of stochastic gradient descent with adaptive stepsizes. In: Proceedings of the 22nd international conference on Artificial Intelligence and Statistics, vol. 89, Naha, Japan, 16–18 April 2019, pp.983–992. PMLR.
38.
Lin Q, Ling Q. Decentralized TD(0) with gradient tracking. IEEE Signal Process Lett 2021; 28: 723–727.
39.
Lowe R, Wu Y, Tamar A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems 30: annual conference on Neural Information Processing Systems, Long Beach, CA, 4–9 December 2017, pp.6379–6390. San Diego, CA: Neural Information Processing Systems.