Abstract
In modern times, it has been observed that Internet of things technology makes it possible for connecting various smart objects together through the Internet. For the effective Internet of things management, it is necessary to design and develop service models that ensure appropriate level of quality-of-service. Therefore, the design of quality-of-service management schemes has been a hot research issue. In this work, we formulate a new quality-of-service management scheme based on the IoT system power control algorithm. Using the emerging and largely unexplored concept of the R-learning algorithm and docitive paradigm, system agents can teach other agents how to adjust their power levels while reducing computation complexity and speeding up the learning process. Therefore, our proposed power control approach can provide the ability to practically respond to current Internet of things system conditions and suitable for real wireless communication operations. Finally, we validate the introduced concept and confirm the effectiveness of the proposed scheme in comparison with the existing schemes through extensive simulation analysis.
Introduction
The Internet of things (IoT) is regarded as a new technological and economic wave in the global information industry after the Internet.1–3 To achieve anytime-and-anywhere functionality, IoT equipment must connect, interact, and collaborate with the surrounding environment. For effective IoT management, multiple quality-of-service (QoS) attributes must be considered, including information accuracy, IoT coverage, required network resources, and energy consumption. Therefore, QoS for IoT services has been a popular research issue. Typically, different IoT applications have different QoS requirements. Nowadays, QoS is considered an important non-functional requirement of the IoT system that must be guaranteed while implementing effective resource allocation and scheduling algorithms.4,5
In current methods, real-time QoS schedulers are introduced into the IoT structure to provide QoS to applications.3 A QoS scheduler is an IoT system component designed to control the IoT system resources for various application services. In each local area, QoS schedulers assign the system resources to contending agents based on a set of criteria, namely, transmitter power, transmission rate, and QoS constraints.5 During IoT system operations, these QoS schedulers aim to maximize system utilization while meeting the QoS requirements of application classes with very tight constraints, such as bit rate and delay. However, it is challenging in practice to balance network availability against QoS assurance.3,5,6
Game theory is a powerful tool for studying situations of conflict and cooperation; it is concerned with finding the best actions for individual decision makers (i.e., players) in these situations and with recognizing stable outcomes.7 It has been used extensively in microeconomics, and only during the last years has it received attention as an effective method for designing and modeling distributed QoS control problems in telecommunications.7 In this article, we develop a new game model called the team game (TG). In our TG model, QoS schedulers are assumed to be game players. All game players form a team, and the actions of all players are coordinated to ensure team cooperation by treating a combination of individual payoffs as the team payoff. The main concept of the TG is to extend the well-known Markov decision problem (MDP) to the multi-player case.
Traditionally, game theory assumes that players have perfect information about the game, enabling them to calculate all possible consequences of each strategy selection.7 However, in real-world situations, a player must make decisions based on less-than-perfect information. If a player does not have complete information about the game, it follows that the player's reasoning must be heuristic. Therefore, to maximize IoT performance, how agents learn network situations and make the best decisions by predicting the influence of others' possible decisions is an important research issue in the field of IoT networking.8–10
In recent years, many learning algorithms have been developed in an attempt to maximize system performance in non-deterministic settings.8,11,12 Generally, learning algorithms guarantee that collective behavior converges to a coarse equilibrium status. In order to make control decisions in real time, QoS schedulers should be able to learn from dynamic system environments and adapt to the current network condition. Therefore, QoS schedulers using learning techniques acquire information from the environment, build knowledge, and ultimately improve their performance.3,8,10
In 1993, Schwartz13 introduced an average-reward reinforcement learning algorithm called the R-learning algorithm. Like Q-learning, the R-learning algorithm uses the action value representation; in addition, it must learn an estimate of the average reward. Therefore, the R-learning algorithm is performed as a two-time-scale learning process. In contrast to value-iteration-based learning algorithms, the decision-learning approach of the R-learning algorithm allows an agent to directly learn a stationary randomized policy and to directly update the probabilities of actions based on utility feedback.14
Recently, there has been increasing interest in research on various R-learning algorithms. However, many problems in this field remain open. Even though the R-learning algorithm has received considerable attention, designing a novel R-learning algorithm for real-world problems is still difficult: there are many complicated restrictions, which are often mutually contradictory and vary with the dynamic real-world IoT environment.
The docitive paradigm is an emerging technique for overcoming the current limitations of the R-learning algorithm.15 Based not only on cooperative learning but also on the process of knowledge transfer, this paradigm can significantly speed up the learning process and increase precision. The docitive paradigm can provide a timely solution based on knowledge sharing in a cooperative fashion with other players in the IoT system, which allows game players to develop new capacities for selecting appropriate actions.15,16
In this article, we develop a new IoT system power control scheme to ensure QoS provisioning. In the proposed scheme, we focus on how to integrate our TG game model and the R-learning algorithm to tackle the QoS control problem in IoT systems. To achieve self-adaptability and real-time effectiveness, we adopt the docitive paradigm; game players thus try to select optimal strategies in a distributed manner while approximating a common objective. Through the iterative TG game model, the proposed scheme attempts to ensure that the individual decisions of players result in jointly optimal decisions for the players' team. The main contributions of our study are as follows: (1) we develop a novel power control algorithm for IoT systems, (2) we integrate game theory and the R-learning algorithm to tackle power-level decisions, (3) we adopt a distributed learning approach to achieve self-adaptability and real-time effectiveness, and (4) we adopt the docitive paradigm to provide timely solutions. The most important novelty of our proposed scheme is its responsiveness to current IoT system conditions, which depends on the exchange of information and expert knowledge from other players, the so-called docitive players.
Related work
Over the years, much state-of-the-art research on the QoS control problem has been conducted. The QoS-based device-to-device communication (QDDC) scheme in the study of Dai et al.17 was a novel device-to-device communication algorithm to enhance spectrum efficiency. This scheme exploited the trade-off in power allocation of device-to-device transmitters while maximizing the number of admitted device-to-device pairs under the QoS constraints. The released communication resources were then distributed among different device-to-device pairs. The QDDC scheme can be easily extended to the uplink case and the multiple-channel case.17
The deterministic sequencing of exploration and exploitation (DSEE) scheme18 was developed as a new approach to the multi-armed bandit (MAB) problem, a class of sequential learning and decision problems with unknown models. The DSEE scheme finds the minimum cardinality of the exploration sequence that ensures that the reward loss in the exploitation sequence, caused by an incorrectly identified arm rank, has an order no larger than the cardinality of the exploration sequence.18
The non-Bayesian social learning (NBSL) scheme by Jiang et al.19 studied how users in a dynamic system learn the uncertain system state and make multiple concurrent decisions, considering not only the current myopic utility but also the influence of subsequent users' decisions. This scheme designed recursive best-response algorithms to find the subgame perfect Nash equilibrium for users and characterized special properties of the Nash equilibrium profile under a homogeneous setting.19
The multi-armed bandit with unknown dynamics (MBUD) scheme20 also considered the restless MAB problem with unknown dynamics, in which a player chooses one out of N arms to play at each time. The reward state of each arm transitions according to an unknown Markovian rule when the arm is played and evolves according to an arbitrary unknown random process when it is passive. The MBUD scheme constructed a policy with an interleaving exploration and exploitation epoch structure that achieves regret of logarithmic order.20 All the earlier work has attracted a lot of attention and addressed unique challenges in efficiently solving the QoS control problem. Compared to these schemes,17,18 the proposed scheme attains better performance during IoT system operations.
The rest of this article is organized as follows. First, the traditional MDP and R-learning algorithm are introduced in section “Markov decision process and R-learning algorithm.” Next, we formulate and explain the proposed TG model to solve the QoS problem in section “Proposed QoS control scheme for IoT systems.” In section “Performance evaluation,” we verify the effectiveness and efficiency of the proposed scheme from simulation results. Finally, we draw conclusions in section “Summary and conclusion.”
Markov decision process and R-learning algorithm
MDP is a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker; it is useful for optimization problems solved via reinforcement learning. Based on inputs to a dynamic system, an MDP probabilistically determines a successor state and continues for a finite or infinite number of stages.7,8,12,21,22 Traditionally, an MDP is defined as a tuple
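As a concrete illustration, the MDP tuple can be represented directly in code. The component names here (state space S, action space A, transition kernel P, reward function R) follow the standard textbook definition and are assumptions, since the article's own tuple notation was not preserved:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                           # S: finite state space
    actions: List[str]                          # A: finite action space
    P: Dict[Tuple[str, str], Dict[str, float]]  # P[(s, a)] -> {s': prob}
    R: Dict[Tuple[str, str], float]             # R[(s, a)] -> expected reward

# Toy two-state example: a scheduler is either "idle" or "busy".
toy = MDP(
    states=["idle", "busy"],
    actions=["low_power", "high_power"],
    P={("idle", "low_power"): {"idle": 0.9, "busy": 0.1},
       ("idle", "high_power"): {"idle": 0.5, "busy": 0.5},
       ("busy", "low_power"): {"idle": 0.2, "busy": 0.8},
       ("busy", "high_power"): {"idle": 0.7, "busy": 0.3}},
    R={("idle", "low_power"): 0.0, ("idle", "high_power"): -1.0,
       ("busy", "low_power"): 1.0, ("busy", "high_power"): 2.0},
)

# Each transition row must be a proper probability distribution.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in toy.P.values())
```

The transition probabilities and rewards above are arbitrary toy values chosen only to make the structure concrete.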
The objective of the MDP is to find a policy that minimizes the cost of each state.
To solve equation (3), reinforcement learning algorithms are a common approach. Several reinforcement learning algorithms exist; in this study, we adopt an average-reward reinforcement learning algorithm called R-learning. Like other reinforcement learning algorithms, the R-learning algorithm uses the action value representation, maintaining an action value for each state-action pair together with an estimate of the average reward.13,14
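A minimal sketch of the standard R-learning update in the form given by Schwartz's original formulation: a fast time scale updates the relative action values, and a slow time scale updates the average-reward estimate. The learning-rate values and the convention of updating the average reward only on greedy steps are illustrative assumptions, not values taken from this article:

```python
from collections import defaultdict

def r_learning_step(Q, rho, s, a, r, s_next, actions, beta=0.1, alpha=0.01):
    """One step of R-learning (average-reward reinforcement learning).

    Q is a mapping (state, action) -> relative action value; rho is the
    current average-reward estimate. beta (fast) and alpha (slow) are
    illustrative learning rates. Returns the updated rho.
    """
    greedy = Q[(s, a)] >= max(Q[(s, b)] for b in actions)  # was a greedy?
    best_next = max(Q[(s_next, b)] for b in actions)
    # Fast time scale: relative action value update.
    Q[(s, a)] += beta * (r - rho + best_next - Q[(s, a)])
    # Slow time scale: conventionally, rho is updated only on greedy steps.
    if greedy:
        rho += alpha * (r + best_next - max(Q[(s, b)] for b in actions) - rho)
    return rho

# Sanity check on a one-state, one-action chain paying reward 1 per step:
# the average-reward estimate should approach 1.
Q, rho = defaultdict(float), 0.0
for _ in range(2000):
    rho = r_learning_step(Q, rho, 0, 0, 1.0, 0, actions=[0])
```

The sanity check illustrates the two-time-scale behavior: Q grows toward a fixed relative value while rho converges to the true average reward of the chain.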
Proposed QoS control scheme for IoT systems
In this section, we examine the applicability of MDP to design a novel power control algorithm and develop a new TG model for IoT systems. The proposed model can significantly improve the success rate of IoT services.
QoS-aware service management in the TG model
To develop our QoS-aware service management scheme, we make the following simplifying assumptions about the real-world situation for practical implementation:
Power strategies of each QoS scheduler are quantized. Practicality is usually determined by computational complexity; therefore, the power levels should be kept simple.
We assume a system of four QoS schedulers in a local area, such as a conference room or a small office. Therefore, the proposed scheme has been developed as a four-player game model.
Pre-defined minimum bound
Heterogeneous traffic services are categorized into two classes according to the required QoS: class I (real-time) traffic services and class II (non-real-time) traffic services. Class I data services are highly delay sensitive, and strict deadlines apply. In contrast, more flexible data services, which are rather tolerant of delays, are called class II traffic services.
If our model is applied in a situation with hundreds or thousands of QoS schedulers in a huge area, the QoS schedulers must be grouped and clustered in a distributed manner. Using a locally distributed approach, the proposed scheme is then applied iteratively in each cluster.
In this study, we consider a new power control mechanism for IoT systems. Under the multi-QoS schedulers’ environment, we formulate the multiple decision-making process as a new TG model based on the multi-agent R-learning approach. Mathematically, the TG model
N is the set of all QoS schedulers.
Usually, the traditional solution concept of game theory is obtained under the following impractical assumptions: (1) fully rational players, (2) complete information, and (3) a static game-model setting. These assumptions hold only in theoretical and idealistic analyses. In real-world IoT operations, it is impossible to reduce the dynamic setting to a static setup, which means that the traditional solution with fully rational players is technically unobtainable in practice. To design the TG model practically, we develop a new solution concept called the stable equilibrium (SE). Based on R-learning and the docitive paradigm, the SE is applicable to repeated choice in learning situations. For our TG, we assume that the SE is a discrete set of probability distributions over the available strategies chosen by all players.
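One simple way to operationalize the SE concept is to test whether every player's strategy distribution has stopped changing between successive rounds. The elementwise comparison and the convergence threshold below are hypothetical choices for illustration, not definitions taken from the article:

```python
def reached_se(prev_dists, curr_dists, eps=1e-3):
    """Illustrative stable-equilibrium test: the team is declared stable
    once every player's strategy distribution over power levels differs
    from the previous round by less than eps in every component."""
    return all(abs(p - c) < eps
               for prev, curr in zip(prev_dists, curr_dists)
               for p, c in zip(prev, curr))
```

For example, `reached_se([[0.9, 0.1]], [[0.5, 0.5]])` is false because the single player's distribution is still moving, whereas identical distributions across two rounds pass the test.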
Power control algorithm based on the R-learning algorithm
In the proposed power control scheme, we focus on how to tackle the QoS control problem based on the R-learning algorithm. Owing to their self-adaptability, the QoS schedulers in the proposed scheme can update their strategies based on observations while responding to current IoT system conditions. Usually, the main interest of each QoS scheduler is to maximize the amount of transmitted data with low power consumption. However, there is a fundamental trade-off. To capture this conflicting relationship, a utility function
where W is the assigned channel bandwidth, and Ω (Ω ≥ 1) is the gap between uncoded M-ary quadrature amplitude modulation (M-QAM) and the capacity, minus the coding gain.7 Finally, the ith scheduler's utility is defined as follows
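The rate expression described by W and Ω is the standard SNR-gap approximation, r = W log2(1 + SINR/Ω). A minimal sketch, with parameter names chosen here for illustration:

```python
import math

def achievable_rate(bandwidth_hz, sinr, gap=1.0):
    """Rate under the standard SNR-gap approximation,
    r = W * log2(1 + SINR / gap), where gap (= Omega >= 1) captures the
    loss of uncoded M-QAM relative to capacity minus the coding gain.
    gap = 1.0 recovers the Shannon capacity limit."""
    assert gap >= 1.0, "the SNR gap must satisfy Omega >= 1"
    return bandwidth_hz * math.log2(1.0 + sinr / gap)
```

For example, `achievable_rate(1e6, 3.0)` gives 2 Mbit/s on a 1 MHz channel at a linear SINR of 3, and any gap larger than 1 strictly reduces the rate, reflecting the modulation and coding loss.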
In the developed game model, different schedulers can receive different payoffs for the same state transition. Based on our TG approach, schedulers seek to choose their power levels self-interestedly to maximize their payoffs. According to the R-learning equation (4), the expected payoff of QoS scheduler is
In our TG game model, each QoS scheduler pursues the goal of maximizing its utility function. In a distributed, self-regarding fashion, each QoS scheduler in a dynamic IoT system learns the uncertain IoT situation and makes a power control decision by taking into account the online feedback mechanism. Through an iterative process, the schedulers' decision-making mechanism is developed based on the R-learning algorithm. Based on this dynamic learning mechanism, the developed algorithm can constantly adapt each QoS scheduler's power level to strike an appropriate performance balance between contradictory requirements.
Based on the feedback learning process, the proposed scheme can capture how schedulers adapt their power levels to achieve better benefit. This procedure is defined as an online power control algorithm. In the proposed scheme, the selection probability for each power-level strategy is dynamically changed based on the payoff ratio, which drives strategy convergence. Therefore, schedulers examine their payoffs periodically in an entirely distributed fashion. Without any impractical rationality assumptions, schedulers can modify their power levels in an effort to maximize their payoffs.
In equation (9), defining of
Power levels chosen by the schedulers are given as input to the environment, and the environmental response to these power levels serves as an input to each scheduler. Therefore, multiple schedulers are connected in a feedback loop with their environment. When a scheduler selects a power level with its respective probability distribution
At every game round, all schedulers update their probability distributions based on the R-learning algorithm. If the scheduler i chooses
The QoS schedulers have to learn an effective action in a distributed fashion while achieving the common objective of the IoT system. We solve this problem, known as the multi-agent learning approach, using a distributed R-learning algorithm and the docitive paradigm. In the proposed TG model, the main challenge is to ensure that the individual decisions of each QoS scheduler approximate jointly optimal decisions for the team. As docitive players, individual QoS schedulers cooperate with others by exchanging information while learning each action's propensity from other team members, who are also performing power control via the R-learning algorithm. To apply this approach, QoS schedulers periodically exchange their updated
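The exchange-and-combine step of the docitive paradigm can be sketched as a simple mixing rule. The weighted-average form and the mixing weight below are hypothetical, since the article's exact combination formula is not reproduced here:

```python
def docitive_mix(own, teammates, weight=0.2):
    """Hypothetical docitive combination: blend a scheduler's own
    strategy propensities with the average of its teammates' shared
    propensities, then renormalize to a probability distribution.
    weight controls how strongly teammates' knowledge is absorbed."""
    n = len(own)
    avg = [sum(t[i] for t in teammates) / len(teammates) for i in range(n)]
    mixed = [(1.0 - weight) * own[i] + weight * avg[i] for i in range(n)]
    total = sum(mixed)
    return [m / total for m in mixed]
```

With `weight = 0`, a scheduler ignores its teammates and learns purely selfishly; larger weights pull the team's strategy distributions toward one another, which is the mechanism by which docitive sharing can speed up convergence.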
The main steps of the proposed scheme
In this work, we discuss a new perspective on designing the QoS control algorithm in IoT systems. In the proposed scheme, QoS schedulers adaptively decide their power levels while satisfying the QoS needs in their coverage areas. Based on past actions and environmental feedback, we employ the R-learning algorithm and the docitive paradigm, which attempt to find optimal actions effectively. Until now, several game models have been developed to help game players learn from the dynamic network environment. An important feature of our TG model is that it enables game players to reach a desired game outcome quickly.
From the results of individual learning experiences, each scheduler can learn how to play effectively under dynamic network situations. Therefore, the payoff estimation at each game iteration can be used to update the
Step 1. At the initial time,
Step 2. At the end of each game iteration, each QoS scheduler independently estimates its own payoff
Step 3. Based on the currently received information, each QoS scheduler periodically adjusts
Step 4. According to the docitive paradigm, each QoS scheduler receives the
Step 5. Using the proportion to each strategy’s propensity, each
Step 6. Iteratively, each QoS scheduler selects a strategy
Step 7. The sequential R-learning process is repeatedly operated in a distributed manner
Step 8. If all QoS schedulers reach the SE status, the game process is temporarily stopped. The SE status is formally defined as follows
Step 9. Each QoS scheduler constantly self-monitors the current IoT situation. If the current system status is not the SE, it proceeds to Step 2 for the next iteration.
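Steps 1-9 above can be sketched as a single loop. The payoff function, reinforcement step size, and round count below are toy placeholders; the article's real utility depends on SINR terms that are not reproduced here:

```python
import random

def run_team_game(n_players=4, n_levels=3, rounds=300, seed=1):
    """Skeleton of Steps 1-9 with a toy payoff in which a lower power
    level index is assumed cheaper and hence better. Returns each
    player's final strategy distribution over power levels."""
    rng = random.Random(seed)
    # Step 1: start from uniform propensities over the power levels.
    prop = [[1.0 / n_levels] * n_levels for _ in range(n_players)]
    for _ in range(rounds):
        # Step 6: every scheduler samples a power level from its distribution.
        choices = [rng.choices(range(n_levels), weights=p)[0] for p in prop]
        for i, c in enumerate(choices):
            # Steps 2-3: estimate the payoff and reinforce the chosen strategy.
            payoff = 1.0 / (1 + c)   # toy payoff: lower level, higher payoff
            prop[i][c] += 0.1 * payoff
        # Step 4: a docitive step would also mix in teammates' propensities.
        # Step 5: renormalize propensities into probability distributions.
        for i in range(n_players):
            total = sum(prop[i])
            prop[i] = [x / total for x in prop[i]]
        # Steps 8-9: a convergence (SE) test would stop the loop early.
    return prop

dists = run_team_game()
```

Under this toy payoff, each player's distribution drifts toward the cheapest power level, illustrating how repeated reinforcement of higher-payoff strategies pushes the team toward a stable outcome.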
Table 1. System parameters used in the simulation experiments. QoS: quality of service; M-QAM: M-ary quadrature amplitude modulation.
Performance evaluation
In this section, we compare the performance of the proposed scheme with that of other existing schemes. As mentioned in section “Introduction,” we select the QDDC scheme17 and the DSEE scheme18 and confirm the performance superiority of our approach through simulation analysis. The QDDC and DSEE schemes have been published recently and address unique challenges in efficiently solving system control problems. The assumptions of our simulation environment are as follows:
The simulated system consists of four QoS schedulers for an IoT system.
In each scheduler's coverage area, new service requests arrive according to a Poisson process with rate
The number of power levels (m) for QoS schedulers is three, and each strategy
System performance measures obtained on the basis of 100 simulation runs are plotted as a function of the offered traffic load.
The message size of each application is exponentially distributed with different means for different message applications.
For simplicity, we assume the absence of physical obstacles in the experiments.
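The Poisson arrival assumption above can be simulated by drawing exponential inter-arrival gaps. The rate and horizon arguments are placeholders, since the simulation's actual rate value is not reproduced in the text:

```python
import random

def poisson_arrivals(rate, horizon, seed=42):
    """Sample arrival times of a Poisson process of the given rate on
    [0, horizon) by accumulating exponentially distributed
    inter-arrival gaps (mean gap = 1 / rate)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return times
        times.append(t)
```

Over a long horizon, the number of generated arrivals concentrates around rate x horizon, which is the expected count of a Poisson process.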
To facilitate the development and implementation of our simulator, Table 1 lists the system parameters.
In this section, the performance of the proposed scheme is compared with two existing schemes: the QDDC scheme17 and the DSEE scheme.18 Even though these existing schemes are recently published novel protocols, they have several disadvantages. First, they rely on assumptions that are impractical in real operations; inapplicable presumptions can lead to erroneous decisions. Second, they incur extra control overhead, which can exhaust system resources and require intractable computation. Third, they cannot adaptively estimate the current system conditions. Fourth, they operate the system with fixed parameters, which is an inappropriate approach under dynamic real-world environments.
Performance measures obtained through the simulation are IoT system throughput, service success probability, normalized service delay, system power stability, and application incomplete ratio. In Figures 1–5, the x-axis (horizontal) marks the offered service load intensity, which is varied from 0 to 3.0. For each offered service load, the performance criteria are evaluated as normalized values; the y-axis (vertical) represents the normalized value of each performance criterion.
Figure 1. IoT system throughput.
Figure 2. IoT service success probability.
Figure 3. Normalized service delay in IoT systems.
Figure 4. System power stability.
Figure 5. Application incomplete ratio in IoT systems.
Figure 1 shows the performance comparison of each scheme in terms of IoT system throughput. In this work, the IoT system throughput is measured as the normalized number of information bits transmitted without error per unit time. Traditionally, it is one of the most critical aspects of IoT management. The proposed TG-based approach adaptively decides power levels in an interactive, cooperative manner while monitoring the current system conditions. Therefore, the system throughput of the proposed scheme is better than that of the other schemes.
Figure 2 represents the service success probability of each IoT control scheme. In this work, the service success probability is defined as the success ratio of service requests. In general, an excellent service success rate is a highly desirable property for actual IoT operations. As the offered traffic load increases, excessive service requests may lead to system congestion; therefore, the service success probability decreases, which is intuitively correct. Under various application service requests, our game-based R-learning approach effectively handles the power control problem in IoT systems and leads to a better service success probability than the other existing schemes.
Figure 3 reveals the normalized service delay in IoT systems. Usually, service delay is an important QoS metric and can reveal the fitness or unfitness of system protocols for different delay-sensitive applications. Owing to the feedback-based repeated game approach, our proposed scheme can dynamically adapt to the current situation and has much better accuracy than the other existing schemes.
Figure 4 indicates the IoT system power stability of each scheme. In this study, system power stability means the ratio of actual power changes to the total number of power control periods. All the schemes show similar trends. However, our docitive paradigm-based power control policy makes the IoT network system more stable. Therefore, the proposed scheme can maintain a steady state under various network load intensities.
The curves in Figure 5 present the application incomplete ratio in the IoT system. As the offered traffic load increases, the IoT system runs out of capacity for application service operations, and requested applications are likely to fail to meet the minimum QoS provisioning. Therefore, the application incomplete ratio increases linearly with the traffic load. From low to high traffic load intensities, the proposed scheme achieves a lower application incomplete ratio than the other schemes.
The performance trends presented in Figures 1–5 are very similar. However, using the TG-based R-learning mechanism, the proposed scheme is flexible, adaptive, and able to sense the dynamically changing IoT system environment, which is essential for approaching the optimized system performance. Under diversified IoT traffic conditions, the simulation results of the proposed scheme are much better than those of the other schemes. In particular, the IoT system throughput, service success probability, normalized service delay, system power stability, and application incomplete ratio are improved by about 5%, 5%, 10%, 20%, and 10%, respectively, compared with the existing QDDC and DSEE schemes.17,18
Summary and conclusion
The IoT is emerging as one of the major trends shaping the development of technologies in the Internet paradigm. IoT technology has evolved through the convergence of multiple technologies, ranging from wireless communication to the Internet and from embedded systems to micro-electromechanical systems. However, the diversity of applications gives rise to the QoS control problem in the IoT platform. This study provides a novel QoS control algorithm for IoT systems. Based on the R-learning algorithm and the docitive paradigm, we develop a new TG model in which QoS schedulers iteratively observe the current IoT system conditions and adaptively change their power levels to maximize system performance. Owing to the self-regarding feature, these control decisions are made in an entirely distributed fashion, and this distributed learning approach is suitable for practical implementation in actual IoT system operations. Compared with the existing schemes, the simulation results show that our proposed scheme effectively manages the IoT system and achieves better performance.
Footnotes
Academic Editor: Poh Chong
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the MSIP (Ministry of Science, ICT, and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-H8501-16-1018) supervised by the IITP (Institute for Information & communications Technology Promotion) and was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2015R1D1A1A01060835).
