An actor-critic-based portfolio investment method inspired by benefit-risk optimization

Abstract

How to obtain the maximal benefit within a given range of risk in the securities market is an interesting and widely studied problem. Because many complex factors affect securities, such as the risk and uncertainty of the benefit, it is very difficult to establish an appropriate investment model. To address the curse of dimensionality and the modeling difficulties that the problem causes, we use approximate dynamic programming to set up a Markov decision model for the multi-period portfolio problem with transaction costs. A model-based actor-critic algorithm for uncertain environments is proposed, in which the optimal value function is obtained by iteration under a constrained risk range and a limited amount of funds, and the optimal investment of each period is solved by dynamic programming over the allocation ratios of the limited funds. The experiment indicates that the algorithm achieves a stable investment and that the income grows steadily.

Keywords

Actor-critic, investment, operations optimization, stock, reinforcement learning

Introduction

Portfolio theory studies how to obtain the maximal expected profit within a predefined risk range, or how to minimize the expected risk for a given range of expected profit.^1,2 In the general portfolio problem, an individual or a business invests according to prices, expenses, and the resulting changes in the amount of money, so that limited funds are used in an optimal way. However, many uncertain factors affect the portfolio, and the utility function is generally non-convex.³ Over the long term, many conventional methods struggle to find the best allocation strategy, and most of them attain only a locally optimal solution rather than the globally optimal one. How to establish a reasonable model for the general investment problem and find a globally optimal method has therefore attracted much attention and has become a hot topic in recent years.

Portfolio investment is an effective way to control investment risk and attain profit. Mean-variance analysis showed that portfolio investment can effectively reduce risk,⁴ and with the proposed portfolio efficient frontier model, stock investors can find the stocks with the lowest level of risk for the same rate of profit.⁵ However, investment is affected by many sophisticated factors, such as economic, social, and subjective investor attributes, which change constantly over time. As a result, the so-called optimal solution of a fixed investment model learned from historical records, which spreads funds over multiple investments to reduce risk and obtain as much revenue as possible, is usually not the one that best fits the situation at the time.

Most portfolio models assume that the profit on securities follows a normal distribution and that all investors act within a single investment period. The variance is used to assess the investment risk, and investors choose according to the expectation and variance of the portfolio yield. However, in real-world securities markets, the rate of profit does not necessarily obey the normal distribution, and investors have their own risk aversion and risk appetite. Moreover, investment is often influenced by uncertain factors, which makes it more difficult. The newer models, classified as modern portfolio theory, handle the problem better, with the goal of finding the optimal portfolio size and investment ratios. Holding more assets in a portfolio reduces the risk of the whole combination but increases the investment fees, and the risk no longer falls once the number of assets in the portfolio reaches a certain level. The risk of a portfolio is characterized by the covariance between its assets. Under specified conditions, there is a set of investment ratios that minimizes the combined risk.

Most conventional portfolio investment models first divide a single-stage problem into a multi-stage one and then solve it by dynamic programming. As the modern financial market develops, the number of investment types keeps growing; as a result, the data generated by the multi-stage problem, together with the transaction costs, give rise to the curse of dimensionality. Moreover, multi-stage portfolio investment is in fact a nonlinear dynamic stochastic process, in which the uncertain factors of each period must be considered.

In practice, dynamic programming has numerous limitations in dealing with portfolio investment problems. First, the fundamental inputs required by the model include the variance of each asset in the current state and the covariance between every pair of assets. The great number of stock combinations leads to the curse of dimensionality, as there are too many states. Second, the input data often contain errors, which make the optimal results unreliable and uncertain. The inputs are usually estimated from previous data. If the estimates were free from errors, the conventional model would guarantee a valid portfolio; but because the expected values are normally unknown and require statistical estimation, the estimated inputs are very likely to contain bias or errors, which often result in incorrect investment ratios. Third, in the presence of transaction costs, uncertain random information may affect the optimal solution and make it unstable. Moreover, in a long-term investment process, adjusting the proportions of assets in the portfolio within each investment cycle leads to increasing transaction costs.

Many models, such as the mean-variance model,⁶ the capital asset pricing model,⁷ and the Black–Scholes option pricing model,⁸ assume that the market is free of friction, that is, there are no taxes or transaction costs. However, most securities markets are very complex, not only because of transaction costs, but also because of liquidity and turnover, along with the correlation between stocks. All these factors make the problem harder to solve.

In this paper, to address the above problems and the model defects of dynamic programming, we use approximate dynamic programming to establish a reasonable Markov decision process (MDP) model⁹ for the portfolio investment problem. Under various constraints, we design decision-making goals for different targets and combine the actor-critic algorithm with a piecewise linear function approximation method to solve for the optimal investment policy, so as to achieve a stable investment within a certain level of risk.

MDP

MDP model

An MDP model can be denoted by a quadruple 〈S, A, f, ρ〉, where S is the state set, A is the action set, f: S × A→S is the state transition function, ρ: S × A→R is the reward function.^10–12 The reward value r_t+1 of time t is

r_{t + 1} = ρ (s_{t}, a_{t})

(1)

where s_t is the state and a_t is the action. The action a_t in state s_t is determined by the policy π which is learned by the algorithm and decides the action of the agent.

According to the policy π, the state s_t transfers to the next state s_t₊₁

s_{t + 1} = f (s_{t}, π (s_{t}))

(2)

The reward function is

r_{t + 1} = r (s_{t}) = ρ (s_{t}, π (s_{t}))

(3)

Given an infinite state sequence {s₀, s₁, … , s_n, …}, the total amount of the discount reward received is

R^{π} (s) = \sum_{t = 0}^{\infty} γ^{t} r (s_{t}) = \sum_{t = 0}^{\infty} γ^{t} ρ (s_{t}, π (s_{t}))

(4)

where s is the state and γ is the discount rate.

However, in practice the number of steps is finite, and the sequence generally ends in a termination step that only marks termination and is not involved in computing the reward. The state sequence is then written as {s₀, s₁, … , s_n, s_n₊₁}, where each state s_i (0 ≤ i ≤ n) is an ordinary state and s_n₊₁ is the termination state; correspondingly, equation (4) can be redefined as

R^{π} (s) = \sum_{t = 0}^{n} γ^{t} r (s_{t}) = \sum_{t = 0}^{n} γ^{t} ρ (s_{t}, π (s_{t}))

(5)

Reinforcement learning

Reinforcement learning algorithms usually evaluate the policy π using the action value function Q^π(s, a) and the state value function V^π(s).^13,14 The action value function Q^π(s, a) is the cumulative reward obtained by taking the action a in the state s and following the policy π afterwards, and the state value function V^π(s) is the cumulative reward obtained from the specific state s under the policy π.

According to the definition, the action value function¹⁵ is calculated as

Q^{π} (s, a) = ρ (s, a) + γ R^{π} (s^{'})

(6)

where state s′ is the successor state of state s.

Under the policy π learned by the algorithm, after taking the action a, the successor state of state s is given by the transition function f

s' = f (s, a)

(7)

Substituting equation (7) into equation (6), we get

Q^{π} (s, a) = ρ (s, a) + γ R^{π} (f (s, a))

(8)

Given a state-action sequence {(s₀, a₀),(s₁, a₁), … , (s_n, a_n), (s_n+1, a_n+1)}, each pair (s_i, a_i) (0 ≤ i ≤ n) represents taking action a_i in state s_i, while the pair (s_n+1, a_n+1) only marks the termination state and action and is not used for computing the reward. From the sequence {(s₀, a₀),(s₁, a₁), … , (s_n, a_n)}, the discounted reward Q^π(s, a) is obtained as

\begin{array}{l} Q^{π} (s_{0}, a_{0}) = ρ (s_{0}, a_{0}) + γ R^{π} (s_{1}) \\ = ρ (s_{0}, a_{0}) + \sum_{t = 1}^{n} γ^{t} ρ (s_{t}, π (s_{t})) \\ = ρ (s_{0}, a_{0}) + \sum_{t = 1}^{n} γ^{t} ρ (s_{t}, a_{t}) \\ = ρ (s_{0}, a_{0}) + γ \sum_{t = 1}^{n} γ^{t - 1} ρ (s_{t}, a_{t}) \\ = ρ (s_{0}, a_{0}) + γ ρ (s_{1}, a_{1}) + γ \sum_{t = 2}^{n} γ^{t - 1} ρ (s_{t}, a_{t}) \\ = ρ (s_{0}, a_{0}) + γ [ρ (s_{1}, a_{1}) + \sum_{i = 2}^{n} γ^{i - 1} ρ (s_{i}, a_{i})] \end{array}

(9)

where Q^π(s₁, a₁) is calculated by

ρ (s_{1}, a_{1}) + \sum_{i = 2}^{n} γ^{i - 1} ρ (s_{i}, a_{i})

(10)

Then

Q^{π} (s_{0}, a_{0}) = ρ (s_{0}, a_{0}) + γ Q^{π} (s_{1}, a_{1})

(11)

Therefore, we can get the general form of Q^π(s, a)

Q^{π} (s, a) = ρ (s, a) + γ Q^{π} (s^{'}, a^{'})

(12)

where s′ is the successor state of the state s and a′ = π(s′) is the action taken in s′.
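The recursion in equation (11) can be checked numerically. The following is a minimal sketch, assuming a deterministic trajectory with hypothetical reward values: it computes Q^π along the trajectory backwards and compares Q^π(s₀, a₀) with the direct discounted sum of equation (5). The function name and the numbers are illustrative and not taken from the paper.

```python
def q_along_trajectory(rewards, gamma):
    """Return [Q(s_0, a_0), ..., Q(s_n, a_n)] for one fixed trajectory.

    rewards[t] plays the role of rho(s_t, a_t); the termination pair
    (s_{n+1}, a_{n+1}) contributes no reward, so its Q value is 0.
    """
    q_values = [0.0] * len(rewards)
    next_q = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        # Q(s_t, a_t) = rho(s_t, a_t) + gamma * Q(s_{t+1}, a_{t+1})
        q_values[t] = rewards[t] + gamma * next_q
        next_q = q_values[t]
    return q_values


rewards = [1.0, 2.0, 3.0]  # hypothetical rewards for t = 0, 1, 2
gamma = 0.9
q = q_along_trajectory(rewards, gamma)
direct = sum(gamma ** t * r for t, r in enumerate(rewards))
print(q[0], direct)  # both values agree up to rounding
```

The backward pass visits each step once, so the whole list of Q values costs the same as one evaluation of the direct sum.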

The state value function is

V^{π} (s) = R^{π} (s)

(13)

Similarly, we can get the recursive form of the state value function V^π(s)

V^{π} (s) = ρ (s, π (s)) + γ V^{π} (s^{'})

(14)

where s′ is the successor state of the state s.

In a nondeterministic environment, two aspects of uncertainty should be considered. First, the successor state can be random and is not fully determined by the current state and the selected action; second, the reward can also be uncertain. The action value function Q^π(s, a) and the state value function V^π(s) need to be changed accordingly.

In the MDP model of a nondeterministic environment, the deterministic state transition function f is replaced by a state transition probability function $\tilde{f}$: S × A × S → [0,1]. At time t, in the state s_t and with the action a_t chosen by the policy π, the probability that the state transfers to a successor state s_t₊₁ ∈ S_t₊₁ is

p (s_{t + 1} \in S_{t + 1} | s_{t}, a_{t}) = \int_{S_{t + 1}} \tilde{f} (s_{t}, a_{t}, s^{'}) d s^{'}

(15)

where $S_{t + 1} \subseteq S$ denotes the set of all possible successor states s_t₊₁ at time t. Similar to the representation of the reward value, the state transition probability function can be denoted as p^π(s_t₊₁|s_t) under policy π.

First, consider uncertainty in the state transfer. After taking the action a_t in the state s_t, the agent transfers to the state s_t₊₁ with a transition probability rather than deterministically, so the reward is weighted by that probability

\begin{matrix} \tilde{ρ} (s_{t}, a) = p (s_{t + 1} | s_{t}, a) r_{t + 1} \\ = r_{t + 1} \int_{S_{t + 1}} \tilde{f} (s_{t}, a_{t}, s^{'}) d s^{'} \end{matrix}

(16)

In a nondeterministic environment, where the successor state is drawn from the distribution $\tilde{f} (s_{t}, π (s_{t}), •)$, the discounted reward obtained from a starting state s₀ under the policy π is

R^{π} (s_{0}) = E_{s_{t + 1} \sim \tilde{f} (s_{t}, π (s_{t}), •)} {\sum_{t = 0}^{n} γ^{t} \tilde{ρ} (s_{t}, π (s_{t}))}

(17)

where E stands for expectation, π(s_t) is the action of state s_t under policy π, $\tilde{f}$ is the state transition probability function from state s_t to the successor state s_t₊₁, and $s_{t + 1} \sim \tilde{f} (s_{t}, π (s_{t}), •)$ denotes drawing the successor state s_t₊₁ from this distribution.

Therefore, the general form of Q^π under the uncertain environment is

Q^{π} (s, a) = E_{s^{'} \sim \tilde{f} (s, π (s), •)} {\tilde{ρ} (s, π (s)) + γ Q^{π} (s^{'}, a)}

(18)

where s′ is the successor state of the state s.

Second, consider uncertainty in the reward. Accounting for uncertainty in the action taken means that the action prescribed by the policy π can be any action in the action set. In a nondeterministic environment the reward value is also unstable, so it is likewise weighted by a probability. The reward term of equation (18) thus becomes

\tilde{r} (s, a) = \int p^{π} (r | s, a) r d r

(19)

where p^π(r|s, a) represents the probability that the reward r is obtained by taking the action a in the state s according to the policy π.

Correspondingly, the Q^π term on the right-hand side of equation (18) becomes

Q^{π} (s^{'}, a) = \int p^{π} (a^{'} | s^{'}) Q^{π} (s^{'}, a^{'}) d a^{'}

(20)

where p^π(a′|s′) is the probability of taking action a′ in state s′ according to policy π.

Therefore, Q^π in a nondeterministic environment is

Q^{π} (s, a) = \int p^{π} (r | s, a) r d r + γ \int p^{π} (a^{'} | s^{'}) Q^{π} (s^{'}, a^{'}) d a^{'}

(21)

Similarly, V^π in a nondeterministic environment is

V^{π} (s) = \int p^{π} (r | s) r d r + γ \int p^{π} (s^{'} | s) V^{π} (s^{'}) d s^{'}

(22)

In reinforcement learning, the policy that maximizes the expected cumulative reward (the return) is called the optimal policy π*; the corresponding optimal action value function is Q^*(s, a), and the optimal state value function is V^*(s). Therefore, for any policy π and any state-action pair (s, a), Q^*(s, a) ≥ Q^π(s, a); for any policy π and any state s, V^*(s) ≥ V^π(s). Although a reinforcement learning problem may have multiple optimal policies, the optimal action value function and the optimal state value function are unique, and are updated by

Q^{*} (s, a) = \int p^{π} (r | s, a) r d r + γ \int p^{π} (a^{'} | s^{'}) \max_{a^{'}} Q^{*} (s^{'}, a^{'}) d a^{'}

(23)

V^{*} (s) = \int p^{π} (r | s) r d r + γ \max_{u} \int p^{π} (s^{'} | s) V^{*} (s^{'}) d s^{'}

(24)
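Equations (23) and (24) are fixed-point conditions, so the optimal value function can be approached by repeated backups. The following is a minimal sketch on a discrete toy problem, assuming finite state and action sets so that the integrals become sums. The two-state example, the names `P` and `R`, and the tolerance are hypothetical and not taken from the paper, and the code uses the standard Bellman optimality backup, in which the maximum over actions is taken over the whole right-hand side.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_iter=10_000):
    """Iterate V(s) <- max_a [ R[s][a] + gamma * sum_s' p(s'|s,a) * V(s') ].

    P[s][a] is a list of (probability, next_state) pairs and
    R[s][a] is the expected immediate reward of action a in state s.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iter):
        new_V = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        # Stop once the largest change over all states is below tol.
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V
    return V


# State 0: action 0 holds (stay, reward 0); action 1 invests
# (reward 1, moves to the absorbing state 1 with probability 0.8).
P = [
    [[(1.0, 0)], [(0.8, 1), (0.2, 0)]],
    [[(1.0, 1)]],
]
R = [[0.0, 1.0], [0.0]]
V = value_iteration(P, R, gamma=0.9)
print(V)  # V[0] approaches 1 / (1 - 0.9 * 0.2), V[1] stays 0
```

In this toy problem the fixed point can be solved by hand, V(0) = 1 + 0.9 · 0.2 · V(0), which gives a quick sanity check on the iteration.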

Actor-critic model

Unlike value function-based reinforcement learning methods, the actor-critic algorithm has two separate structures, one for storing and updating the value function and the other for storing and updating the policy.^16,17 The agent selects actions according to the policy rather than the value function. The policy part is called the actor, which selects and performs actions; the value function part is called the critic, which updates the value function and uses it to evaluate the actions taken by the actor. The critic can use the temporal difference (TD) error computed by a TD learning method, e.g. Q-learning, which learns directly from raw experience without a model of the system and is thus suitable for decision-making in dynamic and uncertain systems. The framework of the actor-critic algorithm is shown in Figure 1.

Figure 1.

An illustration diagram for framework of actor-critic algorithm.

The advantage of the actor-critic algorithm is that it separates the policy from the value function and uses linear approximation to learn both. The critic is the value function approximator,¹⁸ which learns the value estimate and passes it to the actor. The actor is a policy approximator, which learns a stochastic policy and uses a gradient-based policy update to select actions. The critic uses the temporal difference algorithm to estimate the value function of each state reached by the actor's actions during policy iteration. It uses the estimated value function to evaluate the action selected by the actor, searching for the maximum of the local or overall cumulative reward, so as to provide a more effective reinforcement signal to the actor, whose policy is then updated along the stochastic gradient. The actor-critic algorithm is computationally much simpler than Q-learning,¹⁹ can determine the optimal policy, and has been applied effectively to control tasks.^20,21
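The interplay between the two parts can be illustrated with a minimal tabular sketch. It assumes a one-step problem with a single state and two actions, a softmax actor over action preferences, and a critic that stores one state value; the reward values, learning rates, and function names are hypothetical and are not the paper's portfolio model.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def train_actor_critic(rewards, episodes=500, alpha_actor=0.1,
                       alpha_critic=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # actor: one preference per action
    value = 0.0                   # critic: V(s) of the single state
    for _ in range(episodes):
        probs = softmax(prefs)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # TD error; the episode ends after one step, so V(next) = 0.
        td_error = reward + gamma * 0.0 - value
        value += alpha_critic * td_error          # critic update
        for a in range(len(prefs)):               # actor update
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += alpha_actor * td_error * grad
    return softmax(prefs), value


probs, value = train_actor_critic(rewards=[0.0, 1.0])
print(probs, value)  # the actor ends up preferring action 1
```

The critic's TD error acts as the feedback signal: a positive error raises the preference of the chosen action, and a negative error lowers it.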

Markov chain model for investment

Prerequisites for the model

Consider an investor with available cash C, a total of n kinds of investment (stocks, funds, bonds, foreign exchange, etc.), and a maximal loss risk m_r. The goal of the investment is to obtain the most benefit with the least risk, that is, the maximum benefit-risk ratio. If the benefit over a certain period is treated as a dynamic process, we can establish a Markov chain model to solve the problem. The model describes the state of the portfolio investment, including the stock code, the number of shares held, the expected rate of return, the risk score, and the average turnover rate of each stock in the portfolio. It can be represented by a vector

portfolio\ investment = [stock_{1}, \dots, stock_{n}]

stock = [stock\ code, holding\ number, expected\ reward, risk\ factor, turnover\ rate]

Given the investment transfer function φ, the value of φ is the probability of a transfer of funds. In particular, a value of 0 means that no funds are transferred and the holding is maintained, so the investment at the next stage is the same as at the current stage. A value of 1 means that funds are transferred and the holding changes, so the investment at the next stage differs from that at the current stage.

φ (J) = \begin{cases} 0 & J = J^{'} \\ 1 & J \neq J^{'} \end{cases}

(25)

where $φ (J) = 0$ denotes no funds transfer, keeping the stock or the principal unchanged, $φ (J) = 1$ denotes a funds transfer, J is the funding of the current stage, and J′ is the funding of the next stage. Q(s_t, a) is the state vector for decision

Q (s_{t}, a) = π (Q (s_{t}, a_{1, i}) = a_{2, i})

(26)

Given a stock, if the agent decides to increase its holdings, r(s_t, a) is positive; if the agent decides to decrease its holdings, r(s_t, a) is negative. We use post-decision state vectors for decision

R (s_{t}, a) = R (s_{t - 1}, a^{'}) + r (s_{t}, a)

(27)

Considering transaction costs, we define the cost of continuing to hold a financial product as 0, denoted by c_a1 = 0; at stage t, transferring from financial product J to another financial product incurs a cost, c_a1x_t,a1.

cost (t) = \begin{cases} 0 & J = J^{'} \\ φ (J) x (s_{t}) & J \neq J^{'} \end{cases}

(28)

where J is the funding of the current stage, J′ is the funding of the next stage, and $x (s_{t})$ is the predefined transfer fee rate at stage $s_{t}$, which can be set as a constant.

Therefore, the transaction cost during the investment period is

cost (R) = \sum_{i = 0}^{T} cost (i)

(29)
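Equations (28) and (29) can be sketched in a few lines. The example below assumes that the funding at each stage is identified by a label, that a change of label between consecutive stages counts as a transfer (φ = 1), and that the fee rate is constant; the labels, the rate, and the function names are hypothetical.

```python
def stage_cost(current, nxt, fee_rate):
    """Cost of one stage: 0 when the holding is kept, the fee rate otherwise."""
    phi = 0 if current == nxt else 1  # transfer indicator
    return phi * fee_rate


def total_cost(holdings, fee_rate):
    """Sum of the stage costs over consecutive stages of the period."""
    return sum(
        stage_cost(j, j_next, fee_rate)
        for j, j_next in zip(holdings, holdings[1:])
    )


holdings = ["A", "A", "B", "B", "C"]  # hypothetical funding per stage
print(total_cost(holdings, fee_rate=0.0005))  # two transfers are charged
```

Because holding is free in this model, the total cost depends only on how often the funding changes, which is why frequent rebalancing raises the cost over a long horizon.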

When the value of the asset changes, the relative return of the market is defined as $\tilde{r} (s_{t}, a)$ , and then the relative return of a financial product is

\tilde{R} (s_{t}, a) = R (s_{t - 1}, a) + \tilde{r} (s_{t}, a)

(30)

When the state moves from $R (s_{t - 1}, a)$ to $R (s_{t}, a)$ , the total amount of resources is

\bar{R} (s_{t}) = \sum_{a} R (s_{t - 1}, a)

(31)

The initial value of R is set to M. The portfolio investment problem involves a variety of stochastic information, such as the expected rate of return, the risk factors, and the turnover rate of individual stocks and of the portfolio, which increases the difficulty of solving the problem. The yield and risk factors are estimated as follows.

Expected returns

Given a stock pool sp, $sp \in \{0, \dots, N\}$ , where sp = 0 means that no stock is invested and only the bank interest on the principal is obtained, the expected return rate of the i-th stock at stage t with state s_t is denoted by $sp (s_{t}, stock_{i})$ , and $\bar{sp} (s_{t}, stock_{i})$ is the return rate of the i-th stock on each trading day. The action is defined as selecting stocks and the number of shares to hold, and is represented by the i-th portfolio, $po_{i}$ . The state is defined as the stocks and shares held at time t. The expected reward of the i-th portfolio at stage t with state s_t is

\tilde{r} (s_{t}, po_{i}) = \sum_{j = 1}^{n} ω (s_{t}, stock_{j}) \bar{sp} (s_{t}, stock_{j})

(32)

where $po_{i}$ is the i-th portfolio, n is the total number of stocks, and $ω (s_{t}, stock_{j})$ is the weight of the single stock in stage t, which represents the proportion of the j-th stock in the combination

ω (s_{t}, stock_{j}) = \frac{\tilde{r} (s_{t}, stock_{j})}{\sum_{k = 1}^{n} \tilde{r} (s_{t}, stock_{k})}

(33)

where the sum of all weights is 1.
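Equations (32) and (33) can be illustrated with a short sketch. It assumes that all expected returns are positive, so that the normalized weights are valid proportions; the numbers and function names are hypothetical.

```python
def weights_from_returns(expected_returns):
    """Normalize expected returns into weights that sum to 1."""
    total = sum(expected_returns)
    if total <= 0:
        raise ValueError("expected returns must have a positive sum")
    return [r / total for r in expected_returns]


def portfolio_expected_return(weights, daily_returns):
    """Weighted sum of the per-stock return rates."""
    return sum(w * sp for w, sp in zip(weights, daily_returns))


expected = [1.0, 2.0, 1.0]   # hypothetical expected returns of three stocks
daily = [0.4, 0.2, 0.8]      # hypothetical daily return rates
w = weights_from_returns(expected)
print(w, portfolio_expected_return(w, daily))
```

A stock with a higher expected return receives a proportionally larger weight, which matches the statement that a higher yield corresponds to a greater weight.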

According to equation (30), the expected return of the i-th portfolio at stage t with state s_t is

\tilde{R} (s_{t}, a_{i, t}) = R (s_{t - 1}, a_{i, t - 1}) + \tilde{r} (s_{t}, a_{i, t})

(34)

The update of the weight of the i-th stock in the t-th stage is

\begin{array}{l} ω (s_{t}, stock_{i}) = \frac{\tilde{R} (s_{t}, a_{i, t}) - R (s_{t - 1}, a_{i, t - 1})}{\sum_{j = 1}^{n} \tilde{r} (s_{t}, stock_{j})} \\ = \frac{\tilde{R} (s_{t}, a_{i, t}) - R (s_{t - 1}, a_{i, t - 1})}{\tilde{R}} \end{array}

(35)

Risk factor

The risk factor of a stock is

risk_{s} (s_{t}, stock_{i}) = \frac{\sqrt{\sum_{k = 1}^{N} {(sp (s_{k}, stock_{i}) - \bar{sp})}^{2}}}{N}

(36)

where N is the total number of time steps, $sp (s_{k}, stock_{i})$ is the expected return rate of the i-th stock at stage k with state s_k, and $\bar{sp} (s_{t}, stock_{i})$ is the return rate of the i-th stock of each trading day.

And the risk factor of a portfolio is

risk_{p} (s_{t}, po_{i}) = \frac{\sqrt{\sum_{k = 1}^{N} {(\tilde{r} (s_{k}, po_{i}) - \bar{r})}^{2}}}{N}

(37)

where N is the total number of time steps and $\bar{r}$ is the average of $\tilde{r}$ during the N time steps.
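The risk factor of equations (36) and (37) can be computed as follows. This minimal sketch reads the square as applying to each deviation inside the sum and keeps the division by N outside the square root, as the formulas are printed; the return series and the function name are hypothetical.

```python
import math


def risk_factor(returns):
    """sqrt(sum of squared deviations from the mean) divided by N."""
    n = len(returns)
    mean = sum(returns) / n
    squared_deviations = sum((x - mean) ** 2 for x in returns)
    return math.sqrt(squared_deviations) / n


series = [1.0, 2.0, 3.0]  # hypothetical return rates over three steps
print(risk_factor(series))  # sqrt(2) / 3
```

Note that this differs from the usual standard deviation, which divides by N inside the square root; a constant series gives a risk factor of 0 under both definitions.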

In a portfolio investment, the smaller the correlation between the selected stocks, the smaller their impact on each other and the smaller the risk factor. The correlation is reflected by the correlation coefficient between two stocks, denoted by

ϑ (i, j) = \frac{cov (i, j)}{\sqrt{D_{i}} \sqrt{D_{j}}}

(38)

where $ϑ (i, j)$ is the correlation coefficient between the stock i and the stock j, cov(i, j) is their covariance, D_i and D_j are their variances, and $| ϑ (i, j) | \leq 1$ . A value of $ϑ (i, j)$ closer to 0 indicates a weaker correlation between the stocks and a smaller risk; a value closer to 1 indicates a stronger correlation and a greater risk. It is better to choose a combination with correlation coefficients close to zero to minimize the total risk factor.
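As a small illustration of equation (38), the sketch below computes the correlation coefficient of two return series from their covariance and variances. The series are hypothetical, and population (divide-by-N) moments are assumed.

```python
import math


def correlation(x, y):
    """Correlation coefficient cov(x, y) / (sqrt(D_x) * sqrt(D_y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (math.sqrt(var_x) * math.sqrt(var_y))


stock_i = [1.0, 2.0, 3.0]
print(correlation(stock_i, [2.0, 4.0, 6.0]))  # moves together: close to 1
print(correlation(stock_i, [3.0, 2.0, 1.0]))  # moves oppositely: close to -1
```

Pairs whose coefficient is near zero contribute little joint risk, which is the selection criterion described above.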

Value function

We use V(R) to represent the total value of the securities R, $V (R (s_{t}))$ to denote the value function at time t, and $V (R (s_{t}, a))$ to represent the value function of the state vector after the decision. The investor's goal is to maximize the total return on long-term investments, which is based on the assumption that investors tend to minimize asset losses at each investment stage.

From the above analysis, we can see that in a portfolio investment a greater risk factor comes with a smaller turnover rate, a higher yield comes with a greater weight, and a smaller correlation between any two stocks comes with a smaller risk factor. Since the investor chooses randomly among investments of the same type, it is hard to obtain the exact value function at each stage. It is therefore necessary to obtain an approximate solution by assigning different weights to different financial products. The approximate value function is denoted by $\tilde{V} (R (s_{t}))$ .

Here we use a utility function to measure the benefit by

B (\bar{R} (s_{t})) = {(\frac{\bar{R} (s_{t})}{\bar{R} (s_{t - 1})})}^{1 - risk_{p}}

(39)

where risk_p is the coefficient that controls the risk aversion, and $\bar{R} (s_{t})$ is the constant at time t.
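The utility of equation (39) is a power function of the growth ratio of total resources. A minimal sketch, assuming positive resource totals and a risk coefficient between 0 and 1; the amounts and the function name are hypothetical.

```python
def benefit(total_now, total_prev, risk_p):
    """Utility of the growth ratio, damped by the risk-aversion coefficient."""
    return (total_now / total_prev) ** (1.0 - risk_p)


# A 2% gain is valued at less than 1.02 once risk aversion is positive.
print(benefit(51_000.0, 50_000.0, risk_p=0.0))
print(benefit(51_000.0, 50_000.0, risk_p=0.5))
```

With risk_p = 0 the utility equals the raw growth ratio; as risk_p grows toward 1, both gains and losses are pulled toward 1, so the investor becomes less sensitive to the size of the change.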

Investors can obtain a number of possible portfolios for a certain range of risk and rate of return; the portfolio that attains the maximum expected utility is called the optimal portfolio investment. The convexity of the indifference curve and the convexity of the effective set determine that the optimal portfolio is unique. The objective function can be written as

\tilde{V} (R (s_{t})) = \max_{r} (B (\bar{R} (s_{t})) + γ E {\bar{V} (R (s_{t - 1}))})

(40)

where E is expectation and is calculated by

\begin{array}{l} E {\bar{V} (R (s_{t - 1}))} = \sum_{i = 1}^{n} p_{t - 1, a_{i}} ω_{i, t - 1} R (s_{t - 1}) \\ \times (1 + r (t - 1)) - cost (R (s_{t - 1})) \end{array}

(41)

Let

\frac{\partial \max (γ \bar{V} (R (s_{t})))}{\partial R (s_{t})} = \hat{ζ} (s_{t}) and \frac{\partial B (\bar{R} (s_{t}))}{\partial R (s_{t})} = \frac{1 - risk_{p}}{\bar{R} (s_{t - 1})} {(\frac{\bar{R} (s_{t})}{\bar{R} (s_{t - 1})})}^{- risk_{p}}

we get

\frac{\partial \tilde{V} (R (s_{t}))}{\partial R (s_{t - 1})} = \sum_{a^{'}} \frac{\partial \tilde{V} (R (s_{t}))}{\partial R (s_{t}, a^{'})} \frac{\partial R (s_{t}, a^{'})}{\partial R (s_{t - 1})}

(42)

where

\frac{\partial R (s_{t}, a^{'})}{\partial R (s_{t - 1})} = \begin{cases} \hat{ρ} (s_{t}, a) & a^{'} = a \\ 0 & a^{'} \neq a \end{cases}

, such that

\frac{\partial \tilde{V} (R (s_{t}))}{\partial R (s_{t - 1})} = (\frac{\partial B (\bar{R} (s_{t}))}{\partial R (s_{t})} + \hat{ζ} (s_{t})) \hat{ρ} (s_{t}, a)

(43)

Then the optimal objective function is

Q^{*} (R) = \max {V (R) + T (R) - cost (R)}

(44)

where

T (R) = \frac{t * (M - R) * r_{0}}{12}

Algorithm description

We use softmax as the policy

π = \arg \max \frac{a_{i}}{\sum_{a \in A} a}

(45)

The stocks are sorted by rate of return in descending order. Taking the risk factor into account, the stock whose return exceeds that of the current holding with the least risk is selected. The available policies are restricted to a certain range, and policies that exceed the allowed range of risk factors are discarded.
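The filtering and sorting step can be sketched as follows. The sample rows use the expected return and risk factor of stocks 4, 10, 11, 21, and 24 from Table 1; the risk bound of 1.0, the dictionary layout, and the function name are assumptions made only for illustration.

```python
def select_candidates(stocks, max_risk):
    """Drop stocks above the risk bound, then sort by expected return (descending)."""
    admissible = [s for s in stocks if s["risk"] <= max_risk]
    return sorted(admissible, key=lambda s: s["expected_return"], reverse=True)


stocks = [
    {"id": 4, "expected_return": 2.64, "risk": 2.4},
    {"id": 10, "expected_return": 1.98, "risk": 0.2},
    {"id": 11, "expected_return": 4.10, "risk": 1.7},
    {"id": 21, "expected_return": 3.13, "risk": 0.74},
    {"id": 24, "expected_return": 2.51, "risk": 0.21},
]
ranked = select_candidates(stocks, max_risk=1.0)  # hypothetical risk bound
print([s["id"] for s in ranked])  # [21, 24, 10]
```

Stock 11 has the highest expected return in this sample but is discarded because its risk factor exceeds the bound, which is the behavior the paragraph above describes.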

The actor-critic based algorithm of optimal portfolio investment under uncertain environment is as follows

Actor-critic based algorithm of optimal portfolio investment under nondeterministic environment

Input: PI, C, M₀

PI: state of the portfolio investment, including the stock code, the number of shares held, the expected rate of return, the risk score, and the average turnover rate of each stock in the portfolio investment.

C: the transaction cost rate.

M₀: the total amount of funds at the initial step.

Output: w, M_t, X_t

w: the weight of the single stock.

M_t: the total amount of funds at step t.

X_t: investment at time step t.

(1) initialize action set U

(2) set threshold for deviation ε

(4) initialize time step t as 0

(5) initialize expected benefit rate r

(6) initialize risk factor risk_t, expected benefit rate r at time step t

(7) $t \leftarrow t + 1$

(8) compute expected benefits as investment times expected benefit rate at time step t: $f_{t} \leftarrow M_{t} * r_{t}$

(9) compute value function by equation (44)

(10) get current deviation at time step t : $Δ_{t} \leftarrow f_{t} - c (\sum_{i = 1}^{n} | s t o c k_{i} |)$

(11) repeat

(12) if $Δ_{t} \leq ε$ then

(13) repeat

(14) update r_t, risk_t

(15) go to step (7)

(16) until $r_{t} > r$ and $β_{t} < β$

(17) else

(18) transfer state $V_{t} (R) \leftarrow V_{t - 1} (R) + Δ_{t}$

(19) until satisfy stop condition (such as loop times or $|V_{t} (R) - V_{t - 1} (R)|$ is less than a predefined threshold for a long time)

(20) get optimal investment : $V^{*} (R) \leftarrow m a x \{V (R) + T (R) - c o s t (R)\}$

(21) return w, M_t, X_t

Experiment

The problem assumes that the risk matrix is based on historical data and follows the law that the income and the risk of a single stock move in the same direction.²² Three principles guide the investor in selecting stocks. First, the net income matrix has three important elements: the type of stock, the net income, and the amount of money invested. Second, the choices of stocks are independent of each other, and it is up to the investor to decide the type and the number of shares held. Third, total earnings consist of the interest from the bank, whose interest rate is predetermined, the transaction costs, and the profits. Besides the predefined investment funds and interest, the transaction costs are also known, while the other quantities are undetermined.

We set the total amount of funds M = 50,000, N = 30, the yearly deposit interest rate to 2.5%, T = 20, and the transaction cost rate on the invested amount to 0.05%. We sort the 25 stocks in descending order of expected rate of return. All stocks are divided into four subsets by expected rate of return, and within each subset they are sorted by turnover rate. The overall reward rates at the initial stage are shown in Table 1. The average expected return is 2.08, the average risk factor is 1.21, and the average turnover rate is 7.26. The risk factor is initially kept below 0.03 (Table 2).²³

Table 1.

Initial setting of expected return, risk factor, weight, turnover, and selection probability of all stocks.

Stock#	Expected return (%)	Risk factor (%)	Weight	Turnover (%)	Selection probability (%)
1	1.675	1.1	0.036	8.5	3.4
2	1.83	1.6	0.11	7.91	3.8
3	1.93	1.2	0.25	21.62	7.5
4	2.64	2.4	0.02	10.43	7.4
5	1.72	0.74	0.12	4.2	2.35
6	1.92	0.82	0.031	10.61	5.7
7	1.49	0.72	0.035	6.86	4.6
8	1.25	0.61	0.04	3.71	2.7
9	0.92	0.79	0.026	2.45	1.9
10	1.98	0.2	0.05	3.91	2.9
11	4.10	1.7	0.03	3.47	6.8
12	1.82	0.91	0.032	9.24	5.9
13	3.61	1.2	0.03	10.40	6.55
14	3.26	1.7	0.08	5.88	2.8
15	1.19	1.65	0.03	3.98	2.9
16	2.75	1.24	0.012	7.62	3.4
17	0.95	1.45	0.013	9.95	5.2
18	3.13	2.14	0.015	11.81	5.5
19	1.71	1.23	0.03	2.46	0.9
20	2.13	0.994	0.008	4.63	2.1
21	3.13	0.74	0.017	6.98	2.3
22	2.76	1.15	0.018	4.08	4.2
23	1.43	3.14	0.019	8.02	5.2
24	2.51	0.21	0.014	10.95	1.9
25	0.13	0.794	0.021	1.81	2.1

Table 2.

The portfolio at stage n.

Stock#	Risk factor (%)	Weight
2	0.021	0.04
5	0.022	0.16
6	0.039	0.13
7	0.0381	0.12
9	0.0201	0.05
10	0.0388	0.14
11	0.0383	0.09
13	0.0227	0.17
16	0.0275	0.03
22	0.02	0.17

Take stock #1 as an example. The monthly difference of its return rate is shown in Figure 2.

Figure 2.

The monthly difference of return rate.

The total reward of 2015 and 2016 is shown in Figure 3. The total reward kept increasing in each month of both years, indicating that the proposed investment method worked well.

Figure 3.

The total reward of 2005–2016.

Conclusion

Maximizing the benefit within given limits on risk and cash is a widely studied problem. We used approximate dynamic programming (ADP) to establish a sound Markov decision model for portfolio investment. In the proposed model-based actor-critic algorithm, the optimal value function is obtained by iteration under the limited risk range and funds, and the optimal investment of each period is then solved by dynamic programming. The algorithm achieves a stable investment, and the income increases in both the long term and the short term.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.
