Abstract
How to get maximal benefit within a range of risk in securities market is a very interesting and widely concerned issue. Meanwhile, as there are many complex factors that affect securities’ activity, such as the risk and uncertainty of the benefit, it is very difficult to establish an appropriate model for investment. Aiming at solving the curse of dimension and model disaster caused by the problem, we use the approximate dynamic programming to set up a Markov decision model for the multi-time segment portfolio with transaction cost. A model-based actor-critic algorithm under uncertain environment is proposed, where the optimal value function is obtained by iteration on the basis of the constrained risk range and a limited number of funds, and the optimal investment of each period is solved by using the dynamic planning of limited number of fund ratio. The experiment indicated that the algorithm could get a stable investment, and the income could grow steadily.
Introduction
The portfolio studies how to get the maximal expected profit controlled by a predefined risk range, or how to minimize the expected risk within a range of expected profit.1,2 For the general portfolio problem, an individual or a business makes a reasonable investment through the price, expense, and the resulting change in the amount of money, so that a limited number of funds are utilized in an optimal way. However, many uncertain factors would affect the portfolio. And moreover, the utility function is generally non-convex. 3 From the view of the long term, many conventional methods are hard to get the best allocation strategy, most of which only attaining the local optimal solution rather than the global optimal solution. It has attracted a lot of concerns about how to establish a reasonable model for the general investment problem and find a global optimal method, and has become a hot topic in recent years.
Portfolio investment is an effective way to control investment risk and attain profit. Some efforts proved that the portfolio investment could effectively reduce the risk by the mean-variance analysis, 4 and by using the proposed portfolio effective boundary model, the stock investors could find out the stocks of the lowest level of risks with the same rate of profit. 5 However, as the investment is often affected by many sophisticated factors, such as economic, social, and investor subjective attributes, which keep constantly changing over time, the so-called optimal solution by some fixed learned investment model which uses multiple investments reduce risk to get as much revenue as possible by document analysis is usually not the one that best fits the situation at the time.
Most portfolio models assumed that the profit on securities was subject to a normal distribution, and all investors were within a single investment period. The variance was used to assess the investment risk. Investors chose to invest according to the expected and variance of the yield combination. However, in the real world securities market, the rate of profit on securities does not necessarily obey the normal distribution, and the investors have their own risk aversion and risk appetite. What’s more, the investment is often influenced by uncertain factors. As a result, the difficulty of investment increased. The newly emerged reasonable models, which are classified as modern portfolio, are able to solve the problem in a better way, with the goal of optimal combination scale and the investment ratio. In a combination of investment, more portfolio investment increases the risk of the entire combination, but decreases the fee of investment, and the risk no longer fails when the number of assets in the portfolio reaches a certain number.The risk of investment combination is characterized by the covariance between the assets of the portfolio. Under specified conditions, there is a set of investment ratios that minimize the combined risk.
Most conventional portfolio investment models frequently first divide a single stage problem into a multi-stage one, and then use dynamic programming methods to solve the problem. With the development of the modern financial market, there are more and more types of investments, and as a result, the data generated by the multi-stage problem and the investment finance in the transaction cost constitute the curse of dimensionality. Moreover, the multi-stage portfolio investment is in fact a nonlinear dynamic stochastic process, which especially needs to consider the uncertain factors of each period.
In practice, dynamic programming has numerous limitations in dealing with portfolio investment problems. First, the fundamental inputs required for the model contain the variance of the current state, and the covariance between the two groups. The great number of stock combinations causes the curve of dimension as there are too many states. Secondly, data often contain errors that are caused by unreliability and uncertainty of the optimal results. The input data into the model are usually estimated by the previous data. If the data were estimated to be free from errors, the conventional model guarantees a valid portfolio, but as normally the expected data are unknown and require statistical estimation, it is very likely for the estimated input data to contain bias or even errors, which often result in incorrect product investment ratio. Thirdly, in the presence of investment transaction costs, an uncertain random information may affect the optimal solution, and cause the instability of the solution. Moreover, in the long-term investment process, if the investment cycle is within a certain period of duration, the adjustment of proportion of assets in the portfolio investment will lead to increasing transaction costs.
Many models, such as the mean-variance model, 6 the capital asset pricing model, 7 and the Black–Scholes option pricing model, 8 assume that the market is free of friction, that is, there are no tax and transaction costs. However, most securities markets are very complex, not only due to the existence of transaction costs, but also due to the liquidity and turnover, along with the correlation between the stocks. All these factors make the problem harder to be solved.
In this paper, aiming at solving the above-stated problems and related model defects generated by dynamic programming, we consider using approximate dynamic programming to establish a reasonable Markov Decision Process (MDP) model 9 for the portfolio investment problem. Under various constraints, we designed different decision-making goals for different targets, combined with actor-critics algorithm and piecewise linear function approximation method to solve the optimal investment policy, so as to achieve a certain risk within a stable investment.
MDP
MDP model
An MDP model can be denoted by a quadruple 〈S, A, f, ρ〉, where S is the state set, A is the action set, f: S × A→S is the state transition function, ρ: S × A→R is the reward function.10–12 The reward value rt+1 of time t is
According to the policy π, the state st transfers to the next state st+1
The reward function is
Given an infinite state sequence {s0, s1, … , sn, …}, the total amount of the discount reward received is
However, in practice, the steps are finite and it generally has a termination step which only represents the termination and is not involved in the computing reward. The state sequence {s0, s1, … , sn, …} is usually denoted by {s0, s1, … , sn, sn+1}, where the state si (0 ≤ i ≤ n) is the usual state and the state sn+1 represents the termination state, and correspondingly equation (4) can be redefined as
Reinforcement learning
Reinforcement learning algorithms usually evaluate the policy π using the action value function Qπ(s, a) and the state value function Vπ(s).13,14 The action value function Qπ(s, a) refers to the cumulative reward obtained by taking the action a under the state of the s and the policy π, and the state value function Vπ(s) is the cumulative reward from the specific under the policy π.
According to the definition, the action value function
15
is calculated as
According to the policy π that is learned by the algorithm, taking the action a, the subsequent state of state s can get by transition function f
We can get
Given a state action sequence {(s0, a0),(s1, a1), … , (sn, an), (sn+1, an+1)}, where the state action pair (si, ai) (0 ≤ i ≤ n) represents taking action ai in state si, and the state action pair (sn+1, an+1) only represents the termination state and action which is not used for computing the reward. We can get a discount reward for Qπ(s, a) by state action sequence {(s0, a0),(s1, a1), … , (sn, an)}
Then
Therefore, we can get the general form of Qπ(s, a)
The state value function is
Similarly, we can get the discounted reward of state function Vπ(s)
In the undeterministic environment, there are two aspects of uncertainty that should be considered. First, the successive state can be random and cannot be determined by the current state and the selected action; second, the reward could also be uncertain. The corresponding action value function Qπ(s, a) and the state value function Vπ(s) need to be changed.
In the MDP model of the undeterministic environment, the state transition function f is determined using the state transition probability function: S × A×S → [0,1] instead of determining the state transition function f in the environment. In this way, according to the probability of the strategy π, at time t, with the state st and the action at, the state transitions to the successive state st+1 ∈ St+1 are
First, consider uncertainty in state transfer. After taking the action at in the state st, and transfers to the state st+1 with a transition probability rather than directly transferring to the state st+1.
In an undeterministic environment, the state s obeys the distribution •, according to the policy π, the discounted reward from a certain starting state s0 is
Therefore, the general form of Qπ under the uncertain environment is
Second, consider uncertainty in reward. Taking into consideration of uncertainty of the action taken means that the action corresponding to the policy π can be any of the action of the action set. The undeterministic environment also needs to consider the instability of the reward value and the reward value also multiplies a probability. In this way, the reward value of the unstable environment represented by the left part of equation (18) becomes
Correspondingly, the right part of equation (18), the Qπ, is also changed as
Therefore, Qπ in undeterministic environment is
Similarly, Vπ in an unstable environment is
In the reinforcement learning, the policy that maximizes the expected accumulative reward, return, is called the optimal strategy π*, and the corresponding optimal action value function is Q* (s, a), and the optimal state value function is V*(s). Therefore, for any strategy π and the state action pair (s, a), there exists Q* (s, a) ≥ Qπ (s, a); for any policy π and state, there exists V*(s) ≥Vπ(s). Although a reinforcement learning problem may have multiple optimal strategies at the same time, the optimal action value function or the optimal state value function is unique, and is updated by
Actor-critic model
Different from value function-based reinforcement learning methods, the actor-critic algorithm has two independent structures, one for storing and updating the value function and the other for storing the updated policy.16,17 The agent selects the action according to the policy rather than the value function, where the policy part is called the actor, which performs an action, updates the value of the function, and makes use of value function to evaluate the action, and the value function part is called critic. The value function of the critic part can also use temporal difference error (TD error) which is calculated by TD learning method, e.g. Q-learning, that is able to learn directly from raw experience without a model of the system and is thus suitable for decision of dynamic and uncertain system. The framework of the actor-critic algorithm is shown in Figure 1.

An illustration diagram for framework of actor-critic algorithm.
The advantage of the actor-critic algorithm is to separate the policy from the value function, using linear approximation to learn the value function and the policy function, where the critic part is the value function approximator, 18 learning the estimate function, and then passed to the actor part. The actor part is a policy approximator, which learns a random strategy and uses the gradient-based policy update method to select the action. Then critic part uses the time difference algorithm to estimate each state value function caused by the action in the actor policy iteration process, using the estimated value function to evaluate the action selected by the actor to find the maximum value of the local or overall cumulated reward to provide more effective reinforcement feedback signal to the actor, and update the actor policy according to the random gradient. The actor-critic algorithm is much simpler than Q-learning in the computation, 19 can determine the optimal policy, and effectively applied to control the tasks.20,21
Markov chain model for investment
Prerequisites for the model
Given an investor with available cash C – a total of n kinds of investment, including stocks, funds, bonds, foreign exchange, etc. – and the maximal loss risk m_r, the goal of the investment is to get the most benefit with the least risk, the maximum benefit-risk ratio. If the benefit in a certain period of duration is treated as a dynamic process, we can establish a Markov chain model to solve the problem. The model describes the state of the portfolio investment, including stock code of holding, the number of holdings, the expected rate of return, risk score, and average turnover rate of each stocks in the portfolio investment. It can be represent by a vector
Given the investment transfer function Ø, the value of Ø is the probability of the transfer of funds. In particular, the value of f being 0 means that there is no transfer of funds and maintaining the holding of the investment; as a result, the investment at the next stage is in consistent with that the investment at the current stage. The value of Ø being 1 means that there is a transfer of funds and changing the holding of the investment; as a result, the investment of the next stage is completely inconsistent with the investment at the current moment.
Given a stock, if the agent decides to increase investment holdings, r(st, a) is positive; otherwise, if the agent determines to decrease investment holdings, r(st, a) is negative. And we use post-decision state vectors for decision
Considering the transaction costs, we define the cost of holding of financial products is 0, denoted by ca1= 0; in the state t, transferring from J financial product to another financial products needs cost, ca1xt,a1.
Therefore, the transaction cost during the investment period is
When the value of the asset changes, the relative return of the market is defined as
From state
The initial stage of R is assigned as M. The portfolio investment problem contains a variety of stochastic information, such as individual stocks and portfolio of expected rate of return, risk factors, turnover rate, which have increased the difficulty of solving the problem; the yield and risk factors are estimated as follows.
Expected returns
Given a stock pool sp,
According to equation (30), the expected return of the i-th portfolio at stage t with state st is
The update of the weight of the i-th stock in the t-th stage is
Risk factor
The risk factor of a stock is
And the risk factor of a portfolio is
In a portfolio investment, the smaller the correlation between the selected stocks results in the smaller the impact on each other, and the smaller the risk factor, where the correlation is reflected through the correlation coefficient between the two shares of the stock, denoted by
Value function
We use V(R) to represent the total value of the securities R,
From the above analysis, we can see in a portfolio investment that a greater risk factor has a smaller turnover rate, that a higher yield has a greater the weight, and that a smaller correlation of any two stocks has a smaller risk factor. Since the investor randomly chooses the same type of investment, it is hard to attain the exact value function at each stage. So it is necessary to get approximate solution by assigning different weights to different financial products. The approximate value function is denoted by
Here we use a utility function to measure the benefit by
Investors can receive a number of possible portfolio based on a certain range of risk and rate of return, and obtain the maximum utility expectations of the portfolio, which is called the optimal portfolio investment. The convexity of the indifference curve and the convexity of the effective set determine that the optimal portfolio is unique. The objective function can be written as
Let
Then the optimal objective function is
Algorithm description
We use softmax as the policy
The investment rate of return on the stock is sorted in descending order. The stock with the return that is greater than the current stock's maximum profit with least risk by considering risk factor is selected. The available policy is within a certain range, and those policies that exceed the range of risk factors will be discarded.
The actor-critic based algorithm of optimal portfolio investment under uncertain environment is as follows PI: state of the portfolio investment, including stock code of holding, the number of holdings, the expected rate of return, risk score, and average turnover rate of each stocks in the portfolio investment. C: the transaction cost rate. M0:the total amount of funds at initial step. w: the weight of the single stock. Mt: the total amount of funds at stept. Xt: investment at time step t. (1) initialize action set U (2) set threshold for deviation ε (4) initialize time step t as 0 (5) initialize expected benefit rate r (6) initialize risk factor riskt, expected benefit rate r at time step t (7) (8) compute expected benefits as investment times expected benefit rate at time step t: (9) compute value function by equation (44) (10) get current deviation at time step t : (11) (12) (13) (14) update rt, riskt (15) go to step (7) (16) (17) (18) transfer state (19) (20) get optimal investment : (21)
Experiment
The problem assumes that the risk matrix is based on historical data and follows the law of the single stock income-risk in the positive direction.22 There are three principles that are used to guide the investor for selecting the stock. First, the net income matrix has three important elements: the type of stock, net income and the amount of money invested. Second, the choice between the stock is not relevant, and it is up to the investor to decide the type and the holding number of stock. Third, total earnings include interest from the bank whose interest rate is predetermined, transaction costs, and profits. In addition to predefined investment funds and interest, the transaction costs are also known, while others are undetermined.
Given the total amount of funds M = 50,000, N = 30, the yearly deposit interest rate is 2.5%, T = 20, the business investment amount transaction cost rate is 0.05%. We sort the 25 stocks in accordance with the expected rate of return with a descending sort. All socks are divided into four subsets by the expected rate of return, where in each subset they are sorted by turnover rate. At the initial stage, the overall reward rate is as shown in Table 1. The average excepted return is 2.08, the average risk factor is 1.21, and the average turnover rate is 7.26. The risk factor is initially controlled less than 0.03 (Table 2).23
Initial setting of excepted return, risk factor, weight, turnover, selection probability of all stocks.
The portfolio at stage n.
Take stock#1 as example. The monthly difference of return rate is shown in Figure 2.

The monthly difference of return rate.
The total reward of 2015 and 2016 is shown in Figure 3 where the total reward kept increasing at each month of both years, indicating that our proposed method for investment worked well.

The total reward of 2005–2016.
Conclusion
Achieving maximizing the benefit in the scope of risk and cash is a widely studied problem by the public. We used ADP to establish a sound Markov decision model for portfolio investment. In the proposed model-based actor-critic algorithm, the corresponding optimal value function is obtained by iteration on the basis of the limited risk range and the fund, and then the optimal investment of each period is solved by using the dynamic planning. The algorithm is able to achieve a stable investment and the income increasing in both long term and short term.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
