Abstract
Financial markets are notoriously complex environments, presenting vast amounts of noisy, yet potentially informative data. We consider the problem of forecasting financial time series from a wide range of information sources using online Gaussian Processes with Automatic Relevance Determination (ARD) kernels. We measure the performance gain, quantified in terms of Normalised Root Mean Square Error (NRMSE), Median Absolute Deviation (MAD) and Pearson correlation, from fusing each of four separate data domains into a single probabilistic model: time series technicals, sentiment analysis, options market data and broker recommendations. We show evidence that ARD kernels produce meaningful feature rankings that help retain salient inputs and reduce input dimensionality, providing a framework for sifting through financial complexity. In particular, our findings highlight the critical value of options data in mapping out the curvature of price space and inspire an intuitive, novel direction for research in financial prediction.
Introduction
One of the central challenges in financial forecasting is determining where to look. A financial instrument’s time series history, comparables and derivatives, news articles and opinion pieces all have the potential to influence price evolution. Developing a robust framework for knowledge extraction from disparate, jointly informative datasets remains an open challenge for the finance and machine learning communities.
In this paper we forecast daily returns on the S&P500 index, a broad market benchmark for US equities commonly viewed as a gauge of financial stability. The S&P500 is a market capitalisation-weighted index of the 500 largest corporations in the US, covering the full range of technology, consumer goods, utilities and financial services companies. It is one of the most visible benchmarks in the world, actively traded by buy-and-hold mutual funds and high-frequency hedge funds alike.
We begin by postulating four broad categories in which to search for salient explanatory variables: time series technicals, sentiment analysis, options market data and broker recommendations.
We show that predictive performance improves when combining signals from each domain, and provide a principled framework for the triage of inputs by implementing Automatic Relevance Determination (ARD) in the covariance parametrisation of an adaptive Gaussian Process model. The ranking that emerges from this analysis defies expectations, and encourages further investigation of options markets and the price space representation.
Prior work
We first establish context for our study by reviewing relevant prior research in the domain of financial prediction using each of our various data streams. We then turn our attention to common practice multivariate analysis techniques.
Technical analysis was one of the earliest forms of financial forecasting, first appearing in merchant accounts of the Dutch markets in the 17th century. Formalised as a discipline in the 1940s (Edwards and Magee, 1946), it involves the use of price and volume time series to make directional forecasts. It has been extensively studied in prior regression analyses (Lo et al., 2000) demonstrating the incremental gains in predictive performance provided by identifying specific patterns in price history. Technicals-driven Gaussian Process regression has been applied to forecasting time-series in a wide range of asset classes, including stock market prices (Farrell and Correa, 2007), stock market volatility (Ou and Wang, 2009) and commodity spreads (Chapados and Bengio, 2007). These studies show that model performance is highly reliant on the size of the training set.
Literature on financial prediction using text data has proliferated in recent decades, closely tracking advances in the field of natural language processing. The methodology in this domain has typically involved converting words or phrases into numerical gauges of sentiment with which to predict stock market direction (Nikfarjam et al., 2010). Modelling techniques have ranged from simple Naive-Bayes or Support Vector Machine classifiers to more advanced algorithms built on deep learning. More recent work in sentiment composition has sought to predict economic indicators like the U.S. Non-Farm Payrolls using newsflow data. These studies show evidence that accurate parsing of news articles can produce state-of-the-art forecasts for market-moving announcements (Levenberg et al., 2013, 2014).
Research on the interactions between stock and options market prices has been scarce, though early attempts were made to assess correlation in the volume data. Studies indicated that call options flow leads underlying shares flow with a one-day lag, lending credence to the hypothesis of a sequential flow of information between the options and stock markets (Anthony, 1998).
Multiple studies have been conducted to ascertain the influence of buy and sell recommendations on stock prices. Research on equity analyst reports shows significant, systematic but asymmetric drift in the aftermath of broker actions, with short-lived, modest gains following upgrades but durable, material sell-offs following downgrades (Womack, 1996). The magnitude of these changes depended not only on the action (upgrade vs. downgrade), but also on the reputation of the analyst, the size of their brokerage firm and the size of the recommended firm (Stickel, 1995).
Various techniques have been applied to multivariate analysis in finance, relying on Independent Component Analysis to reduce dimensionality (Lu et al., 2009) and elliptical copula models to capture input dependencies (Biller and Corlu, 2012). These studies find incremental information gain in using multiple time series from the same domain. By contrast, our work focuses on heterogeneous data fusion and modelling inter-domain dependencies.
Data
In this section we detail the features considered in each of the four domains, all of which will be used to predict next-day returns on the S&P500 index.
Technical indicators
Market technicals are metrics derived directly from the price history p(t) of a financial instrument. We consider four features commonly watched in industry (Taylor and Allen, 1992): the Moving Average Convergence Divergence (MACD), its signal line, the previous day's return and the 50-day Simple Moving Average.
We do not believe that the formulation of these metrics is inherently meaningful, but rather that standardised definitions provide precise, measurable thresholds at which chartist market participants will react. Including these features will allow our model to identify those thresholds and thereby anticipate technically-led order flow.
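The paper's exact formulas for these indicators are not reproduced in this excerpt, but the standardised industry definitions can be sketched as follows. The 12/26-day EMAs for MACD and the 9-day signal line are the conventional parameters, assumed here rather than taken from the source.

```python
import numpy as np

def ema(p, span):
    """Exponentially weighted moving average with smoothing factor 2/(span+1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty(len(p), dtype=float)
    out[0] = p[0]
    for t in range(1, len(p)):
        out[t] = alpha * p[t] + (1 - alpha) * out[t - 1]
    return out

def technical_features(p):
    """MACD, its signal line, previous-day return and 50-day SMA for a price series p.
    Parameter choices (12/26/9, 50) follow common convention, not the paper's text."""
    p = np.asarray(p, dtype=float)
    macd = ema(p, 12) - ema(p, 26)                       # standard 12/26-day MACD
    signal = ema(macd, 9)                                # 9-day EMA of the MACD
    prev_ret = np.concatenate(([np.nan], p[1:] / p[:-1] - 1.0))
    sma50 = np.full(len(p), np.nan)
    for t in range(49, len(p)):
        sma50[t] = p[t - 49:t + 1].mean()                # 50-day simple moving average
    return macd, signal, prev_ret, sma50
```

Each series is aligned with the price index, so the four features can be stacked directly into the regression design matrix.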
Sentiment analysis
While factual newsflow is significant, it is specifically the polarity of its interpretation by markets - as beats or disappointments - that drives market movement. Market sentiment was captured using indicators derived from both Twitter and Stocktwits, a social media site dedicated to real-time discussions of financial markets and actively frequented by S&P500 retail investors. Two further metrics were derived by tracking the daily changes in the sentiment indices.
Options-based modelling of price space
As a province reserved for more sophisticated traders, options market open interest volumes offer a window into the expectations of the most experienced, well-capitalised participants. As strike-sensitive instruments¹, options data also allow us to gauge how these expectations vary at different price levels, motivating the representation of price as an inhomogeneous space with identifiable regions of high directional bias or variance. Illustratively, high open interest (OI) in call options coupled with low open interest in put options indicates experts pre-positioning for a rally. By contrast, high open interest in straddles² at a given strike implies low consensus among experts about the direction of future moves.
To capture the directionality and viscosity implied by open interest data, we constructed two metrics. Directionality measures the daily change in call minus put open interest at strike
The parameter
In parametrising viscosity, we make three modelling assumptions. Firstly, the pinning effect of high straddle open interest is at its greatest for options very near their expiry date. Secondly, this effect decays as live prices move away from the straddle's strike.
We expect a significant negative correlation between viscosity and the magnitude of S&P500 next-day returns; as such tuning
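The paper's equations for the two options metrics are not reproduced in this excerpt; the sketch below illustrates the stated ideas under assumed functional forms: directionality as the daily change in aggregate call-minus-put open interest, and viscosity as straddle open interest weighted by an exponential decay away from the strike (the decay scale `scale` is a hypothetical tuning parameter) and a hyperbolic decay in days to expiry.

```python
import numpy as np

def directionality(call_oi, put_oi):
    """Daily change in call-minus-put open interest, summed over strikes.
    call_oi, put_oi: arrays of shape (days, strikes). First day is NaN."""
    net = call_oi - put_oi
    return np.diff(net.sum(axis=1), prepend=np.nan)

def viscosity(straddle_oi, strikes, spot, days_to_expiry, scale=10.0):
    """Straddle-pinning proxy (assumed form): open interest weighted by the
    proximity of the spot price to each strike, strongest near expiry."""
    w_strike = np.exp(-np.abs(strikes - spot) / scale)   # decays away from strike
    w_expiry = 1.0 / (1.0 + days_to_expiry)              # greatest at expiry
    return float(np.sum(straddle_oi * w_strike) * w_expiry)
```

Under these forms, viscosity peaks when the spot sits on a heavily-traded straddle strike close to expiry, matching the pinning intuition in the text.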
Broker recommendations
Market analysts issue recommendations on individual stocks rather than on the broad market - partly a reflection of the incentive structure for brokerage firms: commissions are substantially larger for actively managed portfolios than for passive index-trackers.
To overcome this, we construct an index of broker opinions, based on a weighted sum of broker recommendations across the top 100 stocks in the S&P500. These account for 63% of the index’s market capitalisation, and broker actions on these household names have a disproportionate effect on the index as a whole. Two indices were built from these weighted sums, to track both changes in analyst opinion (upgrades and downgrades) and the consensus state (buy, hold or sell).
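The weighting scheme and recommendation coding are not spelled out in this excerpt; a minimal sketch, assuming market-cap weights and the common buy = +1 / hold = 0 / sell = -1 coding, might look as follows.

```python
import numpy as np

def broker_consensus_index(recs, weights):
    """Cap-weighted consensus state across the top index constituents.
    recs: per-stock consensus coded buy=+1, hold=0, sell=-1 (assumed coding).
    weights: market-cap weights (normalised here, so they need not sum to 1)."""
    recs, weights = np.asarray(recs, float), np.asarray(weights, float)
    return float(np.dot(recs, weights) / weights.sum())

def broker_change_index(recs_today, recs_yesterday, weights):
    """Cap-weighted net upgrades (+) and downgrades (-) between two days."""
    change = np.asarray(recs_today, float) - np.asarray(recs_yesterday, float)
    return broker_consensus_index(change, weights)
```

Computed daily over the top 100 S&P500 stocks, these two series correspond to the consensus-state and change indices described above.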
ARD Gaussian Processes
We briefly recall the fundamentals of Gaussian Process modelling before describing ARD kernels and the associated notion of relevance score. For a comprehensive treatment of Gaussian Processes, please refer to Rasmussen and Williams (2006).
A Gaussian Process is a collection of random variables, any finite subset of which has a joint Gaussian distribution. Gaussian Processes are fully parametrised by a mean function and covariance function, or kernel. Given a real process
Inputs are commonly centered during pre-processing. For a given training set
To counter overfitting, we introduce
The covariance function above employs an isotropic Manhattan norm as the similarity measure between two vectors in input space. This assumes that a single, global characteristic length scale applies uniformly to every input dimension.
In ARD kernels, the scalar input length scale is replaced by a vector of length scales, one per input dimension, allowing the model to learn how sensitive the output is to each individual feature.
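A minimal sketch of a Matérn 3/2 ARD kernel is given below. Since the paper's base kernel uses a Manhattan norm and its exact ARD form is not reproduced here, we assume the ARD version scales each coordinate of the Manhattan distance by its own length scale.

```python
import numpy as np

def matern32_ard(X1, X2, lengthscales, variance=1.0):
    """Matern 3/2 kernel with one length scale per input dimension (ARD):
    k(x, x') = v * (1 + sqrt(3) r) * exp(-sqrt(3) r),
    where r is the per-dimension length-scale-weighted Manhattan distance
    (assumed form, matching the paper's Manhattan-norm base kernel)."""
    ell = np.asarray(lengthscales, dtype=float)
    d = np.abs(X1[:, None, :] - X2[None, :, :]) / ell    # scaled |differences|
    r = d.sum(axis=-1)                                   # weighted L1 distance
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)
```

A short learned length scale in dimension i makes the kernel decay quickly along that axis, which is precisely what marks feature i as relevant.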
ARD algorithms have been successfully used in research ranging from bioinformatics (Campbell and Tipping, 2002) to seismography (Oh et al., 2008), providing an effective tool for pruning large numbers of irrelevant features. A limitation of the methodology as presented is that the relevance scores only provide a relative ranking between the features of a model. Two equally meaningless inputs will have relevance scores of similar magnitude, as would two equally meaningful features. On their own, these scores provide little basis for performing dimensionality reduction. To overcome this, we include in each regression a baseline feature composed of standard Gaussian noise. We assert that a meaningful input should have a relevance score that is at least two orders of magnitude greater than noise, so by computing the Relevance Ratio we can determine which features are objectively informative.
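Equations (15) and (16) are not reproduced in this excerpt; the sketch below assumes the common convention that a feature's relevance score is its inverse learned length scale, with the Relevance Ratio taken against the injected Gaussian-noise baseline.

```python
import numpy as np

def relevance_scores(lengthscales):
    """ARD relevance score (assumed convention): inverse learned length scale.
    A short length scale means the output varies quickly along that input,
    i.e. the input is relevant."""
    return 1.0 / np.asarray(lengthscales, dtype=float)

def relevance_ratios(lengthscales, noise_index):
    """Relevance of each feature relative to the standard-Gaussian-noise
    baseline feature; the paper retains features whose relevance exceeds
    the baseline by roughly two orders of magnitude (ratio >= ~100)."""
    scores = relevance_scores(lengthscales)
    return scores / scores[noise_index]
```

For example, learned length scales of [0.1, 10, 100] with the noise feature last give ratios [1000, 10, 1], so only the first feature clears the two-orders-of-magnitude bar.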
Results
In this section we outline the findings of our analysis. We begin by discovering relevance hierarchies in the data using ARD, before proceeding with model testing and benchmarking. Model performance metrics were derived using market data from Jan-13 to Dec-14 for training and Jan-15 to Apr-15 for testing. The price history of the S&P500 Index for this period is provided in Fig. 1.
Correlation analysis
We begin by running a correlation analysis on each feature of the training set, grouped by domain, and collect the findings in Table 1. In most cases, rank correlations are stronger than linear correlations, though the differences are too marginal to alter the analysis. For brevity, in all ensuing sections we adopt the linear definition of correlation.
We next outline a methodology for determining whether an observed sample correlation is significant. Given two independent random variables
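The test statistic itself is elided in this excerpt; the sketch below uses the standard result that, for independent variables, the sample Pearson correlation r over n pairs yields t = r sqrt((n-2)/(1-r²)), which follows a Student-t distribution with n-2 degrees of freedom.

```python
import math

def corr_t_stat(r, n):
    """t statistic for testing H0: rho = 0 given a sample Pearson
    correlation r over n paired observations; under independence it is
    Student-t distributed with n-2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

def is_significant(r, n, t_crit=1.96):
    """Two-sided test at roughly the 5% level, using the normal
    approximation to the t distribution for large n."""
    return abs(corr_t_stat(r, n)) > t_crit
```

With the roughly 500 daily observations in the training window, even modest sample correlations clear the significance threshold.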
The use of 4 distinct domains stemmed from the belief that, by virtue of tracking different market agents, these datasets will exhibit low correlation with each other and therefore enhance the predictive power of a combined model. In Table 2 we measure the correlation between input pairs in the training set, and find indeed that correlations across domains are low.
Using training data from Jan-13 to Dec-14, we implement separate Gaussian Process regressions for each data domain using the Matérn 3/2 ARD kernel. This allows both a ranking of feature relevance within each domain, and bivariate visualisations of the mean surfaces learned from the two top-ranked features in each model. Relevance is ranked on the basis of Relevance Score and Relevance Ratio defined in Equations (15) and (16) respectively, with results for market technicals provided in Table 3.
Whilst the MACD-derived signal line and previous day’s return explained little of the variation in output, the 50-day Simple Moving Average was salient, as was the MACD. Figure 2 provides a heatmap of return variation based on the two top features of the technical domain, MACD and 50dMA(t), indexed by percentile score. As a first approximation, MACD and next-day returns move inversely: cheapness with respect to recent history correlates with next-day gains.
Table 4 provides an analysis of sentiment feature relevance. Stocktwits sentiment data is significantly more informative than Twitter data, to the point where the Twitter feature is irrelevant and can be discarded. As a social media site focused on finance, Stocktwits likely reflects pure market sentiment, whereas Twitter captures public opinion on a wide range of market-irrelevant issues (celebrity gossip, local politics). The 1-day change variables were also meaningless and can be discarded from subsequent analysis. Notably, the mean function learned through GP regression calls into question the wisdom of crowds: as Fig. 3 indicates, optimism on Stocktwits foreshadows broad market declines, and vice versa. Sentiment analysis lends credence to the Warren Buffett adage: "be greedy when others are fearful, fearful when others are greedy."
Relevance for options-derived metrics is provided in Table 5. Directionality and viscosity were almost equally relevant, with positive directionality - that is, experts pre-positioning for rallies via call options - anticipating positive next-day returns. Viscosity instead tracked areas of return compression, and acted as a form of friction. This manifests in Fig. 4 as areas of peak return coinciding with high directionality and low viscosity.
The relevance of broker actions is assessed in Table 6. Broker upgrades and downgrades are infrequent occurrences, resulting in a sparse Broker Change input. The Matérn 3/2 kernel is capable of learning the non-smooth behaviour exhibited in Fig. 5, but with relevance metrics indistinguishable from Gaussian noise, it is unlikely this domain will provide meaningful improvements to a combined model. This suggests that analyst opinions have little predictive power, and merely reflect market changes after they have occurred.
Retaining only the salient features, we run a high-dimensional Gaussian Process regression on relevant inputs from all domains simultaneously. The results, compiled in Table 7, broadly mirror our expectations from the correlation analysis, highlighting the ARD framework’s ability to discover structure in the data.
Model performance
Having established a method for identifying salient features, we now turn our attention to the predictive performance of ARD Gaussian Processes using each data domain. We separately test the predictive value of each domain before fusing them into a combined model, and measure performance according to the Pearson correlation between forecasts and observations, Median Absolute Deviation and Normalised Root Mean Square Error, where the normalisation constant is the standard deviation of the observations. The results are provided in Table 8.
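The three performance measures can be sketched as below. The NRMSE normalisation by the standard deviation of the observations follows the text; the paper does not spell out its MAD convention, so we take the median absolute forecast error here as an assumption.

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Pearson correlation between forecasts and observations, Median
    Absolute Deviation (interpreted here as median absolute forecast error)
    and NRMSE normalised by the standard deviation of the observations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    pearson = np.corrcoef(y_true, y_pred)[0, 1]
    mad = np.median(np.abs(err))
    nrmse = np.sqrt(np.mean(err ** 2)) / np.std(y_true)
    return pearson, mad, nrmse
```

An NRMSE below 1 means the model beats the trivial predict-the-mean baseline, which is the natural yardstick for noisy daily returns.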
The model registers monotonic improvements in performance when additional features are included, with the options market data providing the greatest gain. Moreover, it strictly outperforms traditional financial models such as look-ahead AR Processes on measures of ground-truth correlation and NRMSE. Benchmark performance levels are included in Table 9.
Over timeframes much larger than our study's 28-month window, supervised batch algorithms in finance run the risk of failing to recognise significant changes in the landscape fast enough. For example, Stocktwits sentiment's relevance would have been very low when the site was launched in 2009, and grew in tandem with the size of its user base. A solution to this challenge involves adaptively learning the kernel hyperparameters from recent history only, removing the impact of old, potentially irrelevant data. The evolution from offline to online, adaptive learning follows straightforwardly: we define a rolling window of recent observations and re-learn the hyperparameters at each step using only the data inside that window.
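A minimal sketch of this rolling scheme is given below, using a self-contained Matérn 3/2 GP with fixed hyperparameters for brevity (the paper re-learns them adaptively at each step, which this sketch omits; `ell` and `noise` are assumed values).

```python
import numpy as np

def matern32_ard(X1, X2, ell):
    """Matern 3/2 ARD kernel over a length-scale-weighted L1 distance."""
    d = np.abs(X1[:, None, :] - X2[None, :, :]) / np.asarray(ell, float)
    r = d.sum(axis=-1)
    return (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def rolling_gp_forecasts(X, y, window, ell, noise=1e-2):
    """One-step-ahead GP mean forecasts, refit on the last `window` points
    only, so stale observations drop out of the model at each step."""
    preds = []
    for t in range(window, len(y)):
        Xw, yw = X[t - window:t], y[t - window:t]
        K = matern32_ard(Xw, Xw, ell) + noise * np.eye(window)
        k_star = matern32_ard(X[t:t + 1], Xw, ell)       # test-train covariances
        preds.append(float(k_star @ np.linalg.solve(K, yw)))
    return np.array(preds)
```

Each forecast uses only the most recent `window` observations, so a regime change ceases to contaminate predictions once it has scrolled out of the window.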
Predictive performance dips to impractical levels below the
In Table 11 we provide performance metrics on benchmark adaptive models such as one-step-ahead AR and autoregressive Kalman Filters with varying lags, and find the Adaptive ARD GP yields both superior results and the benefit of automatic, interpretable feature selection.
Conclusions
Extracting information from multiple domains presents the dual challenge of identifying both what to pick and how to mix. Our results provide a principled framework for reducing input dimensionality through iterative ARD GP regression. We show measurable gains in predictive performance from fusing multiple data streams together in an online setting, and draw particular attention to the relevance of options market data and the implicitly inhomogeneous representation of price space. As an untapped, feature-rich, strike-dependent dataset shaped by the interactions of informed players, options market salience provides a strong mandate for further research into data-driven modelling of price space and its implications for financial forecasting.
Footnotes
1
Call and put option prices are calculated via the Black-Scholes formula and depend on a ’strike’ level that defines the price at which the option owner may buy or sell the underlying asset.
2
A long straddle position refers to the ownership of both a call and a put option at the same strike price and expiry date: it does not express a directional view, and benefits so long as the underlying asset deviates sufficiently from the strike before expiry.
