Abstract
Recently, a class of classifiers called the relative margin machine (RMM) has been developed. RMM has shown significant improvements over its large margin counterparts on real-world problems. In binary classification, the most widely used loss function is the hinge loss, which yields the hinge loss relative margin machine; this formulation is sensitive to outliers. In this article, we propose to replace maximizing the shortest distance used in the relative margin machine with maximizing the quantile distance, employing the pinball loss, which is related to quantiles, for classification. The proposed method is less sensitive to noise, especially feature noise around the decision boundary. Meanwhile, the computational complexity of the proposed method is similar to that of the relative margin machine.
Introduction
Support vector machines (SVMs) are one of the leading techniques for pattern classification and function approximation. SVMs seek an optimal hyperplane that maximizes the margin between two classes, where the margin is defined as the minimum distance from the class samples to the decision hyperplane. While large margin methods have been very successful over the last decades, their solutions may be misled by the spread of the data and preferentially separate classes along large spread directions. In other words, their solutions can easily be perturbed by an affine or scaling transformation of the input space. For instance, by transforming all training and testing inputs by an invertible linear transformation, the SVM solution and its resulting classification performance can vary significantly. Controlling the spread of the data has therefore been an important theme in classification problems. Recently, Shivaswamy and Jebara1 proposed an effective and less computationally expensive way to incorporate the spread of the data: second-order information about the data when projected onto the line defined by the weight vector w. The method is called the relative margin machine (RMM), which can cope with the presence of arbitrary affine transformations. In Shivaswamy and Jebara,2 the distance of the data from the separating hyperplane is bounded from above by a scalar R; in this way, RMM maximizes the relative margin (relative to that upper bound) of the data from the separating hyperplane. The motivation behind this line of research is that a large margin on its own is not a meaningful quantity; a better way to measure the margin is in relation to the spread of the data. RMM has shown significant improvements over its large margin counterparts on real-world problems.
Classification problems may contain both class noise and attribute noise. Noise can reduce system performance in terms of classification accuracy. In binary classification, the most widely used loss function is the hinge loss, which results in the hinge loss SVM. The sensitivity to noise, or the instability under re-sampling, stems from the hinge loss used in the SVM. Like the classical SVM, RMM still relies on the hinge loss and is therefore inherently sensitive to noise. There have been many approaches to noise handling. A commonly used approach constructs robust models by assigning weights to the errors of the training samples.3–6 Another approach constructs robust models via second-order cone programming.7–9 Recently, some algorithms based on class means and the total margin have been proposed.10,11 Other related work can be found in the literature.12–14 Huang et al.15 proposed to replace maximizing the shortest distance used in the classical SVM with maximizing the quantile distance, which preserves the formulation of the classical SVM. In Huang et al.,15 the pinball loss, which is related to quantiles, was used in classification, and the SVM with the pinball loss was found to share many good properties of the hinge loss SVM.
Inspired by the above studies, we propose a robust relative margin machine (RRMM) to improve the robustness of RMM to outliers. In form, the only difference between RMM and the proposed method is that the pinball loss is used instead of the hinge loss. Similar to RMM, RRMM introduces constraints that bound the spread of the projections of the data; on the other hand, introducing the pinball loss into classification brings noise insensitivity. Numerical experiments show that the proposed algorithm gives promising results.
The article is organized as follows. In the “SVM and RMM” section, we first present the primal formulations of the SVM and the RMM. In the “RRMM” section, we formulate the robust relative margin machine (RRMM) and its dual optimization. The “Experiments” section reports experiments on benchmark datasets to investigate the effectiveness of the RRMM. The last section concludes this article.
SVM and RMM
SVMs represent a learning technique introduced in the framework of structural risk minimization and the theory of VC bounds. In the simplest binary pattern recognition tasks, SVMs use a linear separating hyperplane to create a classifier with a margin. A quadratic programming formulation of the SVM is given as follows:
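In standard notation, this quadratic program is the soft-margin SVM primal, with weight vector w, bias b, slack variables ξ_i, and trade-off parameter C:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{s.t.}\quad
y_{i}\left(w^{\top}x_{i}+b\right) \ge 1-\xi_{i},\;\;
\xi_{i} \ge 0,\;\; i=1,\dots,n.
```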
The solution can be found as
In Shivaswamy and Jebara,1 the SVM was modified such that the projections of the training examples remain bounded. A parameter was introduced that helps trade off between a large margin and a small spread of the projections of the data. The RMM can be formulated as the following model:
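Following the description above, the RMM of Shivaswamy and Jebara1 augments the SVM primal with constraints bounding the projections of the training examples by the scalar R; in the same notation it is commonly stated as:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{s.t.}\quad
y_{i}\left(w^{\top}x_{i}+b\right) \ge 1-\xi_{i},\;\;
\xi_{i} \ge 0,\;\;
\left|w^{\top}x_{i}+b\right| \le R,\;\; i=1,\dots,n.
```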
An illustration of RMM.
The formulation has one extra parameter R in addition to the SVM parameter C. When R is large enough, the above RMM gives the same solution as the SVM. Also note that only settings of
RRMM
Pinball loss
The pinball loss function is given as follows, which can be regarded as a generalized
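For concreteness, a minimal implementation of the pinball loss as used for classification in Huang et al.,15 namely L_τ(u) = u for u ≥ 0 and L_τ(u) = −τu for u < 0, so that τ = 0 recovers a hinge-type loss:

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss L_tau(u): equals u for u >= 0 and -tau*u for u < 0.

    With tau = 0 this reduces to max(u, 0), i.e. a hinge-type loss;
    tau > 0 also penalizes negative arguments, which is what brings
    the noise insensitivity discussed in the text.
    """
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)
```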
The idea of RRMM can be formulated into the following optimization problem:
The above formulation is not solved directly since doing so is computationally impractical: solving the dual of (3) requires semi-definite programming, which prevents the method from scaling beyond a few hundred data points. Instead, we propose an equivalent optimization that gives the same solution but requires only quadratic programming. This is achieved by simply replacing the constraint
The problem is further equivalently transformed into
The parameter C controls the trade-off between the model complexity and the proportion of non-separable samples. Notice that when
In order to solve this problem, we construct the Lagrangian function
Minimize it with respect to
Substituting (7) into the Lagrangian function (6) and combining with (8) and (9), we have the dual of optimization problem (5) as follows.
By solving the above dual problem, we obtain the Lagrangian vector
Experiments
In the experiments, we compared the performance of the RRMM with the hinge loss SVM, the pinball loss SVM (PSVM), and the RMM on several publicly available data sets from the UCI machine learning repository.18 Numerous experiments were run on these datasets to assess the impact of noise on classification accuracy.
Most of the datasets we used do not actually contain noise, so we add attribute noise artificially: the attributes are corrupted by zero-mean Gaussian noise to construct a noisy training set, while the test set is kept clean. For each attribute, the ratio of the noise variance to the attribute variance, denoted r, is set to 0 (i.e. noise-free), 0.03, and 0.1.
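The noise-injection procedure above can be sketched as follows; `add_attribute_noise` is a hypothetical helper name, and the variance ratio r matches the definition in the text (noise variance = r times the per-attribute variance):

```python
import numpy as np

def add_attribute_noise(X, r, rng=None):
    """Corrupt each attribute (column) of X with zero-mean Gaussian noise
    whose variance is r times that attribute's sample variance.

    r = 0 leaves X unchanged (the noise-free setting in the text).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)  # per-attribute standard deviation
    noise = rng.normal(0.0, 1.0, size=X.shape) * np.sqrt(r) * std
    return X + noise
```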
In each algorithm, the optimal parameter C is searched from the set
The performance of these algorithms depends heavily on the choice of parameters. To select them, we tune the parameters by tenfold cross validation: the data set is randomly split into 10 subsets; one subset is held out as a test set and the remaining nine form the training set. This process is repeated 10 times, and the average of the 10 test results is used as the performance measure. While keeping the proportion of examples in each class constant, the parameters corresponding to the lowest test errors are used to predict the corresponding test set. On the other hand, the dual formulation of the RRMM has
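The stratified tenfold split described above can be sketched as follows; `stratified_kfold_indices` is a hypothetical helper, a minimal sketch of the protocol rather than the authors' code:

```python
import numpy as np

def stratified_kfold_indices(y, k=10, rng=None):
    """Split sample indices into k folds while keeping the class
    proportions roughly constant in each fold (stratified k-fold).
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    folds = [[] for _ in range(k)]
    # Shuffle each class separately and deal its indices round-robin
    # across the folds, so every fold gets its share of each class.
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return [np.array(sorted(f)) for f in folds]
```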
Experimental comparison of SVM, RMM, PSVM, and RRMM on classification accuracy and standard deviation.
Conclusions and discussion
In this article, we introduced the pinball loss into the relative margin framework, resulting in the robust relative margin machine (RRMM), and derived its dual formulation. Compared with the hinge loss RMM, the major advantage of the proposed method is that the RRMM is less sensitive to noise, especially feature noise around the decision boundary.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Natural Science Foundation of China (No. 61170174) and Tianjin Training plan of University Innovation Team (No.TD 12-5016).
