Abstract
Recently, a class of classifiers called the relative margin machine (RMM) has been developed. RMM has shown significant improvements over its large margin counterparts on real-world problems. In binary classification, the most widely used loss function is the hinge loss, which yields the hinge loss relative margin machine; this formulation is sensitive to outliers. In this article, we propose to replace maximizing the shortest distance used in the relative margin machine with maximizing the quantile distance, employing the pinball loss, which is related to quantiles, for classification. The proposed method is less sensitive to noise, especially feature noise around the decision boundary. Meanwhile, the computational complexity of the proposed method is similar to that of the relative margin machine.
Introduction
Support vector machines (SVMs) are one of the leading techniques for pattern classification and function approximation. SVMs seek an optimal hyperplane that maximizes the margin between two classes, where the margin is defined as the minimum distance from the class samples to the decision hyperplane. While large margin methods have been very successful over the last decades, their solutions may be misled by the spread of the data and preferentially separate classes along large spread directions. In other words, their solutions can easily be perturbed by an affine or scaling transformation of the input space. For instance, by transforming all training and testing inputs by an invertible linear transformation, the SVM solution and its resulting classification performance can vary significantly. Controlling the spread of the data has therefore been an important theme in classification problems. Recently, Shivaswamy and Jebara1 proposed an effective and less computationally expensive way to incorporate the spread of the data: second-order information about the data when projected onto the line defined by the weight vector w. The method is called the relative margin machine (RMM), which can cope with the presence of arbitrary affine transformations. In Shivaswamy and Jebara,2 the distance of the data from the separating hyperplane is bounded from above by a scalar R; in this way, RMM maximizes the relative margin (relative to that upper bound) of the data from the separating hyperplane. The motivation behind this line of research is that a large margin on its own is not a meaningful quantity; a better way to measure the margin is in relation to the spread of the data. RMM has shown significant improvements over its large margin counterparts on real-world problems.
Classification problems may contain both class noise and attribute noise. Noise can reduce system performance in terms of classification accuracy. In binary classification, the most widely used loss function is the hinge loss, which results in the hinge loss SVM. The sensitivity to noise, or the instability under re-sampling, stems from the hinge loss used in the SVM. Like the classical SVM, RMM still relies on the hinge loss and is therefore inherently sensitive to noise. There have been many approaches to noise handling. A commonly used approach constructs robust models by assigning weights to the errors of the training samples.3–6 Another approach constructs robust models via second-order cone programming.7–9 Recently, some algorithms based on class means and the total margin have been proposed.10,11 Other related work can be found in the literature.12–14 Huang et al.15 proposed to replace maximizing the shortest distance used in the classical SVM with maximizing the quantile distance, which preserves the formulation of the classical SVM. In Huang et al.,15 the pinball loss, which is related to quantiles, was used in classification, and the SVM with the pinball loss was found to share many good properties of the hinge loss SVM.
Inspired by the above studies, we propose a robust relative margin machine (RRMM) to improve the robustness of RMM to outliers. In form, the only difference between RMM and the proposed method is that the pinball loss is used instead of the hinge loss. Similar to RMM, RRMM introduces constraints that bound the spread of the projections of the data; on the other hand, introducing the pinball loss into classification brings noise insensitivity. Numerical experiments show that the proposed algorithm gives promising results.
The article is organized as follows. In the “SVM and RMM” section, we first present the primal formulations of the SVM and the RMM. In the “RRMM” section, we formulate the robust relative margin machine (RRMM) and its dual optimization. The “Experiments” section reports experiments on benchmark datasets to investigate the effectiveness of the RRMM. The last section concludes this article.
SVM and RMM
SVMs represent a learning technique introduced in the framework of structural risk minimization and the theory of VC bounds. In the simplest binary pattern recognition tasks, SVMs use a linear separating hyperplane to create a classifier with a margin. A quadratic programming formulation of the SVM is given as follows:
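In standard notation, this quadratic program is the soft-margin SVM primal, with weight vector w, bias b, slack variables ξ_i, and trade-off parameter C:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{s.t.}\quad
y_{i}\left(w^{\top}x_{i}+b\right) \ge 1-\xi_{i},\;\;
\xi_{i} \ge 0,\;\; i=1,\dots,n.
```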
The solution can be found as
In Shivaswamy and Jebara,1 the SVM was modified such that the projections of the training examples remain bounded. A parameter was introduced that helps trade off between a large margin and a small spread of the projections of the data. The RMM can be formulated as the following model:
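Following the description above, the RMM of Shivaswamy and Jebara1 augments the SVM primal with constraints bounding the projections of the training examples by the scalar R; in the same notation it is commonly stated as:

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\xi_{i}
\quad\text{s.t.}\quad
y_{i}\left(w^{\top}x_{i}+b\right) \ge 1-\xi_{i},\;\;
\xi_{i} \ge 0,\;\;
\left|w^{\top}x_{i}+b\right| \le R,\;\; i=1,\dots,n.
```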
An illustration of RMM.
The formulation has one extra parameter R in addition to the SVM parameter C. When R is large enough, the above RMM gives the same solution as the SVM. Also note that only settings of
RRMM
Pinball loss
The pinball loss function is given as follows, which can be regarded as a generalized
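For concreteness, a minimal implementation of the pinball loss as used for classification in Huang et al.,15 namely L_τ(u) = u for u ≥ 0 and L_τ(u) = −τu for u < 0, so that τ = 0 recovers a hinge-type loss:

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss L_tau(u): equals u for u >= 0 and -tau*u for u < 0.

    With tau = 0 this reduces to max(u, 0), i.e. a hinge-type loss;
    tau > 0 also penalizes negative arguments, which is what brings
    the noise insensitivity discussed in the text.
    """
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, u, -tau * u)
```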
The idea of RRMM can be formulated into the following optimization problem:
The above formulation is not solved directly since doing so is computationally impractical: solving the dual of (3) requires semi-definite programming, which prevents the method from scaling beyond a few hundred data points. Instead, we propose an equivalent optimization that gives the same solution but requires only quadratic programming. This is achieved by simply replacing the constraint
The problem is further equivalently transformed into
The parameter C controls the trade-off between the model complexity and the proportion of non-separable samples. Notice that when
In order to solve this problem, we construct the Lagrangian function
Minimize it with respect to
Substituting (7) into the Lagrangian function (6) and combining with (8) and (9), we have the dual of optimization problem (5) as follows.
By solving the above dual problem, we obtain the Lagrangian vector
Experiments
In the experiments, we compared the performance of the RRMM with the hinge loss SVM, the pinball loss SVM (PSVM), and the RMM on several publicly available data sets from the UCI machine learning repository.18 Numerous experiments were run on these datasets to assess the impact of noise on classification accuracy.
Most of the datasets we used do not actually contain noise, so we add attribute noise artificially: the attributes are corrupted by zero-mean Gaussian noise to construct a noisy training set, while the test set is kept clean. For each attribute, the ratio of the noise variance to the attribute variance, denoted r, is set to 0 (i.e. noise-free), 0.03, and 0.1.
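The noise-injection procedure above can be sketched as follows; `add_attribute_noise` is a hypothetical helper name, and the variance ratio r matches the definition in the text (noise variance = r times the per-attribute variance):

```python
import numpy as np

def add_attribute_noise(X, r, rng=None):
    """Corrupt each attribute (column) of X with zero-mean Gaussian noise
    whose variance is r times that attribute's sample variance.

    r = 0 leaves X unchanged (the noise-free setting in the text).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)  # per-attribute standard deviation
    noise = rng.normal(0.0, 1.0, size=X.shape) * np.sqrt(r) * std
    return X + noise
```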
In each algorithm, the optimal parameter C is searched from the set
The performance of these algorithms depends heavily on the choice of parameters. To select them, we tune the parameters by tenfold cross validation: the data set is randomly split into 10 subsets; one subset is held out as a test set and the remaining nine form the training set. This process is repeated 10 times, and the average of the 10 test results is used as the performance measure. While keeping the proportion of examples in each class constant, the parameters corresponding to the lowest test errors are used to predict the corresponding test set. On the other hand, the dual formulation of the RRMM has
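The stratified tenfold split described above can be sketched as follows; `stratified_kfold_indices` is a hypothetical helper, a minimal sketch of the protocol rather than the authors' code:

```python
import numpy as np

def stratified_kfold_indices(y, k=10, rng=None):
    """Split sample indices into k folds while keeping the class
    proportions roughly constant in each fold (stratified k-fold).
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    folds = [[] for _ in range(k)]
    # Shuffle each class separately and deal its indices round-robin
    # across the folds, so every fold gets its share of each class.
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return [np.array(sorted(f)) for f in folds]
```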
Experimental comparison of SVM, RMM, PSVM, and RRMM on classification accuracy and standard deviation.
Conclusions and discussion
In this article, we introduced the pinball loss into the relative margin framework, resulting in the robust relative margin machine (RRMM), and derived its dual formulation. Compared with the hinge loss RMM, the major advantage of the proposed method is that the RRMM is less sensitive to noise, especially feature noise around the decision boundary.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Natural Science Foundation of China (No. 61170174) and Tianjin Training plan of University Innovation Team (No.TD 12-5016).
