Hinge loss
In machine learning, a loss function used for maximum-margin classification

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

$\ell(y) = \max(0, 1 - t \cdot y)$

Note that $y$ should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, $y = \mathbf{w} \cdot \mathbf{x} + b$, where $(\mathbf{w}, b)$ are the parameters of the hyperplane and $\mathbf{x}$ is the input vector.

When $t$ and $y$ have the same sign (meaning $y$ predicts the right class) and $|y| \geq 1$, the hinge loss $\ell(y) = 0$. When they have opposite signs, $\ell(y)$ increases linearly with $|y|$ (the farther the score lies on the wrong side, the larger the loss), and the loss is likewise positive when $|y| < 1$, even if $t$ and $y$ have the same sign (a correct prediction, but not by enough margin).
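As a concrete illustration, here is a minimal Python sketch of the binary hinge loss applied to a few raw scores; the function name and the example values are purely illustrative.

    def hinge_loss(y, t):
        """Binary hinge loss for a raw classifier score y and a target t in {-1, +1}."""
        return max(0.0, 1.0 - t * y)

    # Zero loss only when the prediction is correct with margin at least 1.
    print(hinge_loss(2.3, +1))   # 0.0 -- correct and outside the margin
    print(hinge_loss(0.4, +1))   # 0.6 -- correct sign, but inside the margin
    print(hinge_loss(-1.5, +1))  # 2.5 -- wrong sign: loss grows linearly with |y|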


Extensions

While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]

$\ell(y) = \max(0, 1 + \max_{y \neq t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x})$,

where $t$ is the target label, and $\mathbf{w}_t$ and $\mathbf{w}_y$ are the model parameters.
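A minimal NumPy sketch of this definition for a linear model; the weight-matrix layout (one row per class) and the function name are illustrative assumptions, not part of the original formulation.

    import numpy as np

    def crammer_singer_hinge(W, x, t):
        """Crammer-Singer multiclass hinge loss for a linear classifier.

        W : (num_classes, num_features) weight matrix, one row w_y per class
        x : (num_features,) input vector
        t : index of the target class
        """
        scores = W @ x                             # w_y . x for every class y
        best_wrong = np.max(np.delete(scores, t))  # max over y != t
        return max(0.0, 1.0 + best_wrong - scores[t])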

Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][7]

$\ell(y) = \sum_{y \neq t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x})$.
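For comparison, a sketch of the Weston–Watkins variant under the same illustrative conventions (hypothetical weight matrix W, target index t); it sums a hinge term over every incorrect class instead of keeping only the worst one.

    import numpy as np

    def weston_watkins_hinge(W, x, t):
        """Weston-Watkins multiclass hinge loss: one hinge term per wrong class."""
        scores = W @ x
        margins = 1.0 + scores - scores[t]   # 1 + w_y . x - w_t . x for each class y
        margins[t] = 0.0                     # drop the y = t term from the sum
        return float(np.sum(np.maximum(0.0, margins)))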

In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where $\mathbf{w}$ denotes the SVM's parameters, $\mathbf{y}$ the SVM's predictions, $\phi$ the joint feature function, and $\Delta$ the Hamming loss:

$\begin{aligned}\ell(\mathbf{y}) &= \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\ &= \max\bigl(0, \max_{\mathbf{y} \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle \bigr).\end{aligned}$
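The following toy sketch spells out the margin-rescaled variant for an output space small enough to enumerate; the joint feature map phi, the candidate set, and the Hamming-distance helper are illustrative assumptions rather than a definitive structured-SVM implementation.

    import numpy as np

    def hamming_loss(y, t):
        """Delta(y, t): number of positions where two label sequences disagree."""
        return sum(a != b for a, b in zip(y, t))

    def structured_hinge(w, phi, x, t, candidates):
        """Margin-rescaled structured hinge loss.

        w          : (d,) parameter vector
        phi        : joint feature map, phi(x, y) -> (d,) array
        t          : gold-standard structured output
        candidates : iterable over the output space Y (assumed enumerable here)
        """
        gold_score = np.dot(w, phi(x, t))
        worst = max(hamming_loss(y, t) + np.dot(w, phi(x, y)) for y in candidates)
        return max(0.0, worst - gold_score)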

Optimization

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to model parameters $\mathbf{w}$ of a linear SVM with score function $y = \mathbf{w} \cdot \mathbf{x}$ that is given by

$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise}. \end{cases}$
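One way this subgradient is typically used is a plain stochastic subgradient step on a single example; the learning-rate value and the unregularized update below are illustrative assumptions.

    import numpy as np

    def hinge_subgradient_step(w, x, t, lr=0.1):
        """One subgradient-descent step on the hinge loss for a single example (x, t)."""
        y = np.dot(w, x)              # raw score y = w . x
        if t * y < 1:                 # margin violated: subgradient is -t * x
            w = w - lr * (-t * x)
        # otherwise the subgradient is zero and w is left unchanged
        return w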

However, since the derivative of the hinge loss at $ty = 1$ is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[8]

$\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \leq 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \leq ty \end{cases}$
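Transcribed directly as a Python sketch (the function name is illustrative):

    def smoothed_hinge(y, t):
        """Rennie and Srebro's smoothed hinge loss, as a function of the margin ty."""
        z = t * y
        if z <= 0:
            return 0.5 - z
        if z < 1:
            return 0.5 * (1.0 - z) ** 2
        return 0.0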

or the quadratically smoothed

$\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if } ty \geq 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases}$

suggested by Zhang.[9] The modified Huber loss $L$ is a special case of this loss function with $\gamma = 2$, specifically $L(t, y) = 4\ell_2(y)$.
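Below is a sketch of the quadratically smoothed loss, together with a numerical check of the stated relation to the modified Huber loss at gamma = 2; the piecewise form of modified Huber used in the check is the standard one and is written out here as an assumption.

    def quad_smoothed_hinge(y, t, gamma):
        """Zhang's quadratically smoothed hinge loss."""
        z = t * y
        if z >= 1.0 - gamma:
            return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
        return 1.0 - gamma / 2.0 - z

    def modified_huber(y, t):
        """Modified Huber loss (standard piecewise form, assumed for this check)."""
        z = t * y
        return max(0.0, 1.0 - z) ** 2 if z >= -1.0 else -4.0 * z

    # L(t, y) = 4 * ell_2(y) should hold for every margin value.
    for z in (-3.0, -1.0, 0.0, 0.5, 1.0, 2.0):
        assert abs(modified_huber(z, 1) - 4.0 * quad_smoothed_hinge(z, 1, gamma=2.0)) < 1e-12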


References

  1. Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?" (PDF). Neural Computation. 16 (5): 1063–1076. CiteSeerX 10.1.1.109.6786. doi:10.1162/089976604773135104. PMID 15070510. http://web.mit.edu/lrosasco/www/publications/loss.pdf

  2. Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study" (PDF). Multiple Classifier Systems. LNCS. Vol. 3541. pp. 278–285. CiteSeerX 10.1.1.110.6789. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7.

  3. Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification" (PDF). Journal of Machine Learning Research. 17: 1–32. http://www.jmlr.org/papers/volume17/11-229/11-229.pdf

  4. Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines" (PDF). Journal of Machine Learning Research. 2: 265–292. http://jmlr.csail.mit.edu/papers/volume2/crammer01a/crammer01a.pdf

  5. Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models" (PDF). Proc. Symp. on Machine Learning in Speech and Language Processing. Archived from the original (PDF) on 2017-08-28. Retrieved 2013-10-23. https://web.archive.org/web/20170828233715/http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf

  6. Weston, Jason; Watkins, Chris (1999). "Support Vector Machines for Multi-Class Pattern Recognition" (PDF). European Symposium on Artificial Neural Networks. Archived from the original (PDF) on 2018-05-05. Retrieved 2017-03-01. https://web.archive.org/web/20180505024710/https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf

  7. Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification" (PDF). Journal of Machine Learning Research. 17: 1–32. http://www.jmlr.org/papers/volume17/11-229/11-229.pdf

  8. Rennie, Jason D. M.; Srebro, Nathan (2005). Loss Functions for Preference Levels: Regression with Discrete Ordered Labels (PDF). Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling. http://ttic.uchicago.edu/~nati/Publications/RennieSrebroIJCAI05.pdf

  9. Zhang, Tong (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms (PDF). ICML. http://tongzhang-ml.org/papers/icml04-stograd.pdf