For large positive values of the parameter α > 0, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum:

S_α(x_1, …, x_n) = (∑_{i=1}^{n} x_i e^{α x_i}) / (∑_{i=1}^{n} e^{α x_i})
S_α has the following properties:

- S_α → max as α → ∞
- S_0 is the arithmetic mean of its inputs
- S_α → min as α → −∞
The gradient of S_α is closely related to softmax and is given by

∇_{x_i} S_α(x_1, …, x_n) = (e^{α x_i} / ∑_{j=1}^{n} e^{α x_j}) · [1 + α (x_i − S_α(x_1, …, x_n))].
This makes the smooth maximum function useful for optimization techniques that use gradient descent.
This operator is sometimes called the Boltzmann operator,[1] after the Boltzmann distribution.
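A minimal Python sketch of the Boltzmann operator and its gradient (the function names are mine, not from the literature):

```python
import math

def boltzmann(xs, alpha):
    """Smooth maximum S_alpha: the exp(alpha*x)-weighted average of the inputs.
    Large alpha > 0 approximates max; large negative alpha approximates min;
    alpha = 0 gives the arithmetic mean."""
    weights = [math.exp(alpha * x) for x in xs]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)

def boltzmann_grad(xs, alpha):
    """Partial derivatives of S_alpha: each component is a softmax weight
    times the correction factor 1 + alpha * (x_i - S_alpha)."""
    s = boltzmann(xs, alpha)
    weights = [math.exp(alpha * x) for x in xs]
    total = sum(weights)
    return [(w / total) * (1 + alpha * (x - s)) for w, x in zip(weights, xs)]
```

For large α the weights concentrate on the largest input, so the result approaches max(xs). Note that exp(α·x) can overflow for large arguments; production code would subtract max(xs) from every input first.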
Main article: LogSumExp
Another smooth maximum is LogSumExp:

LSE_α(x_1, …, x_n) = (1/α) log(e^{α x_1} + ⋯ + e^{α x_n})
This can also be normalized if the x_i are all non-negative, yielding a function with domain [0, ∞)^n and range [0, ∞):

g(x_1, …, x_n) = log(e^{x_1} + ⋯ + e^{x_n} − (n − 1))
The (n − 1) term corrects for the fact that exp(0) = 1 by canceling out all but one zero exponential, and log 1 = 0 if all x_i are zero.
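In Python this might look as follows (a sketch; the max-shift for numerical stability assumes α > 0, and the function names are mine):

```python
import math

def logsumexp(xs, alpha=1.0):
    """LSE_alpha(x) = (1/alpha) * log(sum_i exp(alpha * x_i)).
    Shifting by the maximum avoids overflow (valid for alpha > 0)."""
    m = max(xs)
    return m + math.log(sum(math.exp(alpha * (x - m)) for x in xs)) / alpha

def normalized_lse(xs):
    """log(sum_i exp(x_i) - (n - 1)) for non-negative inputs; the (n - 1)
    term cancels all but one exp(0) = 1, so the result is 0 at the origin."""
    return math.log(sum(math.exp(x) for x in xs) - (len(xs) - 1))
```

Note that LSE_α always over-estimates the maximum by at most log(n)/α, which is why the mellowmax normalization below divides by n.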
The mellowmax operator[2] is defined as follows:

mm_α(x_1, …, x_n) = (1/α) log((1/n) ∑_{i=1}^{n} e^{α x_i})
It is a non-expansive operator. As α → ∞, it acts like a maximum. As α → 0, it acts like an arithmetic mean. As α → −∞, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information theoretical principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.[3]
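A small Python sketch of mellowmax (the max-shift is a standard overflow guard for α > 0; the function name is mine):

```python
import math

def mellowmax(xs, alpha):
    """mm_alpha(x) = log((1/n) * sum_i exp(alpha * x_i)) / alpha.
    Shifting by the maximum keeps exp() from overflowing (assumes alpha > 0)."""
    m = max(xs)
    return m + math.log(sum(math.exp(alpha * (x - m)) for x in xs) / len(xs)) / alpha
```

Unlike plain LogSumExp, the 1/n normalization keeps mm_α between the minimum and maximum of the inputs, so it interpolates between min, mean, and max as α varies.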
Main article: P-norm
Another smooth maximum is the p-norm:

‖(x_1, …, x_n)‖_p = (∑_{i=1}^{n} |x_i|^p)^{1/p}
which converges to ‖(x_1, …, x_n)‖_∞ = max_{1 ≤ i ≤ n} |x_i| as p → ∞.
An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous): ‖(λx_1, …, λx_n)‖_p = |λ| · ‖(x_1, …, x_n)‖_p, and it satisfies the triangle inequality.
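As an illustrative Python sketch (factoring out the largest magnitude is a common overflow guard; the function name is mine):

```python
def p_norm(xs, p):
    """(sum_i |x_i|^p)^(1/p); tends to the max-norm max_i |x_i| as p grows."""
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    # factor out the largest magnitude so (|x|/m)**p stays in [0, 1]
    return m * sum((abs(x) / m) ** p for x in xs) ** (1.0 / p)
```

Homogeneity is easy to check numerically: scaling every input by λ scales the result by |λ|.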
The following binary operator is called the Smooth Maximum Unit (SMU):[4]

max_ε(a, b) = (a + b + |a − b|_ε) / 2, with |x|_ε = √(x² + ε),
where ε ≥ 0 is a parameter. As ε → 0, |·|_ε → |·| and thus max_ε → max.
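A Python sketch of SMU as a binary smooth maximum, assuming the common smoothing |x|_ε = √(x² + ε) (the function name is mine):

```python
import math

def smu_max(a, b, eps):
    """Smooth Maximum Unit: (a + b + |a - b|_eps) / 2,
    where |x|_eps = sqrt(x*x + eps) smooths the absolute value.
    eps = 0 recovers the exact maximum."""
    return (a + b + math.sqrt((a - b) ** 2 + eps)) / 2.0
```

For ε > 0 the smoothed value slightly exceeds the true maximum, since √(x² + ε) > |x|; the gap shrinks as ε → 0.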
https://www.johndcook.com/soft_maximum.pdf
M. Lange, D. Zühlke, O. Holz, and T. Villmann, "Applications of lp-norms and their smooth approximations for gradient based learning vector quantization," in Proc. ESANN, Apr. 2014, pp. 271-276. (https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2014-153.pdf)
Asadi, Kavosh; Littman, Michael L. (2017). "An Alternative Softmax Operator for Reinforcement Learning". PMLR. 70: 243–252. arXiv:1612.05628. Retrieved January 6, 2023.
Safak, Aysel (February 1993). "Statistical analysis of the power sum of multiple correlated log-normal components". IEEE Transactions on Vehicular Technology. 42 (1): 58–61. doi:10.1109/25.192387. Retrieved January 6, 2023. https://ieeexplore.ieee.org/document/192387
Biswas, Koushik; Kumar, Sandeep; Banerjee, Shilpak; Pandey, Ashish Kumar (2021). "SMU: Smooth activation function for deep networks using smoothing maximum technique". arXiv:2111.04682 [cs.LG].