In mathematics, a smooth maximum of an indexed family $x_1, \ldots, x_n$ of numbers is a smooth approximation to the maximum function $\max(x_1, \ldots, x_n)$: a parametric family of functions $m_\alpha(x_1, \ldots, x_n)$ such that for every $\alpha$ the function $m_\alpha$ is smooth, and the family converges to the maximum function, $m_\alpha \to \max$, as $\alpha \to \infty$. The concept of a smooth minimum is similarly defined. In many cases, a single family approximates both: the maximum as the parameter goes to positive infinity, the minimum as the parameter goes to negative infinity; in symbols, $m_\alpha \to \max$ as $\alpha \to \infty$ and $m_\alpha \to \min$ as $\alpha \to -\infty$. The term can also be used loosely for a specific smooth function that behaves similarly to a maximum, without necessarily being part of a parametrized family.
Examples

Boltzmann operator

[Figure: smooth maximum of $(-x, x)$ versus $x$ for various parameter values; very smooth for $\alpha = 0.5$, sharper for $\alpha = 8$.]

For large positive values of the parameter $\alpha > 0$, the following formulation is a smooth, differentiable approximation of the maximum function. For negative values of the parameter that are large in absolute value, it approximates the minimum.
$$\mathcal{S}_\alpha(x_1, \ldots, x_n) = \frac{\sum_{i=1}^n x_i e^{\alpha x_i}}{\sum_{i=1}^n e^{\alpha x_i}}$$

$\mathcal{S}_\alpha$ has the following properties:
- $\mathcal{S}_\alpha \to \max$ as $\alpha \to \infty$
- $\mathcal{S}_0$ is the arithmetic mean of its inputs
- $\mathcal{S}_\alpha \to \min$ as $\alpha \to -\infty$

The gradient of $\mathcal{S}_\alpha$ is closely related to softmax and is given by
$$\nabla_{x_i} \mathcal{S}_\alpha(x_1, \ldots, x_n) = \frac{e^{\alpha x_i}}{\sum_{j=1}^n e^{\alpha x_j}} \left[ 1 + \alpha \left( x_i - \mathcal{S}_\alpha(x_1, \ldots, x_n) \right) \right].$$

This smooth gradient makes $\mathcal{S}_\alpha$ useful in optimization techniques that use gradient descent.
This operator is sometimes called the Boltzmann operator,[1] after the Boltzmann distribution .
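As an illustration, here is a minimal NumPy sketch of the Boltzmann operator and its gradient; the function names and the shift by the largest exponent (a standard overflow guard, which cancels in the ratio) are implementation choices, not part of the definition:

```python
import numpy as np

def boltzmann_operator(x, alpha):
    """S_alpha(x): smooth approximation of max (alpha >> 0) or min
    (alpha << 0); S_0 is the arithmetic mean of the inputs."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    z -= z.max()              # stability shift; cancels in the ratio below
    w = np.exp(z)
    return np.sum(x * w) / np.sum(w)

def boltzmann_gradient(x, alpha):
    """Gradient of S_alpha: softmax weights times the bracketed correction
    [1 + alpha * (x_i - S_alpha(x))]."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    z -= z.max()
    w = np.exp(z)
    softmax = w / w.sum()
    s = np.sum(x * softmax)   # S_alpha(x), reusing the same weights
    return softmax * (1.0 + alpha * (x - s))

x = [1.0, 2.0, 3.0]
print(boltzmann_operator(x, 50.0))   # ~3.0 (max-like)
print(boltzmann_operator(x, 0.0))    # 2.0 (arithmetic mean)
print(boltzmann_operator(x, -50.0))  # ~1.0 (min-like)
```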
LogSumExp

Another smooth maximum is LogSumExp:
$$\mathrm{LSE}_\alpha(x_1, \ldots, x_n) = \frac{1}{\alpha} \log \sum_{i=1}^n \exp(\alpha x_i)$$

This can also be normalized if the $x_i$ are all non-negative, yielding a function with domain $[0, \infty)^n$ and range $[0, \infty)$:
$$g(x_1, \ldots, x_n) = \log\left( \sum_{i=1}^n \exp(x_i) - (n - 1) \right)$$

The $(n-1)$ term corrects for the fact that $\exp(0) = 1$ by canceling out all but one zero exponential, so that $g = \log 1 = 0$ when all $x_i$ are zero.
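A short sketch of both variants, assuming NumPy; the max-shift in the first function is the standard stabilization trick, and the helper names are illustrative:

```python
import numpy as np

def logsumexp_max(x, alpha):
    """LSE_alpha(x) = (1/alpha) * log(sum_i exp(alpha * x_i)), computed
    with the usual max-shift to avoid overflow."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    m = z.max()
    return (m + np.log(np.sum(np.exp(z - m)))) / alpha

def normalized_lse(x):
    """g(x) = log(sum_i exp(x_i) - (n - 1)) for non-negative inputs;
    the (n - 1) term cancels the extra exp(0) = 1 contributions, so
    g(0, ..., 0) = log(1) = 0."""
    x = np.asarray(x, dtype=float)
    return np.log(np.sum(np.exp(x)) - (len(x) - 1))

x = [1.0, 2.0, 3.0]
print(logsumexp_max(x, 100.0))          # ~3.0; for alpha > 0 the over-
                                        # estimate is at most log(n)/alpha
print(normalized_lse([0.0, 0.0, 0.0]))  # 0.0
```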
Mellowmax

The mellowmax operator[1] is defined as follows:
$$\mathrm{mm}_\alpha(x) = \frac{1}{\alpha} \log \frac{1}{n} \sum_{i=1}^n \exp(\alpha x_i)$$

It is a non-expansive operator. As $\alpha \to \infty$, it acts like a maximum. As $\alpha \to 0$, it acts like an arithmetic mean. As $\alpha \to -\infty$, it acts like a minimum. This operator can be viewed as a particular instantiation of the quasi-arithmetic mean. It can also be derived from information-theoretic principles as a way of regularizing policies with a cost function defined by KL divergence. The operator has previously been utilized in other areas, such as power engineering.[2]
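Since $\mathrm{mm}_\alpha(x) = \mathrm{LSE}_\alpha(x) - \tfrac{\log n}{\alpha}$ by the definitions above, the same max-shift stabilization applies. A minimal sketch demonstrating the three limiting behaviors (the function name is illustrative):

```python
import numpy as np

def mellowmax(x, alpha):
    """mm_alpha(x) = (1/alpha) * log((1/n) * sum_i exp(alpha * x_i)),
    computed with a max-shift for numerical stability."""
    x = np.asarray(x, dtype=float)
    z = alpha * x
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / alpha

x = [1.0, 2.0, 3.0]
print(mellowmax(x, 100.0))   # ~3.0 (max-like)
print(mellowmax(x, 1e-6))    # ~2.0 (mean-like in the alpha -> 0 limit)
print(mellowmax(x, -100.0))  # ~1.0 (min-like)
```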
p-Norm

Another smooth maximum is the p-norm:
$$\|(x_1, \ldots, x_n)\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$$

which converges to $\|(x_1, \ldots, x_n)\|_\infty = \max_{1 \le i \le n} |x_i|$ as $p \to \infty$.
An advantage of the p-norm is that it is a norm. As such it is scale invariant (homogeneous), $\|(\lambda x_1, \ldots, \lambda x_n)\|_p = |\lambda| \cdot \|(x_1, \ldots, x_n)\|_p$, and it satisfies the triangle inequality.
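A small sketch illustrating the convergence to $\|\cdot\|_\infty$ and the homogeneity property; factoring out the largest $|x_i|$ is a numerical-stability choice, not part of the definition:

```python
import numpy as np

def p_norm(x, p):
    """||x||_p = (sum_i |x_i|^p)^(1/p); converges to max_i |x_i| from
    above as p -> infinity. Factoring out max|x_i| avoids overflow."""
    x = np.abs(np.asarray(x, dtype=float))
    m = x.max()
    if m == 0.0:
        return 0.0
    return m * np.sum((x / m) ** p) ** (1.0 / p)

x = [1.0, -2.0, 3.0]
for p in (1, 2, 10, 100):
    print(p, p_norm(x, p))  # decreases toward max|x_i| = 3 as p grows

# Homogeneity: ||lambda * x||_p == |lambda| * ||x||_p
print(np.isclose(p_norm(np.multiply(x, -2.5), 3), 2.5 * p_norm(x, 3)))
```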
Smooth maximum unit

The following binary operator is called the Smooth Maximum Unit (SMU):[3]
$$\max_\varepsilon(a, b) = \frac{a + b + |a - b|_\varepsilon}{2} = \frac{a + b + \sqrt{(a - b)^2 + \varepsilon}}{2}$$

where $\varepsilon \ge 0$ is a parameter. As $\varepsilon \to 0$, $|\cdot|_\varepsilon \to |\cdot|$ and thus $\max_\varepsilon \to \max$.
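A one-line sketch (the function name is illustrative) showing $\max_\varepsilon$ approaching the exact maximum as $\varepsilon \to 0$:

```python
import numpy as np

def smu_max(a, b, eps):
    """Smooth Maximum Unit: (a + b + sqrt((a - b)^2 + eps)) / 2.
    Smooth for eps > 0; recovers exact max(a, b) at eps = 0."""
    return 0.5 * (a + b + np.sqrt((a - b) ** 2 + eps))

for eps in (1.0, 0.01, 0.0):
    print(eps, smu_max(-1.0, 2.0, eps))  # approaches max(-1, 2) = 2
```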