softmax

Softmax is a function that converts a vector of real numbers into a probability distribution over discrete classes. For a K-dimensional input z, the i-th output is sigma_i(z) = exp(z_i) / sum_{k=1}^K exp(z_k). The resulting values are nonnegative and sum to 1, making softmax a common final activation in multi-class classification. Softmax generalizes the logistic function to higher dimensions; the two-class case reduces to the logistic (sigmoid) function.
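
As a concrete illustration of the definition, the following is a minimal NumPy sketch of the formula above (the naive form; the numerically stable variant is discussed below):

```python
import numpy as np

def softmax_naive(z):
    """Direct transcription of sigma_i(z) = exp(z_i) / sum_k exp(z_k)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z)
    return e / e.sum()

z = [1.0, 2.0, 0.5]
p = softmax_naive(z)
print(p)        # approximately [0.231 0.629 0.140], all nonnegative
print(p.sum())  # 1.0 up to floating-point rounding
```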

A common variant introduces a temperature parameter T: sigma_i(z; T) = exp(z_i / T) / sum_k exp(z_k / T). Higher T produces a softer distribution, while lower T makes it more peaked.
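
A short sketch of the temperature variant, showing the softening and sharpening effect on the same logits (the function name is illustrative):

```python
import numpy as np

def softmax_temperature(z, T=1.0):
    """sigma_i(z; T) = exp(z_i / T) / sum_k exp(z_k / T)."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())           # subtract the max for numerical safety
    return e / e.sum()

z = [2.0, 1.0, 0.1]
print(softmax_temperature(z, T=0.5))  # low T: sharply peaked on the largest logit
print(softmax_temperature(z, T=1.0))  # standard softmax
print(softmax_temperature(z, T=5.0))  # high T: closer to uniform
```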

Softmax is differentiable and its Jacobian has the form ∂sigma_i/∂z_j = sigma_i(z) (δ_ij − sigma_j(z)). In conjunction with cross-entropy loss, this yields a convenient gradient: ∂L/∂z = p − y, where p = sigma(z) and y is the target distribution.

Numerical stability is important in practice. A standard trick is to subtract the maximum input: sigma_i(z) = exp(z_i − max_j z_j) / sum_k exp(z_k − max_j z_j). Implementations often use the log-sum-exp technique to maintain precision.
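
The following sketch ties the two preceding paragraphs together: a max-subtracted softmax, cross-entropy computed via log-sum-exp, and a finite-difference check that the gradient is p − y. The helper names are illustrative, not from any particular library:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    """Cross-entropy between softmax(z) and a target distribution y,
    computed via log-sum-exp so that large logits do not overflow."""
    z = np.asarray(z, dtype=float)
    log_sum_exp = z.max() + np.log(np.exp(z - z.max()).sum())
    log_p = z - log_sum_exp
    return -(y * log_p).sum()

z = np.array([5.0, 2.0, -1.0])
y = np.array([0.0, 1.0, 0.0])   # one-hot target

p = softmax(z)
grad = p - y                     # analytic gradient of the loss w.r.t. z

# Central-difference check of the analytic gradient.
eps = 1e-6
num_grad = np.array([
    (cross_entropy(z + eps * np.eye(3)[j], y)
     - cross_entropy(z - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(grad, num_grad, atol=1e-5))   # True
```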

Key properties include invariance to adding a constant to all inputs: since exp(z_i + c) = exp(c) exp(z_i), every term is scaled by the same factor, which cancels in the ratio.
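
A quick numerical check of this shift invariance, reusing the stable softmax from the sketch above:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.3, -1.2, 2.5])
c = 100.0
print(np.allclose(softmax(z), softmax(z + c)))   # True: the constant cancels
```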

Softmax is widely used as the final activation in neural networks for multi-class classification, in attention mechanisms to produce probability weights, and wherever a probability distribution over categories is required.
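
As an illustration of the attention use mentioned above, the following sketches scaled dot-product attention, in which softmax converts similarity scores into probability weights over the values; the shapes and names are illustrative rather than a specific library's API:

```python
import numpy as np

def softmax(z, axis=-1):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity scores
    weights = softmax(scores, axis=-1)   # each row is a probability distribution
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries of dimension 4
K = rng.normal(size=(3, 4))   # 3 keys of dimension 4
V = rng.normal(size=(3, 5))   # 3 values of dimension 5

out, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=1))    # each row of attention weights sums to 1
```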

Limitations include potential miscalibration of predicted probabilities and sensitivity to input scale; appropriate loss functions and regularization help mitigate these issues.
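
One concrete example of such regularization, assuming the sentence above is taken to include target-side techniques, is label smoothing, which mixes a one-hot target with the uniform distribution before applying cross-entropy. The helper below is a hypothetical sketch:

```python
import numpy as np

def smooth_labels(y_onehot, epsilon=0.1):
    """Label smoothing: replace the one-hot target with
    (1 - epsilon) * y + epsilon / K, where K is the number of classes."""
    K = y_onehot.shape[-1]
    return (1.0 - epsilon) * y_onehot + epsilon / K

y = np.array([0.0, 1.0, 0.0])
print(smooth_labels(y))   # approximately [0.0333 0.9333 0.0333]
```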