Loss Function | Cross-Entropy Loss Function
Xiao Feiyu
This article discusses the Cross-Entropy loss function, commonly used in classification tasks. But why is it so effective for classification? Let’s begin with a simple classification example.
1. Image Classification Task
We aim to predict an animal’s category—cat, dog, or pig—based on features such as its outline and color in an image. Suppose we currently have two models (with different parameters), both outputting probability values for each class via sigmoid or softmax:
Model 1:
| Prediction (cat, dog, pig) | Ground Truth | Correct? |
|---|---|---|
| 0.3 0.3 0.4 | 0 0 1 (pig) | Correct |
| 0.3 0.4 0.3 | 0 1 0 (dog) | Correct |
| 0.1 0.2 0.7 | 1 0 0 (cat) | Incorrect |
Model 1 correctly classifies samples 1 and 2 by only a very narrow margin, while misclassifying sample 3 completely.
Model 2:
| Prediction (cat, dog, pig) | Ground Truth | Correct? |
|---|---|---|
| 0.1 0.2 0.7 | 0 0 1 (pig) | Correct |
| 0.1 0.7 0.2 | 0 1 0 (dog) | Correct |
| 0.3 0.4 0.3 | 1 0 0 (cat) | Incorrect |
Model 2 classifies samples 1 and 2 very accurately, and although it misclassifies sample 3, its prediction is relatively close—not wildly off.
Now that we have our models, we need to define a loss function to evaluate their performance on these samples. So what loss functions could we define?
1.1 Classification Error (Classification Error Rate)
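The most straightforward definition of a loss function is the fraction of misclassified samples:
$$\text{classification error} = \frac{\text{number of misclassified samples}}{\text{total number of samples}}$$
Model 1:
$$\text{classification error} = \frac{1}{3}$$
Model 2:
$$\text{classification error} = \frac{1}{3}$$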
We know that although both Model 1 and Model 2 misclassify one sample, Model 2 performs comparatively better—and its loss value should therefore be smaller. Unfortunately, the metric fails to reflect this distinction. While intuitive, this loss function thus performs poorly.
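The comparison above can be reproduced with a short sketch. It assumes the class order (cat, dog, pig) implied by the ground-truth labels in the tables and takes the most probable class as the prediction; the helper names `argmax` and `classification_error` are illustrative only.

```python
# Predictions and one-hot labels from the two tables (class order: cat, dog, pig).
y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
y_pred_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]


def argmax(values):
    """Index of the largest entry (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def classification_error(y_true, y_pred):
    """Fraction of samples whose most probable class differs from the label."""
    wrong = sum(argmax(p) != argmax(t) for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


err_1 = classification_error(y_true, y_pred_1)
err_2 = classification_error(y_true, y_pred_2)
print(err_1, err_2)  # both models misclassify exactly one of three samples
```

Both values are equal, which is exactly why this metric cannot tell the two models apart.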
1.2 Mean Squared Error (MSE)
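Mean Squared Error is another commonly used loss function, defined as (the squared difference for each sample is summed over its classes):
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$
Model 1:
$$\begin{aligned}
\text{sample 1 loss} &= (0.3-0)^2 + (0.3-0)^2 + (0.4-1)^2 = 0.54 \\
\text{sample 2 loss} &= (0.3-0)^2 + (0.4-1)^2 + (0.3-0)^2 = 0.54 \\
\text{sample 3 loss} &= (0.1-1)^2 + (0.2-0)^2 + (0.7-0)^2 = 1.34
\end{aligned}$$
Averaging over all samples:
$$\text{MSE} = \frac{0.54 + 0.54 + 1.34}{3} \approx 0.81$$
Model 2:
$$\begin{aligned}
\text{sample 1 loss} &= (0.1-0)^2 + (0.2-0)^2 + (0.7-1)^2 = 0.14 \\
\text{sample 2 loss} &= (0.1-0)^2 + (0.7-1)^2 + (0.2-0)^2 = 0.14 \\
\text{sample 3 loss} &= (0.3-1)^2 + (0.4-0)^2 + (0.3-0)^2 = 0.74
\end{aligned}$$
Averaging over all samples:
$$\text{MSE} = \frac{0.14 + 0.14 + 0.74}{3} = 0.34$$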
We observe that MSE correctly identifies Model 2 as superior to Model 1. So why not adopt MSE as the loss function? The main reason is that, when sigmoid or softmax produces the probabilities, the MSE gradient carries the derivative of that activation as a factor, and this factor is close to zero whenever the output saturates. Gradient descent therefore learns extremely slowly in the early stages of training, when predictions can be confidently wrong.
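As a rough check of the MSE comparison, the sketch below recomputes it for the three samples of each model. It assumes one particular convention, namely that squared errors are summed over the three classes for each sample and then averaged over samples; the function name `mse` is illustrative.

```python
y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
y_pred_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]


def mse(y_true, y_pred):
    """Squared error summed over classes per sample, then averaged over samples."""
    per_sample = [
        sum((p - t) ** 2 for t, p in zip(labels, probs))
        for labels, probs in zip(y_true, y_pred)
    ]
    return sum(per_sample) / len(per_sample)


mse_1 = mse(y_true, y_pred_1)
mse_2 = mse(y_true, y_pred_2)
print(round(mse_1, 2), round(mse_2, 2))
```

Under this convention Model 2 receives the smaller loss, in line with the intuition that its mistake is less severe.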
From the above intuitive analysis, it becomes clear that neither Classification Error nor MSE serves well as a loss function for classification tasks. Next, let’s examine how the Cross-Entropy loss function performs.
1.3 Cross-Entropy Loss Function
1.3.1 Mathematical Expression
(1) Binary Classification
In binary classification, the model must predict one of two outcomes. The predicted probabilities of the two classes are $p$ and $1-p$. The expression (with logarithms to the natural base $e$) is:
$$L = \frac{1}{N}\sum_{i} L_i = \frac{1}{N}\sum_{i} -\left[y_i \log(p_i) + (1-y_i)\log(1-p_i)\right]$$
Where:
- $y_i$: ground-truth label for sample $i$; 1 for the positive class, 0 for the negative class
- $p_i$: predicted probability that sample $i$ belongs to the positive class
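A minimal sketch of the binary cross-entropy, written directly from its standard definition $-[y\log p + (1-y)\log(1-p)]$ averaged over samples. The labels and probabilities are made-up illustration values, and the function assumes every probability lies strictly between 0 and 1 (no clipping is applied).

```python
import math


def binary_cross_entropy(y_true, p_pred):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over samples, natural logarithm."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, p_pred)
    ]
    return sum(losses) / len(losses)


# Hypothetical labels and predicted positive-class probabilities.
labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]   # close to the labels
hesitant = [0.6, 0.4, 0.55]   # correct side of 0.5, but barely

loss_confident = binary_cross_entropy(labels, confident)
loss_hesitant = binary_cross_entropy(labels, hesitant)
print(loss_confident, loss_hesitant)
```

Both sets of predictions classify every sample correctly at a 0.5 threshold, yet the hesitant one is penalized more heavily.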
(2) Multi-Class Classification
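Multi-class classification extends the binary case:
$$L = \frac{1}{N}\sum_{i} L_i = -\frac{1}{N}\sum_{i}\sum_{c=1}^{M} y_{ic}\log(p_{ic})$$
Where:
- $M$: number of classes
- $y_{ic}$: indicator (0 or 1): equals 1 if the true class of sample $i$ is $c$, else 0
- $p_{ic}$: predicted probability that sample $i$ belongs to class $c$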
Now, applying this formula to compute the loss values for our earlier examples:
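Model 1:
$$\begin{aligned}
\text{sample 1 loss} &= -(0\cdot\log 0.3 + 0\cdot\log 0.3 + 1\cdot\log 0.4) \approx 0.92 \\
\text{sample 2 loss} &= -(0\cdot\log 0.3 + 1\cdot\log 0.4 + 0\cdot\log 0.3) \approx 0.92 \\
\text{sample 3 loss} &= -(1\cdot\log 0.1 + 0\cdot\log 0.2 + 0\cdot\log 0.7) \approx 2.30
\end{aligned}$$
Averaging over all samples:
$$L = \frac{0.92 + 0.92 + 2.30}{3} \approx 1.38$$
Model 2:
$$\begin{aligned}
\text{sample 1 loss} &= -(0\cdot\log 0.1 + 0\cdot\log 0.2 + 1\cdot\log 0.7) \approx 0.36 \\
\text{sample 2 loss} &= -(0\cdot\log 0.1 + 1\cdot\log 0.7 + 0\cdot\log 0.2) \approx 0.36 \\
\text{sample 3 loss} &= -(1\cdot\log 0.3 + 0\cdot\log 0.4 + 0\cdot\log 0.3) \approx 1.20
\end{aligned}$$
Averaging over all samples:
$$L = \frac{0.36 + 0.36 + 1.20}{3} \approx 0.64$$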
The above computation can be implemented using Python’s sklearn library:
```python
from sklearn.metrics import log_loss

# One-hot ground truth and predicted probabilities (class order: cat, dog, pig)
y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
y_pred_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]

print(log_loss(y_true, y_pred_1))  # 1.3783888522474517
print(log_loss(y_true, y_pred_2))  # 0.6391075640678003
```
As observed, the cross-entropy loss function effectively captures the performance difference between Model 1 and Model 2.
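For readers without sklearn, the same two numbers can be obtained from the multi-class formula itself. This is a sketch using only the standard library; it assumes all predicted probabilities are strictly positive, as they are in this example, and the function name `cross_entropy` is illustrative.

```python
import math

y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
y_pred_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]


def cross_entropy(y_true, y_pred):
    """-sum_c y_c * log(p_c) per sample, averaged over samples."""
    per_sample = [
        -sum(t * math.log(p) for t, p in zip(labels, probs))
        for labels, probs in zip(y_true, y_pred)
    ]
    return sum(per_sample) / len(per_sample)


ce_1 = cross_entropy(y_true, y_pred_1)
ce_2 = cross_entropy(y_true, y_pred_2)
print(ce_1, ce_2)
```

Because the labels are one-hot, only the log-probability assigned to the true class contributes to each sample's loss.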
2. Function Properties
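For a linear model followed by sigmoid or softmax, this loss is convex in the weights, so any point where the derivative vanishes is a global optimum. (A deep network's loss is generally not convex in its weights.)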
3. Learning Process
The cross-entropy loss function is the usual choice for classification tasks, especially with neural networks. Because it operates on per-class probabilities, it almost always appears together with the sigmoid (or softmax) function.
Let us examine the complete prediction, loss computation, and learning pipeline using the final-layer output of a neural network:
- The neural network’s final layer produces raw scores (also called logits) for each class;
- These scores are passed through the sigmoid (or softmax) function, yielding probability outputs;
- The predicted class probabilities are then fed into the cross-entropy loss function together with the one-hot encoded true labels.
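The three steps above can be sketched in a few lines. The logits and the label are hypothetical values chosen for illustration, and subtracting the maximum score inside `softmax` is a common numerical-stability habit rather than something the text prescribes.

```python
import math


def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Per-sample loss: -sum_c y_c * log(p_c)."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [1.0, 2.0, 3.0]   # step 1: hypothetical final-layer scores
label = [0, 0, 1]          # one-hot true label (third class)

probs = softmax(logits)                # step 2: scores -> probabilities
loss = cross_entropy(label, probs)     # step 3: probabilities + label -> loss
print(probs, loss)
```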
Learning tasks fall into two categories: binary classification and multi-class classification. We discuss the learning process for both cases separately.
3.1 Binary Classification Case
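[Figure: binary cross-entropy loss function learning process]
As illustrated above, the derivative computation comprises three sub-processes, i.e., the product of three partial derivatives:
$$\frac{\partial L_i}{\partial w_i} = \frac{\partial L_i}{\partial p_i}\cdot \frac{\partial p_i}{\partial s_i}\cdot \frac{\partial s_i}{\partial w_i}$$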
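3.1.1 Computing the First Term: $\frac{\partial L_i}{\partial p_i}$
$$L_i = -\left[y_i \log(p_i) + (1-y_i)\log(1-p_i)\right]$$
- $p_i$ denotes the predicted probability that sample $i$ belongs to the positive class.
- $y_i$ is an indicator function: it equals 1 if sample $i$ belongs to the positive class, and 0 otherwise.
Differentiating with respect to $p_i$ gives:
$$\frac{\partial L_i}{\partial p_i} = -\frac{y_i}{p_i} + \frac{1-y_i}{1-p_i}$$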
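3.1.2 Computing the Second Term: $\frac{\partial p_i}{\partial s_i}$
This term represents the derivative of the sigmoid function with respect to the score. First, recall the definition of the sigmoid function and the quotient rule for differentiation:
$$p = \sigma(s) = \frac{e^{s}}{1+e^{s}}, \qquad \left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$$
Applying the quotient rule:
$$\begin{aligned}
\frac{\partial p_i}{\partial s_i} &= \frac{e^{s_i}(1+e^{s_i}) - e^{s_i}\cdot e^{s_i}}{(1+e^{s_i})^2} \\
&= \frac{e^{s_i}}{1+e^{s_i}}\cdot \frac{1}{1+e^{s_i}} \\
&= \sigma(s_i)\cdot [1-\sigma(s_i)]
\end{aligned}$$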
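The well-known identity $\sigma'(s) = \sigma(s)\,[1-\sigma(s)]$ can be sanity-checked against a central finite difference. This is only a numerical sketch; the step size `eps` and the test scores are arbitrary choices.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_grad(s):
    """Closed form: sigma(s) * (1 - sigma(s))."""
    p = sigmoid(s)
    return p * (1.0 - p)


def numeric_grad(f, s, eps=1e-6):
    """Central finite-difference approximation of f'(s)."""
    return (f(s + eps) - f(s - eps)) / (2 * eps)


scores = (-2.0, 0.0, 3.0)
gaps = [abs(sigmoid_grad(s) - numeric_grad(sigmoid, s)) for s in scores]
print(gaps)  # each gap should be tiny
```

The derivative peaks at 0.25 for a score of zero and shrinks toward zero as the score moves away from it in either direction.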
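3.1.3 Computing the Third Term: $\frac{\partial s_i}{\partial w_i}$
Generally, scores result from a linear transformation of the input; hence:
$$s_i = w_i \cdot x_i, \qquad \frac{\partial s_i}{\partial w_i} = x_i$$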
3.1.4 Final Result: $\frac{\partial L_i}{\partial w_i}$
$$\begin{aligned}
\frac{\partial L_i}{\partial w_i} &= \frac{\partial L_i}{\partial p_i}\cdot \frac{\partial p_i}{\partial s_i}\cdot \frac{\partial s_i}{\partial w_i} \\
&= \left[-\frac{y_i}{\sigma(s_i)}+\frac{1-y_i}{1-\sigma(s_i)}\right] \cdot \sigma(s_i)\cdot [1-\sigma(s_i)]\cdot x_i \\
&= \left[-\frac{y_i}{\sigma(s_i)}\cdot \sigma(s_i)\cdot (1-\sigma(s_i))+\frac{1-y_i}{1-\sigma(s_i)}\cdot \sigma(s_i)\cdot (1-\sigma(s_i))\right]\cdot x_i \\
&= [-y_i+y_i\cdot \sigma(s_i)+\sigma(s_i)-y_i\cdot \sigma(s_i)]\cdot x_i \\
&= [\sigma(s_i)-y_i]\cdot x_i
\end{aligned}$$
As shown above, we obtain an elegant result. Thus, using the cross-entropy loss function not only effectively measures model performance but also facilitates straightforward derivative computation.
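The closed form $[\sigma(s) - y]\cdot x$ can be checked numerically for a single weight. The sketch assumes the score is $s = w \cdot x$ with scalar $w$ and $x$; the concrete values of `w`, `x`, and `y` are hypothetical.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def loss(w, x, y):
    """Binary cross-entropy of one sample with score s = w * x."""
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def analytic_grad(w, x, y):
    """Closed form derived above: (sigma(s) - y) * x."""
    return (sigmoid(w * x) - y) * x


def numeric_grad(w, x, y, eps=1e-6):
    """Central finite difference of the loss with respect to w."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)


w, x, y = 0.5, 2.0, 1   # hypothetical weight, input, and label
g_analytic = analytic_grad(w, x, y)
g_numeric = numeric_grad(w, x, y)
print(g_analytic, g_numeric)
```

The negative sign for this positive sample means a gradient-descent step increases `w`, pushing the predicted probability toward 1.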
3.2 Multiclass Case
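[Figure: learning process of the multiclass cross-entropy loss function]
As illustrated in the figure above, the differentiation process can be decomposed into three sub-processes:
$$\frac{\partial L_i}{\partial w_{ic}} = \frac{\partial L_i}{\partial p_{ik}} \cdot \frac{\partial p_{ik}}{\partial s_{ic}} \cdot \frac{\partial s_{ic}}{\partial w_{ic}}$$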
The distinction from binary classification lies in the following:
- In multiclass classification, only one class has a label of 1, while all others are labeled 0. Without loss of generality, assume $y_{ik}$ equals 1 and all other labels equal 0. Consequently, only the term $y_{ik}\log(p_{ik})$ contributes to the summation in the loss function, i.e., $L_i = -\log(p_{ik})$.
- When differentiating $\frac{\partial p_{ik}}{\partial s_{ic}}$, a case analysis is required depending on whether $c$ equals $k$ (here, $k$ denotes the true class label of the sample, and $c$ denotes the class whose parameter $w_{ic}$ we differentiate with respect to, through its corresponding score $s_{ic}$).
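3.2.1 Computing the First Term: $\frac{\partial L_i}{\partial p_{ik}}$
Without loss of generality, assume $y_{ik}$ equals 1 and all other labels equal 0. Then,
$$L_i = -\log(p_{ik})$$
Differentiating yields:
$$\frac{\partial L_i}{\partial p_{ik}} = -\frac{1}{p_{ik}}$$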
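3.2.2 Computing the Second Term: $\frac{\partial p_{ik}}{\partial s_{ic}}$
This term computes the derivative of the softmax function with respect to its input scores. Let us first recall the definitions of the softmax function and the quotient rule for derivatives:
$$p_{ik} = \frac{e^{s_{ik}}}{\sum_{j} e^{s_{ij}}}, \qquad \left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$$
Here, $k$ denotes the true class label of the sample, and $c$ denotes the class whose parameter $w_{ic}$ we differentiate with respect to, through its corresponding score $s_{ic}$. Two cases arise:
Case 1: $c = k$
Then, the second term’s derivative simplifies to:
$$\frac{\partial p_{ik}}{\partial s_{ic}} = \frac{\partial p_{ik}}{\partial s_{ik}}$$
Differentiating yields:
$$\begin{aligned}
\frac{\partial p_{ik}}{\partial s_{ik}} &= \frac{e^{s_{ik}}\cdot \sum_{j} e^{s_{ij}} - e^{s_{ik}}\cdot e^{s_{ik}}}{\left(\sum_{j} e^{s_{ij}}\right)^2} \\
&= p_{ik} - p_{ik}^2 \\
&= p_{ik}\cdot (1-p_{ik})
\end{aligned}$$
Case 2: $c \neq k$
In this case, $s_{ic}$ appears only in the denominator; thus, differentiating yields:
$$\begin{aligned}
\frac{\partial p_{ik}}{\partial s_{ic}} &= \frac{0\cdot \sum_{j} e^{s_{ij}} - e^{s_{ik}}\cdot e^{s_{ic}}}{\left(\sum_{j} e^{s_{ij}}\right)^2} \\
&= -p_{ik}\cdot p_{ic}
\end{aligned}$$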
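Both cases of the softmax derivative, $p_k(1-p_k)$ on the diagonal and $-p_k p_c$ off it, can be compared with finite differences. This is a numerical sketch only; the score vector is hypothetical and the helper names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(scores):
    """jac[k][c] = d p_k / d s_c, using the two closed-form cases."""
    p = softmax(scores)
    n = len(p)
    return [
        [p[k] * (1 - p[k]) if c == k else -p[k] * p[c] for c in range(n)]
        for k in range(n)
    ]


def numeric_jacobian(scores, eps=1e-6):
    """Central finite differences, perturbing one score at a time."""
    n = len(scores)
    jac = [[0.0] * n for _ in range(n)]
    for c in range(n):
        up, down = list(scores), list(scores)
        up[c] += eps
        down[c] -= eps
        p_up, p_down = softmax(up), softmax(down)
        for k in range(n):
            jac[k][c] = (p_up[k] - p_down[k]) / (2 * eps)
    return jac


scores = [0.5, -1.0, 2.0]   # hypothetical scores for three classes
exact = softmax_jacobian(scores)
approx = numeric_jacobian(scores)
max_gap = max(abs(exact[k][c] - approx[k][c]) for k in range(3) for c in range(3))
print(max_gap)
```

Each row of the matrix sums to zero, reflecting that the probabilities always sum to 1 however the scores change.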
3.2.3 Computing the Third Term: $\frac{\partial s_{ic}}{\partial w_{ic}}$
In general, scores are the result of a linear function applied to the input; thus:
$$s_{ic} = w_{ic} \cdot x_{ic}, \qquad \frac{\partial s_{ic}}{\partial w_{ic}} = x_{ic}$$
3.2.4 Final Result: $\frac{\partial L_{i}}{\partial w_{ic}}$
Case 1: $c = k$
$$\begin{aligned}
\frac{\partial L_{i}}{\partial w_{ic}} &= \frac{\partial L_{i}}{\partial p_{ik}} \cdot \frac{\partial p_{ik}}{\partial s_{ic}} \cdot \frac{\partial s_{ic}}{\partial w_{ic}} \\
&= \left(-\frac{1}{p_{ik}}\right) \cdot \left[p_{ik} \cdot (1-p_{ik})\right] \cdot x_{ik} \\
&= (p_{ik} - 1) \cdot x_{ik} \\
&= (p_{ik} - y_{ik}) \cdot x_{ik} \\
&= \left[\sigma(s_{ik}) - y_{ik}\right] \cdot x_{ik}
\end{aligned}$$
Case 2: $c \neq k$
$$\begin{aligned}
\frac{\partial L_{i}}{\partial w_{ic}} &= \frac{\partial L_{i}}{\partial p_{ik}} \cdot \frac{\partial p_{ik}}{\partial s_{ic}} \cdot \frac{\partial s_{ic}}{\partial w_{ic}} \\
&= \left(-\frac{1}{p_{ik}}\right) \cdot \left[-p_{ik} \cdot p_{ic}\right] \cdot x_{ic} \\
&= p_{ic} \cdot x_{ic} \\
&= (p_{ic} - 0) \cdot x_{ic} \\
&= (p_{ic} - y_{ic}) \cdot x_{ic} \\
&= \left[\sigma(s_{ic}) - y_{ic}\right] \cdot x_{ic}
\end{aligned}$$
Without loss of generality, we assumed above that the true class label for sample $i$ is $k$, so:
$$y_{ik} = 1, \qquad y_{ic} = 0 \quad (c \neq k)$$
Substituting these values of $y$ into the derivative expressions for the two cases gives one unified expression. In vectorized notation, the derivative no longer needs to be written separately for each case and simplifies to:
$$\frac{\partial L_i}{\partial w_i} = [\sigma(s_i) - y_i]\cdot x_i$$
We observe that, after vectorization, the derivative form of the cross-entropy loss function is identical for both binary and multi-class classification.
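The unified result can be checked at the level of the scores: for softmax followed by cross-entropy, the gradient of the loss with respect to the score vector is simply `probs - one_hot`, and multiplying by the input then gives the weight gradient. The sketch below verifies the score-level part numerically, using hypothetical scores and a hypothetical label.

```python
import math


def softmax(scores):
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss(scores, one_hot):
    """Cross-entropy of one sample computed from raw scores."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def analytic_grad(scores, one_hot):
    """Closed form: p_c - y_c for every class c."""
    return [p - t for p, t in zip(softmax(scores), one_hot)]


def numeric_grad(scores, one_hot, eps=1e-6):
    """Central finite differences, perturbing one score at a time."""
    grad = []
    for c in range(len(scores)):
        up, down = list(scores), list(scores)
        up[c] += eps
        down[c] -= eps
        grad.append((loss(up, one_hot) - loss(down, one_hot)) / (2 * eps))
    return grad


scores = [0.2, 1.5, -0.7]   # hypothetical scores
one_hot = [0, 1, 0]         # hypothetical true class: the second one

g_exact = analytic_grad(scores, one_hot)
g_approx = numeric_grad(scores, one_hot)
print(g_exact, g_approx)
```

Only the true class receives a negative gradient, so a descent step raises its score and lowers all the others.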
4. Advantages and Disadvantages
4.1 Advantages
When updating parameters using gradient descent, the model’s learning speed depends on two factors: (1) the learning rate, and (2) the gradient magnitude. The learning rate is a hyperparameter we set manually, so our focus lies on the gradient magnitude. From the equations above, the gradient magnitude depends on $x_i$ and $[\sigma(s) - y]$. The latter matters most: its magnitude is the gap between the predicted probability and the label, so larger values indicate poorer performance. Larger values also produce larger gradients, which accelerates learning. Thus, when sigmoid (or softmax) computes the probabilities and cross-entropy serves as the loss function, the model learns faster when its performance is poor and slows down as performance improves.
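To make this concrete, the sketch below compares the gradient with respect to the score for two losses on a single positive sample. It assumes a sigmoid output and, for the comparison, an MSE of the form $\tfrac{1}{2}(\sigma(s) - y)^2$; that scaling and the chosen scores are illustrative assumptions.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def ce_grad(s, y):
    """d/ds of binary cross-entropy with a sigmoid output: sigma(s) - y."""
    return sigmoid(s) - y


def mse_grad(s, y):
    """d/ds of 0.5 * (sigma(s) - y)**2: (sigma(s) - y) * sigma(s) * (1 - sigma(s))."""
    p = sigmoid(s)
    return (p - y) * p * (1 - p)


y = 1  # positive sample
for s in (-5.0, -2.0, 0.0, 2.0, 5.0):
    # Very negative scores mean a confidently wrong prediction for y = 1.
    print(s, round(abs(ce_grad(s, y)), 4), round(abs(mse_grad(s, y)), 4))
```

For the confidently wrong score of -5, the cross-entropy gradient is close to 1 in magnitude while the MSE gradient is almost zero, because the extra sigmoid-derivative factor vanishes at saturation.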
4.2 Disadvantages
Deng et al. [4] proposed ArcFace Loss in 2019 and identified two shortcomings of Softmax Loss in their paper:
- As the number of classes increases, the size of the linear transformation matrix in the classification layer grows accordingly.
- For closed-set classification problems, the learned features are separable; however, for open-set face recognition tasks, the learned features lack sufficient discriminability. Face recognition is inherently an open-set problem: the number of identities (classes) is large and continually expanding with new faces.
Additionally, sigmoid (or softmax) combined with cross-entropy loss excels at learning inter-class information due to its inter-class competition mechanism—it focuses solely on the accuracy of the predicted probability for the correct class label while ignoring differences among incorrect labels. This results in relatively dispersed learned features. Numerous optimizations have been proposed to address this issue—for example, improvements to softmax such as L-Softmax, SM-Softmax, and AM-Softmax.
5. References
[1]. Blog – Why Neural Network Classification Models Use Cross-Entropy Loss
[2]. Blog – Softmax as a Neural Network Activation Function
[3]. Blog – A Gentle Introduction to Cross-Entropy Loss Function
[4]. Deng, Jiankang, et al. “ArcFace: Additive Angular Margin Loss for Deep Face Recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.






