Cross-Entropy Loss

Original link: Loss Function | Cross-Entropy Loss Function

Xiao Feiyu

This article discusses the Cross-Entropy loss function, commonly used in classification tasks. But why is it so effective for classification? Let’s begin with a simple classification example.

1. Image Classification Task

We aim to predict an animal’s category—cat, dog, or pig—based on features such as its outline and color in an image. Suppose we currently have two models (with different parameters), both outputting probability values for each class via sigmoid or softmax:

Model 1:

| Prediction (cat, dog, pig) | Ground Truth | Correct? |
| --- | --- | --- |
| 0.3 0.3 0.4 | 0 0 1 (pig) | Correct |
| 0.3 0.4 0.3 | 0 1 0 (dog) | Correct |
| 0.1 0.2 0.7 | 1 0 0 (cat) | Incorrect |

Model 1 correctly classifies samples 1 and 2 by only a very narrow margin, while misclassifying sample 3 completely.

Model 2:

| Prediction (cat, dog, pig) | Ground Truth | Correct? |
| --- | --- | --- |
| 0.1 0.2 0.7 | 0 0 1 (pig) | Correct |
| 0.1 0.7 0.2 | 0 1 0 (dog) | Correct |
| 0.3 0.4 0.3 | 1 0 0 (cat) | Incorrect |

Model 2 classifies samples 1 and 2 very accurately, and although it misclassifies sample 3, its prediction is relatively close—not wildly off.

Now that we have our models, we need to define a loss function to evaluate their performance on these samples. So what loss functions could we define?

1.1 Classification Error (Classification Error Rate)

The most straightforward definition of a loss function is:
\text{classification error} = \frac{\text{count of error items}}{\text{count of all items}}

Model 1: \text{classification error} = \frac{1}{3}

Model 2: \text{classification error} = \frac{1}{3}

We know that although both Model 1 and Model 2 misclassify one sample, Model 2 performs comparatively better—and its loss value should therefore be smaller. Unfortunately, the classification error metric fails to reflect this distinction. While intuitive, this loss function thus performs poorly.
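As a quick check of the numbers above, here is a minimal sketch that recomputes the classification error for both models. The helper names `argmax` and `classification_error` are our own, and the predicted class is assumed to be the one with the highest probability.

```python
# Predictions and one-hot labels from the tables above, class order (cat, dog, pig).
y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
model_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
model_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]


def argmax(values):
    """Index of the largest entry (the first one on ties)."""
    return max(range(len(values)), key=lambda j: values[j])


def classification_error(y_true, y_pred):
    """Fraction of samples whose most probable class differs from the label."""
    wrong = sum(argmax(p) != argmax(t) for p, t in zip(y_pred, y_true))
    return wrong / len(y_true)


# Both models misclassify exactly one of three samples.
print(classification_error(y_true, model_1))
print(classification_error(y_true, model_2))
```

Both calls return the same value, which is exactly the problem: the metric only looks at the argmax and discards how confident each prediction was.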

1.2 Mean Squared Error (MSE)

Mean Squared Error is another commonly used loss function, defined as:
MSE = \frac{1}{n}\sum_{i}^{n}(\hat{y_i}-y_i)^2

Model 1:

$$\begin{aligned}
\text{sample 1 loss} &= (0.3-0)^2 + (0.3-0)^2 + (0.4-1)^2 = 0.54 \\
\text{sample 2 loss} &= (0.3-0)^2 + (0.4-1)^2 + (0.3-0)^2 = 0.54 \\
\text{sample 3 loss} &= (0.1-1)^2 + (0.2-0)^2 + (0.7-0)^2 = 1.34
\end{aligned}$$

Averaging over all samples:

MSE = \frac{0.54+0.54+1.34}{3} = 0.81

Model 2:

$$\begin{aligned}
\text{sample 1 loss} &= (0.1-0)^2 + (0.2-0)^2 + (0.7-1)^2 = 0.14 \\
\text{sample 2 loss} &= (0.1-0)^2 + (0.7-1)^2 + (0.2-0)^2 = 0.14 \\
\text{sample 3 loss} &= (0.3-1)^2 + (0.4-0)^2 + (0.3-0)^2 = 0.74
\end{aligned}$$

Averaging over all samples:

MSE = \frac{0.14+0.14+0.74}{3} = 0.34

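We observe that MSE correctly identifies Model 2 as superior to Model 1. So why not adopt MSE as the loss function? The main reason is that, in classification tasks where sigmoid/softmax outputs probabilities, using MSE with gradient descent leads to extremely slow learning during early training stages. For a sigmoid output, the MSE gradient with respect to the score carries the factor \sigma(s)\cdot[1-\sigma(s)], which is close to zero whenever the output saturates near 0 or 1, even when the prediction is badly wrong.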
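The MSE values above can be reproduced with a few lines of plain Python. This is only a sketch that follows the convention used in the worked example (squared errors summed over the classes of each sample, then averaged over samples); the function name `mse` is our own.

```python
# Predictions and one-hot labels from the tables above, class order (cat, dog, pig).
y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
model_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
model_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]


def mse(y_true, y_pred):
    """Mean over samples of the squared error summed across classes."""
    per_sample = [
        sum((p - t) ** 2 for p, t in zip(pred, true))
        for pred, true in zip(y_pred, y_true)
    ]
    return sum(per_sample) / len(per_sample)


print(round(mse(y_true, model_1), 2))  # 0.81
print(round(mse(y_true, model_2), 2))  # 0.34
```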

From the above intuitive analysis, it becomes clear that neither Classification Error nor MSE serves well as a loss function for classification tasks. Next, let’s examine how the Cross-Entropy loss function performs.

1.3 Cross-Entropy Loss Function

1.3.1 Mathematical Expression

(1) Binary Classification

In binary classification, the model must predict one of two outcomes. For each class, the predicted probabilities are p and 1-p. The expression (with natural logarithm base e) is:

L = \frac{1}{N}\sum_{i} L_i = \frac{1}{N}\sum_{i} -[y_i\cdot \log(p_i) + (1-y_i)\cdot \log(1-p_i)]

Where:

  • y_i — ground-truth label for sample i; 1 for positive class, 0 for negative class
  • p_i — predicted probability that sample i belongs to the positive class
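The binary formula translates almost literally into code. The sketch below is one possible implementation, not a reference one: the function name, the sample values, and the small `eps` used to clip probabilities away from 0 and 1 (so that the logarithm stays finite) are all our own assumptions.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumption: clip so log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predicted positive-class probabilities.
labels = [1, 0, 1]
confident = [0.9, 0.1, 0.8]
hesitant = [0.6, 0.4, 0.55]

print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, hesitant))
```

The confident predictions get the smaller loss even though both sets classify every sample correctly at a 0.5 threshold.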

(2) Multi-Class Classification

Multi-class classification extends the binary case:

L = \frac{1}{N}\sum_{i} L_i = - \frac{1}{N}\sum_{i} \sum_{c=1}^{M} y_{ic}\log(p_{ic})

Where:

  • M — number of classes
  • y_{ic} — indicator function (0 or 1): equals 1 if the true class of sample i is c, else 0
  • p_{ic} — predicted probability that sample i belongs to class c

Now, applying this formula to compute the loss values for our earlier examples:

Model 1:

$$\begin{aligned}
\text{sample 1 loss} &= - (0\times \log 0.3 + 0\times \log 0.3 + 1\times \log 0.4) \approx 0.92 \\
\text{sample 2 loss} &= - (0\times \log 0.3 + 1\times \log 0.4 + 0\times \log 0.3) \approx 0.92 \\
\text{sample 3 loss} &= - (1\times \log 0.1 + 0\times \log 0.2 + 0\times \log 0.7) \approx 2.30
\end{aligned}$$

Averaging over all samples:

L = \frac{0.92+0.92+2.30}{3} \approx 1.38

Model 2:

$$\begin{aligned}
\text{sample 1 loss} &= - (0\times \log 0.1 + 0\times \log 0.2 + 1\times \log 0.7) \approx 0.36 \\
\text{sample 2 loss} &= - (0\times \log 0.1 + 1\times \log 0.7 + 0\times \log 0.2) \approx 0.36 \\
\text{sample 3 loss} &= - (1\times \log 0.3 + 0\times \log 0.4 + 0\times \log 0.3) \approx 1.20
\end{aligned}$$

Averaging over all samples:

L = \frac{0.36+0.36+1.20}{3} \approx 0.64

The above computation can be implemented using Python’s sklearn library:

from sklearn.metrics import log_loss

# One-hot ground-truth labels, class order (cat, dog, pig)
y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
# Predicted class probabilities from Model 1 and Model 2
y_pred_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
y_pred_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]
print(log_loss(y_true, y_pred_1))  # 1.3783888522474517
print(log_loss(y_true, y_pred_2))  # 0.6391075640678003

As observed, the cross-entropy loss function effectively captures the performance difference between Model 1 and Model 2.
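For readers who prefer to see the formula itself rather than a library call, the same values can be reproduced directly with NumPy. This is a minimal sketch that assumes every predicted probability is strictly positive (no clipping is applied); the function name `cross_entropy` is our own.

```python
import numpy as np


def cross_entropy(y_true, y_pred):
    """Mean over samples of -sum_c y_ic * log(p_ic)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumption: all probabilities are > 0, so log() is finite.
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return per_sample.mean()


y_true = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
y_pred_1 = [[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]
y_pred_2 = [[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]]

print(cross_entropy(y_true, y_pred_1))
print(cross_entropy(y_true, y_pred_2))
```

Up to floating-point rounding, the two printed values agree with the sklearn results quoted above.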

2. Function Properties

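The per-sample loss -\log(p) is convex in the predicted probability p, so it has no local minima other than the global one, and following the derivative downhill leads to the globally optimal solution. Note that this convexity is with respect to p; with respect to the weights of a deep network the loss is generally not convex.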
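Convexity of the per-sample loss in the predicted probability can be spot-checked numerically. The sketch below tests the midpoint inequality on a coarse grid of probabilities; the grid and the variable names are our own choices, and a grid check is only an illustration, not a proof.

```python
import math


def loss(p):
    """Per-sample cross-entropy when the true class is given probability p."""
    return -math.log(p)


# Midpoint convexity: loss((a + b) / 2) <= (loss(a) + loss(b)) / 2.
grid = [k / 20 for k in range(1, 21)]  # 0.05, 0.10, ..., 1.00
is_convex_on_grid = all(
    loss((a + b) / 2) <= (loss(a) + loss(b)) / 2 + 1e-12
    for a in grid
    for b in grid
)
print(is_convex_on_grid)  # True
```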

3. Learning Process

The cross-entropy loss function is the standard choice for classification tasks, especially in neural networks. Because it operates on the predicted probability of each class, it almost always appears together with the sigmoid (or softmax) function.

Let us examine the complete prediction, loss computation, and learning pipeline using the final-layer output of a neural network:

  1. The neural network’s final layer produces raw scores (also called logits) for each class;
  2. These scores are passed through the sigmoid (or softmax) function, yielding probability outputs;
  3. The predicted class probabilities are then fed into the cross-entropy loss function together with the one-hot encoded true labels.
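The three steps above can be sketched in NumPy as follows. The logits and labels are made-up values for illustration, and subtracting the row maximum inside `softmax` is a common numerical-stability trick that we add here; it does not change the result.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp()."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy(probs, one_hot):
    """Mean over samples of -sum_c y_ic * log(p_ic)."""
    return -np.sum(one_hot * np.log(probs), axis=1).mean()


# Step 1: hypothetical raw scores (logits) from the final layer, 2 samples x 3 classes.
logits = np.array([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
# Step 2: turn the scores into class probabilities.
probs = softmax(logits)
# Step 3: compare the probabilities with the one-hot labels.
one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
loss = cross_entropy(probs, one_hot)

print(probs)
print(loss)
```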

Learning tasks fall into two categories: binary classification and multi-class classification. We discuss the learning process for both cases separately.

3.1 Binary Classification Case

[Figure: learning process of the binary cross-entropy loss function]

As illustrated above, the derivative computation comprises three sub-processes, i.e., the product of three partial derivatives:

\frac{\partial L_i}{\partial w_i}=\frac{\partial L_i}{\partial p_i}\cdot \frac{\partial p_i}{\partial s_i}\cdot \frac{\partial s_i}{\partial w_i}

3.1.1 Computing the First Term: \frac{\partial L_i}{\partial p_i}

L_i = -[y_i\cdot \log(p_i) + (1-y_i)\cdot \log(1-p_i)]

  • p_i denotes the predicted probability that sample i belongs to the positive class.
  • y_i is an indicator function: it equals 1 if sample i belongs to the positive class, and 0 otherwise.

$$\begin{aligned}
\frac{\partial L_i}{\partial p_i} &= \frac{\partial -[y_i\cdot \log(p_i) + (1-y_i)\cdot \log(1-p_i)]}{\partial p_i} \\
&= -\frac{y_i}{p_i}-[(1-y_i)\cdot \frac{1}{1-p_i}\cdot (-1)] \\
&= -\frac{y_i}{p_i}+\frac{1-y_i}{1-p_i} \\
&= -\frac{y_i}{\sigma(s_i)}+\frac{1-y_i}{1-\sigma(s_i)}
\end{aligned}$$

3.1.2 Computing the Second Term: \frac{\partial p_i}{\partial s_i}

This term represents the derivative of the sigmoid function with respect to the score. First, recall the definition of the sigmoid function and the quotient rule for differentiation:

p = \sigma(s) = \frac{e^{s}}{1+e^{s}}
f(x) = \frac{g(x)}{h(x)} \quad\Rightarrow\quad f'(x) = \frac{g'(x)h(x)-g(x)h'(x)}{h^2(x)}

$$\begin{aligned}
\frac{\partial p_i}{\partial s_i} &= \frac{(e^{s_i})'\cdot (1+e^{s_i})-e^{s_i}\cdot (1+e^{s_i})'}{(1+e^{s_i})^2} \\
&= \frac{e^{s_i}\cdot (1+e^{s_i})-e^{s_i}\cdot e^{s_i}}{(1+e^{s_i})^2} \\
&= \frac{e^{s_i}}{(1+e^{s_i})^2} \\
&= \frac{e^{s_i}}{1+e^{s_i}}\cdot \frac{1}{1+e^{s_i}} \\
&= \sigma(s_i)\cdot [1-\sigma(s_i)]
\end{aligned}$$
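A derivative like this is easy to sanity-check with a central finite difference. The sketch below compares the closed form with a numerical estimate at a few arbitrary scores; the helper names, the test points, and the step size `h` are our own choices. The code writes the sigmoid as 1 / (1 + e^{-s}), which is algebraically the same as the definition above.

```python
import math


def sigmoid(s):
    """Same function as e^s / (1 + e^s), written in the usual 1 / (1 + e^-s) form."""
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_grad(s):
    """Closed-form derivative derived above: sigma(s) * (1 - sigma(s))."""
    p = sigmoid(s)
    return p * (1.0 - p)


def numeric_grad(f, s, h=1e-6):
    """Central finite-difference estimate of f'(s)."""
    return (f(s + h) - f(s - h)) / (2 * h)


# Largest disagreement between the two over a few arbitrary scores.
max_gap = max(
    abs(sigmoid_grad(s) - numeric_grad(sigmoid, s))
    for s in [-4.0, -1.0, 0.0, 0.5, 3.0]
)
print(max_gap)
```

The printed gap should be tiny, limited only by floating-point error in the finite difference.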

3.1.3 Computing the Third Term: \frac{\partial s_i}{\partial w_i}

Generally, scores result from a linear transformation of the input; hence:

\frac{\partial s_i}{\partial w_i}=x_i

3.1.4 Final Result: \frac{\partial L_i}{\partial w_i}

$$\begin{aligned}
\frac{\partial L_i}{\partial w_i} &= \frac{\partial L_i}{\partial p_i}\cdot \frac{\partial p_i}{\partial s_i}\cdot \frac{\partial s_i}{\partial w_i} \\
&= \left[-\frac{y_i}{\sigma(s_i)}+\frac{1-y_i}{1-\sigma(s_i)}\right] \cdot \sigma(s_i)\cdot [1-\sigma(s_i)]\cdot x_i \\
&= \left[-\frac{y_i}{\sigma(s_i)}\cdot \sigma(s_i)\cdot (1-\sigma(s_i))+\frac{1-y_i}{1-\sigma(s_i)}\cdot \sigma(s_i)\cdot (1-\sigma(s_i))\right]\cdot x_i \\
&= [-y_i+y_i\cdot \sigma(s_i)+\sigma(s_i)-y_i\cdot \sigma(s_i)]\cdot x_i \\
&= [\sigma(s_i)-y_i]\cdot x_i
\end{aligned}$$

As shown above, we obtain an elegant result. Thus, using the cross-entropy loss function not only effectively measures model performance but also facilitates straightforward derivative computation.
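The final expression can be verified with a numerical gradient check. The sketch below assumes a linear score s = w·x with no bias term; the weights, input, and label are hypothetical values, and the function names are our own.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def score(w, x):
    """Linear score s = w . x (bias omitted for brevity)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def loss(w, x, y):
    """Binary cross-entropy of a single sample."""
    p = sigmoid(score(w, x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def analytic_grad(w, x, y):
    """Result derived above: (sigma(s) - y) * x."""
    return [(sigmoid(score(w, x)) - y) * xj for xj in x]


def numeric_grad(w, x, y, h=1e-6):
    """Central finite difference, one weight at a time."""
    grad = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        grad.append((loss(up, x, y) - loss(down, x, y)) / (2 * h))
    return grad


# Hypothetical weights, input features, and label.
w = [0.2, -0.5, 0.1]
x = [1.0, 2.0, -1.5]
y = 1

print(analytic_grad(w, x, y))
print(numeric_grad(w, x, y))
```

The two printed vectors should agree to several decimal places.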

3.2 Multiclass Case

[Figure: learning process of the multiclass cross-entropy loss function]

As illustrated in the figure above, the differentiation process can be decomposed into three sub-processes:

\frac{\partial L_i}{\partial w_{ic}}=\frac{\partial L_{i}}{\partial p_{ik}}\cdot \frac{\partial p_{ik}}{\partial s_{ic}}\cdot \frac{\partial s_{ic}}{\partial w_{ic}}

The distinction from binary classification lies in the following:

  • In multiclass classification, only one class has a label of 1, while all others are labeled 0. Without loss of generality, assume y_{ik} equals 1, and all other labels equal 0. Consequently, only the term y_{ik} contributes to the summation in the loss function, i.e., L_i = -\log(p_{ik}).

  • When differentiating \frac{\partial p_{ik}}{\partial s_{ic}}, a case analysis is required depending on whether c equals k (here, k denotes the true class label of the sample, and c denotes the index of the parameter w_{ic} for which we compute the gradient with respect to its corresponding score s_{ic}).

3.2.1 Computing the First Term: \frac{\partial L_i}{\partial p_{ik}}

Without loss of generality, assume y_{ik} equals 1, and all other labels equal 0. Then,

L_i = -\log(p_{ik})

Differentiating yields:

$$\begin{aligned}
\frac{\partial L_i}{\partial p_{ik}} &= \frac{\partial -\log(p_{ik})}{\partial p_{ik}} \\
&= -\frac{1}{p_{ik}}
\end{aligned}$$

3.2.2 Computing the Second Term: \frac{\partial p_{ik}}{\partial s_{ic}}

This term computes the derivative of the softmax function with respect to its input scores. Let us first recall the definitions of the softmax function and the quotient rule for derivatives:

p_{ik} = \sigma(s_{ik}) = \frac{e^{s_{ik}}}{\sum e^{s_{ij}}}
f(x) = \frac{g(x)}{h(x)} \quad\Rightarrow\quad f'(x) = \frac{g'(x)h(x)-g(x)h'(x)}{h^2(x)}

Here, k denotes the true class label of the sample, and c denotes the index of the parameter w_{ic} for which we compute the gradient with respect to its corresponding score s_{ic}. Two cases arise:

Case 1: c = k

Then, the second term's derivative simplifies to:

\frac{\partial p_{ik}}{\partial s_{ic}} = \frac{\partial p_{ik}}{\partial s_{ik}}

Differentiating yields:

$$\begin{aligned}
\frac{\partial p_{ik}}{\partial s_{ik}} &= \frac{\frac{\partial e^{s_{ik}}}{\partial s_{ik}}\cdot \sum e^{s_{ij}} - e^{s_{ik}}\cdot \frac{\partial \sum e^{s_{ij}}}{\partial s_{ik}}}{(\sum e^{s_{ij}})^2} \\
&= \frac{e^{s_{ik}}\cdot \sum e^{s_{ij}}-e^{s_{ik}}\cdot e^{s_{ik}}}{(\sum e^{s_{ij}})^2} \\
&= \frac{e^{s_{ik}}}{\sum e^{s_{ij}}} - \left(\frac{e^{s_{ik}}}{\sum e^{s_{ij}}}\right)^2 \\
&= \frac{e^{s_{ik}}}{\sum e^{s_{ij}}}\cdot \left(1-\frac{e^{s_{ik}}}{\sum e^{s_{ij}}}\right) \\
&= p_{ik}\cdot (1-p_{ik})
\end{aligned}$$

Case 2: c \neq k

In this case, s_{ic} appears only in the denominator; thus, differentiating yields:

$$\begin{aligned}
\frac{\partial p_{ik}}{\partial s_{ic}} &= \frac{\frac{\partial e^{s_{ik}}}{\partial s_{ic}}\cdot \sum e^{s_{ij}} - e^{s_{ik}}\cdot \frac{\partial \sum e^{s_{ij}}}{\partial s_{ic}}}{(\sum e^{s_{ij}})^2} \\
&= \frac{0\cdot \sum e^{s_{ij}}-e^{s_{ik}}\cdot e^{s_{ic}}}{(\sum e^{s_{ij}})^2} \\
&= -\frac{e^{s_{ik}}\cdot e^{s_{ic}}}{(\sum e^{s_{ij}})^2} \\
&= -\frac{e^{s_{ik}}}{\sum e^{s_{ij}}}\cdot \frac{e^{s_{ic}}}{\sum e^{s_{ij}}} \\
&= -p_{ik}\cdot p_{ic}
\end{aligned}$$
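Both cases fit into a single matrix: p_k(1 - p_k) on the diagonal and -p_k p_c off the diagonal, i.e. diag(p) - p pᵀ. The NumPy sketch below builds this matrix for one sample and compares it against a finite-difference estimate; the scores are arbitrary and the function names are our own.

```python
import numpy as np


def softmax(s):
    """Softmax of a 1-D score vector (max subtracted for numerical stability)."""
    exp = np.exp(s - s.max())
    return exp / exp.sum()


def softmax_jacobian(s):
    """J[k, c] = dp_k/ds_c: p_k(1 - p_k) if c == k, else -p_k * p_c."""
    p = softmax(s)
    return np.diag(p) - np.outer(p, p)


def numeric_jacobian(s, h=1e-6):
    """Central finite differences, one score at a time."""
    jac = np.zeros((s.size, s.size))
    for c in range(s.size):
        step = np.zeros_like(s)
        step[c] = h
        jac[:, c] = (softmax(s + step) - softmax(s - step)) / (2 * h)
    return jac


scores = np.array([1.0, -0.5, 2.0])  # hypothetical scores for one sample
print(softmax_jacobian(scores))
print(numeric_jacobian(scores))
```

Each row of the matrix sums to zero, which reflects the fact that the probabilities always sum to 1.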

3.2.3 Computing the Third Term: \frac{\partial s_{ic}}{\partial w_{ic}}

In general, scores are the result of a linear function applied to the input; thus:

\frac{\partial s_{ic}}{\partial w_{ic}} = x_{ic}

3.2.4 Final Result: \frac{\partial L_{i}}{\partial w_{ic}}

\frac{\partial L_{i}}{\partial w_{ic}} = \frac{\partial L_{i}}{\partial p_{ik}} \cdot \frac{\partial p_{ik}}{\partial s_{ic}} \cdot \frac{\partial s_{ic}}{\partial w_{ic}}

Case 1: c = k

$$\begin{aligned}
\frac{\partial L_{i}}{\partial w_{ic}} &= \frac{\partial L_{i}}{\partial p_{ik}} \cdot \frac{\partial p_{ik}}{\partial s_{ic}} \cdot \frac{\partial s_{ic}}{\partial w_{ic}} \\
&= \left(-\frac{1}{p_{ik}}\right) \cdot \left[p_{ik} \cdot (1-p_{ik})\right] \cdot x_{ik} \\
&= (p_{ik} - 1) \cdot x_{ik} \\
&= (p_{ik} - y_{ik}) \cdot x_{ik} \\
&= \left[\sigma(s_{ik}) - y_{ik}\right] \cdot x_{ik}
\end{aligned}$$

Case 2: c \neq k

$$\begin{aligned}
\frac{\partial L_{i}}{\partial w_{ic}} &= \frac{\partial L_{i}}{\partial p_{ik}} \cdot \frac{\partial p_{ik}}{\partial s_{ic}} \cdot \frac{\partial s_{ic}}{\partial w_{ic}} \\
&= \left(-\frac{1}{p_{ik}}\right) \cdot \left[-p_{ik} \cdot p_{ic}\right] \cdot x_{ic} \\
&= p_{ic} \cdot x_{ic} \\
&= (p_{ic} - 0) \cdot x_{ic} \\
&= (p_{ic} - y_{ic}) \cdot x_{ic} \\
&= \left[\sigma(s_{ic}) - y_{ic}\right] \cdot x_{ic}
\end{aligned}$$

Without loss of generality, we assumed above that the true class label for sample i is k, so:

y_{ik} = 1
y_{ic} = 0,\quad c \neq k

When substituting the values of y into the derivative expressions for different cases, we obtain a unified expression. Using vectorized notation, the derivative no longer needs to be written separately for each case and simplifies to:

\frac{\partial L_{i}}{\partial w_{i}} = \left[\sigma(s_i) - y_i\right] \cdot x_i

We observe that, after vectorization, the derivative form of the cross-entropy loss function is identical for both binary and multi-class classification.
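The unified expression can also be checked numerically for the multi-class case. The sketch below assumes the scores come from a weight matrix W applied to one input vector x (no bias), so that the gradient for row c of W is (p_c - y_c)·x; the shapes, the random seed, and the function names are our own choices.

```python
import numpy as np


def softmax(s):
    exp = np.exp(s - s.max())
    return exp / exp.sum()


def loss(W, x, y):
    """Cross-entropy of one sample with scores s = W @ x."""
    return -np.sum(y * np.log(softmax(W @ x)))


def analytic_grad(W, x, y):
    """Unified result: row c of the gradient is (p_c - y_c) * x."""
    return np.outer(softmax(W @ x) - y, x)


def numeric_grad(W, x, y, h=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = h
        grad[idx] = (loss(W + step, x, y) - loss(W - step, x, y)) / (2 * h)
    return grad


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # hypothetical weights: 3 classes, 4 features
x = rng.normal(size=4)         # hypothetical input
y = np.array([0.0, 0.0, 1.0])  # one-hot label, true class k = 2

print(np.max(np.abs(analytic_grad(W, x, y) - numeric_grad(W, x, y))))
```

The printed maximum difference should be close to zero for both the c = k row and the c ≠ k rows, without any case distinction in the code.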

4. Advantages and Disadvantages

4.1 Advantages

When updating parameters using gradient descent, the model’s learning speed depends on two factors: (1) the learning rate, and (2) the gradient magnitude. The learning rate is a hyperparameter we set manually, so our focus lies on the gradient magnitude. From the equations above, we see that the gradient magnitude depends on x_i and [\sigma(s) - y]. We pay particular attention to the latter: its magnitude reflects how poorly the model performs—larger values indicate poorer performance. Importantly, larger magnitudes also produce larger gradients, accelerating learning. Thus, when using the sigmoid (or softmax) activation to compute probabilities combined with cross-entropy as the loss function, the model learns faster when its performance is poor and slows down as performance improves.
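To make this concrete, the sketch below compares the gradient with respect to the score s for cross-entropy and for a squared-error loss (\sigma(s) - y)^2 on a single sigmoid output. The choice of scores is arbitrary, and the squared-error gradient is our own derivation via the chain rule, included only for contrast.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def ce_grad(s, y):
    """d/ds of binary cross-entropy: sigma(s) - y."""
    return sigmoid(s) - y


def mse_grad(s, y):
    """d/ds of (sigma(s) - y)^2: 2 * (sigma(s) - y) * sigma(s) * (1 - sigma(s))."""
    p = sigmoid(s)
    return 2.0 * (p - y) * p * (1.0 - p)


y = 1  # true label is the positive class
for s in [-6.0, -3.0, 0.0, 3.0]:
    # Very negative s means a confident but wrong prediction.
    print(s, round(ce_grad(s, y), 4), round(mse_grad(s, y), 4))
```

For the confidently wrong score s = -6, the cross-entropy gradient stays close to -1 while the squared-error gradient nearly vanishes, because the extra \sigma(s)(1-\sigma(s)) factor is tiny when the sigmoid saturates.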

4.2 Disadvantages

Deng et al. [4] proposed ArcFace Loss in 2019 and identified two shortcomings of Softmax Loss in their paper:

  1. As the number of classes increases, the size of the linear transformation matrix in the classification layer grows accordingly.
  2. For closed-set classification problems, the learned features are separable; however, for open-set face recognition tasks, the learned features lack sufficient discriminability. Face recognition is inherently an open-set problem: the number of identities (classes) is large and continually expanding with new faces.

Additionally, sigmoid (or softmax) combined with cross-entropy loss excels at learning inter-class information due to its inter-class competition mechanism—it focuses solely on the accuracy of the predicted probability for the correct class label while ignoring differences among incorrect labels. This results in relatively dispersed learned features. Numerous optimizations have been proposed to address this issue—for example, improvements to softmax such as L-Softmax, SM-Softmax, and AM-Softmax.

5. References

[1]. Blog – Why Neural Network Classification Models Use Cross-Entropy Loss

[2]. Blog – Softmax as a Neural Network Activation Function

[3]. Blog – A Gentle Introduction to Cross-Entropy Loss Function

[4]. Deng, Jiankang, et al. “Arcface: Additive angular margin loss for deep face recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

