This article was transcoded by 简悦 SimpRead; the original is at martinlwx.github.io
Each column represents the actual class and each row represents our prediction; together they form a confusion matrix. For example, a binary classification task produces the following confusion matrix:

| | Positive | Negative |
|---|---|---|
| Predicted True | TP = True Positive | FP = False Positive |
| Predicted False | FN = False Negative | TN = True Negative |
## How many samples did we predict correctly?
The cases where the prediction is correct are:
- TP: It was positive, and you also predicted positive
- TN: It was negative, and you also predicted negative
Dividing TP + TN (the main diagonal of the table) by the total number of samples gives the accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Looking directly at the table: the numerator is the sum of the two diagonal cells, and the denominator is the sum of all four cells.
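As a quick sketch, the computation with hypothetical counts (the numbers below are made up purely for illustration):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = correctly predicted samples / all samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for illustration: 100 samples, 85 on the diagonal
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```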
Accuracy is usually a useful evaluation metric, but not always; a typical failure case is imbalanced sample data.
| | Class A | Class B |
|---|---|---|
| Predicted class A | 0 | 0 |
| Predicted class B | 1 | 99 |
Suppose there is 1 sample of class A and 99 samples of class B. If the classifier always returns class B regardless of input, what is the accuracy?

The answer is (0 + 99) / (0 + 0 + 1 + 99) = 99%. Can we say this classifier is good? Obviously not.
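This degenerate classifier can be reproduced in a few lines of plain Python (labels "A" and "B" as in the table above):

```python
# 1 sample of class A, 99 samples of class B
y_true = ["A"] + ["B"] * 99
# A classifier that ignores its input and always predicts class B
y_pred = ["B"] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
print(correct / len(y_true))  # 0.99
```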
## Of all samples predicted as True, how many are truly positive?
By definition, this is the proportion of true positives in the "Predicted True" row of the table. This metric is the Precision: TP / (TP + FP)
## Of all originally positive samples, how many did we successfully predict?
By definition, all originally positive samples make up the "Positive" column, and the successfully predicted ones are TP. This metric is the Recall: TP / (TP + FN)
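Both definitions reduce to one-line functions; a minimal sketch with hypothetical counts (the numbers are made up for illustration):

```python
def precision(tp: int, fp: int) -> float:
    """Of all samples predicted positive, the fraction that truly are."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Of all truly positive samples, the fraction we caught."""
    return tp / (tp + fn)

# Hypothetical counts for illustration
print(precision(tp=40, fp=10))  # 0.8
print(recall(tp=40, fn=40))     # 0.5
```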
We often need to trade off Precision against Recall, because increasing one typically lowers the other.

Precisely because of this tension between Precision and Recall, evaluating classifiers can be troublesome, especially when two classifiers have similar values. So we combine the two metrics into a single one: the F1 score.
The F1 score is the harmonic mean of Precision and Recall:

F1 = 2PR / (P + R) = 2 / (1/P + 1/R)

where P = Precision and R = Recall.
- Macro: calculate P and R for each class separately, average them to get an overall P and R, then plug these averages into the F1 formula
- Micro: merge the per-class counts into one table, then calculate P, R, and the F1 score from the pooled counts
This will be illustrated with examples later.
- The F1 score always lies between Precision and Recall.
- The F1 score gives more weight to the lower of the two values, so if two classifiers have the same arithmetic mean of P and R, the one with the lower minimum has the worse F1 score:
  - Case 1: if P and R are both 60, then F1 = 60
  - Case 2: if P and R are 50 and 70, the arithmetic mean is the same as in case 1, but F1 = 2 × 50 × 70 / 120 ≈ 58.3
- It follows that a high F1 score does not necessarily mean a classifier is better or more suitable for your task; sometimes you care more about Precision or Recall individually.
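The two cases above can be checked directly with the harmonic-mean formula:

```python
def f1(p: float, r: float) -> float:
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Case 1: P = R = 60, so F1 equals both
print(f1(60, 60))            # 60.0
# Case 2: same arithmetic mean, but F1 is dragged toward the lower value
print(round(f1(50, 70), 1))  # 58.3
```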
Consider a 3-class example with the following confusion matrix (here, following sklearn's convention, rows are the actual classes and columns are the predictions):

| | predict_class0 | predict_class1 | predict_class2 |
|---|---|---|---|
| class 0 | 2 | 0 | 0 |
| class 1 | 1 | 0 | 1 |
| class 2 | 0 | 2 | 0 |
First, get the Precision and Recall for each class by drawing a one-vs-rest table for each:
| | class 0 | Not class 0 |
|---|---|---|
| predict_class0 | 2 | 1 |
| predict_not_class0 | 0 | 3 |

| | class 1 | Not class 1 |
|---|---|---|
| predict_class1 | 0 | 2 |
| predict_not_class1 | 2 | 2 |

| | class 2 | Not class 2 |
|---|---|---|
| predict_class2 | 0 | 1 |
| predict_not_class2 | 2 | 3 |
The results are:
- class 0: Precision = 2/3; Recall = 1
- class 1: Precision = 0; Recall = 0
- class 2: Precision = 0; Recall = 0
- So the average Precision is (2/3 + 0 + 0)/3 = 2/9 ≈ 0.222 and the average Recall is (1 + 0 + 0)/3 = 1/3; plugging these into the F1 formula gives 2 / (9/2 + 3) = 4/15 ≈ 0.2667
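A quick sketch of this macro computation (note that scikit-learn's `average='macro'` instead averages the per-class F1 scores, which happens to give the same number on this example):

```python
# Per-class precision and recall read off the three one-vs-rest tables above
precisions = [2 / 3, 0, 0]
recalls = [1, 0, 0]

macro_p = sum(precisions) / len(precisions)  # 2/9
macro_r = sum(recalls) / len(recalls)        # 1/3
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)
print(round(macro_f1, 5))  # 0.26667
```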
Stack the three tables (element-wise addition) to get the following pooled table:

| | class ? | Not class ? |
|---|---|---|
| predict_class? | 2 | 4 |
| predict_not_class? | 4 | 8 |
Using the formula on the pooled counts: P = R = 2/6 = 1/3, so the micro F1 score = 1/3 ≈ 0.3333.
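The micro computation as a sketch, using the pooled counts from the stacked table:

```python
# Pooled counts from the stacked table: TP = 2, FP = 4, FN = 4
TP, FP, FN = 2, 4, 4

micro_p = TP / (TP + FP)  # 1/3
micro_r = TP / (TP + FN)  # 1/3
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
print(round(micro_f1, 5))  # 0.33333
```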
We can verify both results with scikit-learn:

```python
from sklearn.metrics import f1_score, confusion_matrix

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Note: sklearn's confusion_matrix puts true labels on rows
# and predicted labels on columns
print(confusion_matrix(y_true, y_pred))
# [[2 0 0]
#  [1 0 1]
#  [0 2 0]]
print(f1_score(y_true, y_pred, average='macro'))  # 0.26666...
print(f1_score(y_true, y_pred, average='micro'))  # 0.33333...
```