From Confusion Matrix to F1 Score

This article is transcoded by 简悦 SimpRead, original at martinlwx.github.io

Each column represents the actual condition, each row represents our prediction, and combined, this forms a confusion matrix. For example, a binary classification task can create the following confusion matrix :down_arrow:

|                 | Positive            | Negative            |
| --------------- | ------------------- | ------------------- |
| Predicted True  | TP = True Positive  | FP = False Positive |
| Predicted False | FN = False Negative | TN = True Negative  |

How many samples did we predict correctly?

The cases where predictions are correct are as follows :down_arrow:

  • TP: It was positive, and you also predicted positive
  • TN: It was negative, and you also predicted negative

Then dividing TP + TN (which is actually the main diagonal) by the total number of samples gives the accuracy:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

:ledger: Or just read it off the table: the denominator is the sum of all four cells, TP + TN + FP + FN, and the numerator is the main diagonal, TP + TN
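The formula is a two-liner in Python; the four counts below are made up purely for illustration :down_arrow:

```python
# Hypothetical cell counts from a binary confusion matrix
tp, fp, fn, tn = 40, 5, 10, 45

# Main diagonal divided by the sum of all four cells
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```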

Generally, accuracy is a useful evaluation metric, but in some cases it isn't, such as when the classes are imbalanced.

|                   | Class A | Class B |
| ----------------- | ------- | ------- |
| Predicted class A | 0       | 0       |
| Predicted class B | 1       | 99      |

Suppose there is 1 sample of class A and 99 samples of class B; if you always return class B regardless of input, what is the accuracy?

The answer is (0 + 99) / (0 + 0 + 1 + 99) = 99%. Can we say this classifier is good? This is obviously absurd :joy:
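We can reproduce this degenerate case directly; the "classifier" here is just a constant answer :down_arrow:

```python
# 1 sample of class A and 99 samples of class B
y_true = ["A"] + ["B"] * 99
# A classifier that always answers "B", regardless of input
y_pred = ["B"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99
```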

Of all samples predicted as True, how many are truly positive?

By definition, this is the proportion of true positives in the “Predicted True” row of the table, which is TP / (TP + FP). This is the Precision.

Of all originally positive samples, how many did we successfully predict?

By definition, all originally positive samples are the sum of the “Positive” column, and the successfully predicted ones are TP, so it is TP / (TP + FN). This is the Recall.
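Both definitions read straight off the confusion matrix cells; again the counts below are made up for illustration :down_arrow:

```python
# Hypothetical binary confusion-matrix cells
tp, fp, fn = 40, 5, 10

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much we found
print(round(precision, 3))  # 0.889
print(recall)               # 0.8
```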

:ledger: We often need to trade off between Precision and Recall, because increasing one typically lowers the other

Precisely because of this trade-off between Precision and Recall, comparing classifiers can be troublesome, especially when the two metrics are close. So we need to combine these two metrics into a single metric, the F1 score.

The calculation of F1 score is as follows:

F1\ score = \frac{2PR}{P + R}

where P = Precision and R = Recall

  1. Macro: Calculate the P and R for each class separately, then calculate the overall average P and R, finally use these to calculate the F1 score
  2. Micro: Combine multiple class statistics into one table, then calculate P, R, and F1 score

This will be illustrated with examples below.

  1. The F1 score is always between Precision and Recall
  2. The F1 score gives more weight to the lower of the two values, so if two classifiers have the same arithmetic mean of P and R, the one with the larger gap between them gets the worse F1 score
    1. Case 1: If P and R are both 60, the F1 score = 60
    2. Case 2: If P and R are 50 and 70 respectively, their average is the same as case 1, but F1 score = 58.3
  3. :ledger: From the above, note that a high F1 score does not necessarily mean the classifier is better or more suitable for your task; sometimes you may care more about either Precision or Recall
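The two cases above can be checked with the formula directly :down_arrow:

```python
def f1(p, r):
    # F1 is the harmonic mean of Precision and Recall
    return 2 * p * r / (p + r)

print(f1(60, 60))            # 60.0 -> case 1
print(round(f1(50, 70), 1))  # 58.3 -> case 2: same arithmetic mean, lower F1
```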
Consider a 3-class task with 6 samples and the following confusion matrix (as before, columns are the actual classes and rows are our predictions) :down_arrow:

|                | class 0 | class 1 | class 2 |
| -------------- | ------- | ------- | ------- |
| predict_class0 | 2       | 1       | 0       |
| predict_class1 | 0       | 0       | 2       |
| predict_class2 | 0       | 1       | 0       |

For Macro, first get the Precision and Recall for each class by drawing a one-vs-rest table for each :down_arrow:

|                    | class 0 | Not class 0 |
| ------------------ | ------- | ----------- |
| predict_class0     | 2       | 1           |
| predict_not_class0 | 0       | 3           |

|                    | class 1 | Not class 1 |
| ------------------ | ------- | ----------- |
| predict_class1     | 0       | 2           |
| predict_not_class1 | 2       | 2           |

|                    | class 2 | Not class 2 |
| ------------------ | ------- | ----------- |
| predict_class2     | 0       | 1           |
| predict_not_class2 | 2       | 3           |

The results are:

  • class 0: Precision = 2/3; Recall = 1
  • class 1: Precision = 0; Recall = 0
  • class 2: Precision = 0; Recall = 0
  • So the average Precision is (2/3 + 0 + 0)/3 = 2/9 ≈ 0.222 and the average Recall is (1 + 0 + 0)/3 = 1/3; plugging these into the F1 score formula gives 2 / (1/P + 1/R) = 2 / (9/2 + 3) = 4/15 ≈ 0.2667
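A quick sanity check in plain Python, using the per-class values from the one-vs-rest tables above. One caveat worth flagging: sklearn's `average='macro'` actually averages the per-class F1 scores rather than averaging P and R first; for this particular example the two definitions happen to coincide at 4/15 :down_arrow:

```python
# Per-class Precision and Recall from the three one-vs-rest tables
precisions = [2/3, 0, 0]
recalls = [1, 0, 0]

def f1(p, r):
    # define F1 as 0 when both P and R are 0, to avoid dividing by zero
    return 2 * p * r / (p + r) if p + r else 0.0

# Method described above: average P and R first, then combine
p_avg = sum(precisions) / 3  # 2/9
r_avg = sum(recalls) / 3     # 1/3
print(round(f1(p_avg, r_avg), 4))  # 0.2667

# sklearn's average='macro': average the per-class F1 scores (0.8, 0, 0)
print(round(sum(f1(p, r) for p, r in zip(precisions, recalls)) / 3, 4))  # 0.2667
```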

For Micro, stack the three tables (element-wise addition) to get the following table

|                    | class ? | Not class ? |
| ------------------ | ------- | ----------- |
| predict_class?     | 2       | 4           |
| predict_not_class? | 4       | 8           |

Using the formulas on the pooled table: Precision = Recall = 2 / (2 + 4) = 1/3, so the F1 score = 1/3 ≈ 0.3333
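The same micro computation, step by step :down_arrow:

```python
# Cells of the stacked (pooled) one-vs-rest table
tp, fp, fn, tn = 2, 4, 4, 8

precision = tp / (tp + fp)  # 2/6
recall = tp / (tp + fn)     # 2/6
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.3333
```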

```python
from sklearn.metrics import f1_score, confusion_matrix

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Note: sklearn puts the *true* labels on the rows and the predictions
# on the columns, i.e. the transpose of the convention used above
print(confusion_matrix(y_true, y_pred))
# [[2 0 0]
#  [1 0 1]
#  [0 2 0]]

print(f1_score(y_true, y_pred, average='macro'))    # 0.26666...

print(f1_score(y_true, y_pred, average='micro'))    # 0.33333...
```