From Confusion Matrix to F1 Score

This article is transcoded by 简悦 SimpRead, original at martinlwx.github.io

Each column represents the actual condition, each row represents our prediction, and combined, this forms a confusion matrix. For example, a binary classification task can create the following confusion matrix :down_arrow:

|                 | Positive            | Negative            |
| --------------- | ------------------- | ------------------- |
| Predicted True  | TP = True Positive  | FP = False Positive |
| Predicted False | FN = False Negative | TN = True Negative  |

How many samples did we predict correctly?

The cases where predictions are correct are as follows :down_arrow:

  • TP: It was positive, and you also predicted positive
  • TN: It was negative, and you also predicted negative

Then dividing TP + TN (which is actually the main diagonal) by the total number of samples gives the accuracy:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

:ledger: Or just read it off the table: the denominator is the sum of all four cells, TP + TN + FP + FN, and the numerator is the main diagonal, TP + TN
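The formula is a two-liner in Python; the four counts below are made up purely for illustration :down_arrow:

```python
# Hypothetical cell counts from a binary confusion matrix
tp, fp, fn, tn = 40, 5, 10, 45

# Main diagonal divided by the sum of all four cells
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```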

Generally, accuracy is a useful evaluation metric, but in some cases it isn't, such as when the classes are imbalanced.

|                   | Class A | Class B |
| ----------------- | ------- | ------- |
| Predicted class A | 0       | 0       |
| Predicted class B | 1       | 99      |

Suppose there is 1 sample of class A and 99 samples of class B; if you always return class B regardless of input, what is the accuracy?

The answer is (0 + 99) / (0 + 0 + 1 + 99) = 99%. Can we say this classifier is good? This is obviously absurd :joy:
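We can reproduce this degenerate case directly; the "classifier" here is just a constant answer :down_arrow:

```python
# 1 sample of class A and 99 samples of class B
y_true = ["A"] + ["B"] * 99
# A classifier that always answers "B", regardless of input
y_pred = ["B"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99
```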

Of all samples predicted as True, how many are truly positive?

By definition, this is the proportion of true positives in the “Predicted True” row of the table, which is TP / (TP + FP). This is the Precision.

Of all originally positive samples, how many did we successfully predict?

By definition, all originally positive samples are the sum of the “Positive” column, and the successfully predicted ones are TP, so it is TP / (TP + FN). This is the Recall.
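Both definitions read straight off the confusion matrix cells; again the counts below are made up for illustration :down_arrow:

```python
# Hypothetical binary confusion-matrix cells
tp, fp, fn = 40, 5, 10

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much we found
print(round(precision, 3))  # 0.889
print(recall)               # 0.8
```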

:ledger: We often need to trade off between Precision and Recall, because increasing one typically lowers the other

Precisely because of this trade-off between Precision and Recall, comparing classifiers can be troublesome, especially when the two metrics are close. So we need to combine these two metrics into a single metric, the F1 score.

The calculation of F1 score is as follows:

F1\ score = \frac{2PR}{P + R}

where P = Precision and R = Recall

  1. Macro: Calculate the P and R for each class separately, then calculate the overall average P and R, finally use these to calculate the F1 score
  2. Micro: Combine multiple class statistics into one table, then calculate P, R, and F1 score

This will be illustrated with examples below.

  1. The F1 score is always between Precision and Recall
  2. The F1 score gives more weight to the lower of the two values, so if two classifiers have the same arithmetic mean of P and R, the one with the larger gap between them gets the worse F1 score
    1. Case 1: If P and R are both 60, the F1 score = 60
    2. Case 2: If P and R are 50 and 70 respectively, their average is the same as case 1, but F1 score = 58.3
  3. :ledger: From the above, note that a high F1 score does not necessarily mean the classifier is better or more suitable for your task; sometimes you may care more about either Precision or Recall
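The two cases above can be checked with the formula directly :down_arrow:

```python
def f1(p, r):
    # F1 is the harmonic mean of Precision and Recall
    return 2 * p * r / (p + r)

print(f1(60, 60))            # 60.0 -> case 1
print(round(f1(50, 70), 1))  # 58.3 -> case 2: same arithmetic mean, lower F1
```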
Consider a 3-class task with 6 samples and the following confusion matrix (as before, columns are the actual classes and rows are our predictions) :down_arrow:

|                | class 0 | class 1 | class 2 |
| -------------- | ------- | ------- | ------- |
| predict_class0 | 2       | 1       | 0       |
| predict_class1 | 0       | 0       | 2       |
| predict_class2 | 0       | 1       | 0       |

For Macro, first get the Precision and Recall for each class by drawing a one-vs-rest table for each :down_arrow:

|                    | class 0 | Not class 0 |
| ------------------ | ------- | ----------- |
| predict_class0     | 2       | 1           |
| predict_not_class0 | 0       | 3           |

|                    | class 1 | Not class 1 |
| ------------------ | ------- | ----------- |
| predict_class1     | 0       | 2           |
| predict_not_class1 | 2       | 2           |

|                    | class 2 | Not class 2 |
| ------------------ | ------- | ----------- |
| predict_class2     | 0       | 1           |
| predict_not_class2 | 2       | 3           |

The results are:

  • class 0: Precision = 2/3; Recall = 1
  • class 1: Precision = 0; Recall = 0
  • class 2: Precision = 0; Recall = 0
  • So the average Precision is (2/3 + 0 + 0)/3 = 2/9 ≈ 0.222 and the average Recall is (1 + 0 + 0)/3 = 1/3; plugging these into the F1 score formula gives 2 / (1/P + 1/R) = 2 / (9/2 + 3) = 4/15 ≈ 0.2667
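A quick sanity check in plain Python, using the per-class values from the one-vs-rest tables above. One caveat worth flagging: sklearn's `average='macro'` actually averages the per-class F1 scores rather than averaging P and R first; for this particular example the two definitions happen to coincide at 4/15 :down_arrow:

```python
# Per-class Precision and Recall from the three one-vs-rest tables
precisions = [2/3, 0, 0]
recalls = [1, 0, 0]

def f1(p, r):
    # define F1 as 0 when both P and R are 0, to avoid dividing by zero
    return 2 * p * r / (p + r) if p + r else 0.0

# Method described above: average P and R first, then combine
p_avg = sum(precisions) / 3  # 2/9
r_avg = sum(recalls) / 3     # 1/3
print(round(f1(p_avg, r_avg), 4))  # 0.2667

# sklearn's average='macro': average the per-class F1 scores (0.8, 0, 0)
print(round(sum(f1(p, r) for p, r in zip(precisions, recalls)) / 3, 4))  # 0.2667
```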

For Micro, stack the three tables (element-wise addition) to get the following table

|                    | class ? | Not class ? |
| ------------------ | ------- | ----------- |
| predict_class?     | 2       | 4           |
| predict_not_class? | 4       | 8           |

Using the formulas on the pooled table: Precision = Recall = 2 / (2 + 4) = 1/3, so the F1 score = 1/3 ≈ 0.3333
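The same micro computation, step by step :down_arrow:

```python
# Cells of the stacked (pooled) one-vs-rest table
tp, fp, fn, tn = 2, 4, 4, 8

precision = tp / (tp + fp)  # 2/6
recall = tp / (tp + fn)     # 2/6
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.3333
```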

```python
from sklearn.metrics import f1_score, confusion_matrix

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

# Note: sklearn puts the *true* labels on the rows and the predictions
# on the columns, i.e. the transpose of the convention used above
print(confusion_matrix(y_true, y_pred))
# [[2 0 0]
#  [1 0 1]
#  [0 2 0]]

print(f1_score(y_true, y_pred, average='macro'))    # 0.26666...

print(f1_score(y_true, y_pred, average='micro'))    # 0.33333...
```