KL散度KL divergence

doggie · November 25, 2025, 10:26am

Prerequisite Knowledge

High school math: Logarithm: High School Math Basics — Logarithm - Tools and Programming / Mathematics - Beginner
Middle school math: Square roots

Main Content

KL Divergence (Kullback-Leibler divergence, KL divergence) is used to describe differences, such as the difference between model predictions and actual results in artificial intelligence.

Discrete probability formula

D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)

Continuous probability formula

D_{KL}(P \parallel Q) = \int P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx

Where Does KL Divergence Come From?

It is used to measure the distance between probability distributions.
For example, a fair coin where the probabilities of heads and tails are both 0.5.

But now you have a biased coin where the probability of heads is higher. How do you determine how large the bias is? You can directly compare the proportions of heads that the two coins produce.

files/Intuitively Understanding the KL Divergence 00-02-20.png

Now you have two biased coins; how do you compare how biased they are? Again, by comparing the proportions of heads they produce, you can determine the degree of bias.

Alternatively, we can calculate differently. To produce the same sequence, for example, this sequence

H（head）HHTHHTHHHTHT

The probability that coin 1 produces this sequence is p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1
Coin 2’s probability is q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_2

files/Intuitively Understanding the KL Divergence 00-02-55.png

Simplified, this is
Coin 1’s probability is p_1^{N_H} p_2^{N_T}
Coin 2’s is q_1^{N_H} q_2^{N_T}

Then calculating the ratio of the two probabilities gives the degree of difference.

files/Intuitively Understanding the KL Divergence 00-03-15.png

Next, we begin transforming the formula, first changing the form for easier conversion later.

\left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right) = \log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right)^{\frac{1}{N}}

Continue transforming

\log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right)^{\frac{1}{N}} = \frac{1}{N} \log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right)

Using the property of logarithms to expand the log. Here we use the power rule of logarithms log(a^b) = b \times log(a)

\frac{1}{N} \log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right) = \frac{1}{N} \log p_1^{N_H} + \frac{1}{N} \log p_2^{N_T} - \frac{1}{N} \log q_1^{N_H} - \frac{1}{N} \log q_2^{N_T}

Replace \frac{1}{N} with \frac{N_H}{N} and \frac{N_T}{N}

\frac{1}{N} \log p_1^{N_H} + \frac{1}{N} \log p_2^{N_T} - \frac{1}{N} \log q_1^{N_H} - \frac{1}{N} \log q_2^{N_T} = \frac{N_H}{N} \log p_1 + \frac{N_T}{N} \log p_2 - \frac{N_H}{N} \log q_1 - \frac{N_T}{N} \log q_2

Rearranging terms

\frac{N_H}{N} \log p_1 + \frac{N_T}{N} \log p_2 - \frac{N_H}{N} \log q_1 - \frac{N_T}{N} \log q_2 = \left(\frac{N_H}{N} \log p_1 - \frac{N_H}{N} \log q_1\right) + \left(\frac{N_T}{N} \log p_2 - \frac{N_T}{N} \log q_2\right)

Finally, this converts to

\left(\frac{N_H}{N} \log p_1 - \frac{N_H}{N} \log q_1\right) + \left(\frac{N_T}{N} \log p_2 - \frac{N_T}{N} \log q_2\right) = p_1 \log \frac{p_1}{q_1} + p_2 \log \frac{p_2}{q_2}

Isn’t this very similar to the discrete form of KL Divergence? The difference is that here there are two coins; for more cases we use \sum_{i}

D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)

Supplement

Discrete probability is used for discrete variables, such as blood pressure categories.
Continuous probability is used for continuous variables, such as blood pressure values.

References

This content is mainly from Intuitively Understanding the KL Divergence - YouTube

Topic		Replies	Views
K最近邻K-nearest neighbor，KNN 🛠工具与编程 knn	0	11	September 26, 2024
【转载】生存分析简明教程 🧠理论与科研医学统计学 , 生存分析 , kaplan-meier-生存曲线 , cox分析	0	99	August 28, 2024
【转载】什么是困惑度perplexity 🛠工具与编程 perplexity , 测评指标	0	62	November 25, 2025
交叉熵损失cross entropy loss 🛠工具与编程交叉熵 , 分类任务 , 损失函数 , 基本概念	1	22	November 25, 2025
大模型测评指标——困惑度 🛠工具与编程大模型 , 困惑度	0	25	April 29, 2026