KL散度KL divergence

Prerequisite Knowledge

High school math: Logarithm: High School Math Basics — Logarithm - :hammer_and_wrench: Tools and Programming / Mathematics - Beginner
Middle school math: Square roots

Main Content

KL Divergence (Kullback-Leibler divergence, KL divergence) is used to describe differences, such as the difference between model predictions and actual results in artificial intelligence.

Discrete probability formula

D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)

Continuous probability formula

D_{KL}(P \parallel Q) = \int P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx

Where Does KL Divergence Come From?

It is used to measure the distance between probability distributions.
For example, a fair coin where the probabilities of heads and tails are both 0.5.

But now you have a biased coin where the probability of heads is higher. How do you determine how large the bias is? You can directly compare the proportions of heads that the two coins produce.

files/Intuitively Understanding the KL Divergence 00-02-20.png

Now you have two biased coins; how do you compare how biased they are? Again, by comparing the proportions of heads they produce, you can determine the degree of bias.

Alternatively, we can calculate differently. To produce the same sequence, for example, this sequence

H(head)HHTHHTHHHTHT

The probability that coin 1 produces this sequence is p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1
Coin 2’s probability is q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_2

files/Intuitively Understanding the KL Divergence 00-02-55.png

Simplified, this is
Coin 1’s probability is p_1^{N_H} p_2^{N_T}
Coin 2’s is q_1^{N_H} q_2^{N_T}

Then calculating the ratio of the two probabilities gives the degree of difference.

files/Intuitively Understanding the KL Divergence 00-03-15.png

Next, we begin transforming the formula, first changing the form for easier conversion later.

\left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right) = \log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right)^{\frac{1}{N}}

Continue transforming

\log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right)^{\frac{1}{N}} = \frac{1}{N} \log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right)

Using the property of logarithms to expand the log. Here we use the power rule of logarithms log(a^b) = b \times log(a)

\frac{1}{N} \log \left(\frac{p_1^{N_H} p_2^{N_T}}{q_1^{N_H} q_2^{N_T}}\right) = \frac{1}{N} \log p_1^{N_H} + \frac{1}{N} \log p_2^{N_T} - \frac{1}{N} \log q_1^{N_H} - \frac{1}{N} \log q_2^{N_T}

Replace \frac{1}{N} with \frac{N_H}{N} and \frac{N_T}{N}

\frac{1}{N} \log p_1^{N_H} + \frac{1}{N} \log p_2^{N_T} - \frac{1}{N} \log q_1^{N_H} - \frac{1}{N} \log q_2^{N_T} = \frac{N_H}{N} \log p_1 + \frac{N_T}{N} \log p_2 - \frac{N_H}{N} \log q_1 - \frac{N_T}{N} \log q_2

Rearranging terms

\frac{N_H}{N} \log p_1 + \frac{N_T}{N} \log p_2 - \frac{N_H}{N} \log q_1 - \frac{N_T}{N} \log q_2 = \left(\frac{N_H}{N} \log p_1 - \frac{N_H}{N} \log q_1\right) + \left(\frac{N_T}{N} \log p_2 - \frac{N_T}{N} \log q_2\right)

Finally, this converts to

\left(\frac{N_H}{N} \log p_1 - \frac{N_H}{N} \log q_1\right) + \left(\frac{N_T}{N} \log p_2 - \frac{N_T}{N} \log q_2\right) = p_1 \log \frac{p_1}{q_1} + p_2 \log \frac{p_2}{q_2}

Isn’t this very similar to the discrete form of KL Divergence? The difference is that here there are two coins; for more cases we use \sum_{i}

D_{KL}(P \parallel Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)

Supplement

Discrete probability is used for discrete variables, such as blood pressure categories.
Continuous probability is used for continuous variables, such as blood pressure values.

References

This content is mainly from Intuitively Understanding the KL Divergence - YouTube