Prerequisite Knowledge
High school math: Logarithm: High School Math Basics — Logarithm -
Tools and Programming / Mathematics - Beginner
Middle school math: Square roots
Main Content
KL Divergence (Kullback-Leibler divergence, KL divergence) is used to describe differences, such as the difference between model predictions and actual results in artificial intelligence.
Discrete probability formula
Continuous probability formula
Where Does KL Divergence Come From?
It is used to measure the distance between probability distributions.
For example, a fair coin where the probabilities of heads and tails are both 0.5.
But now you have a biased coin where the probability of heads is higher. How do you determine how large the bias is? You can directly compare the proportions of heads that the two coins produce.

Now you have two biased coins; how do you compare how biased they are? Again, by comparing the proportions of heads they produce, you can determine the degree of bias.
Alternatively, we can calculate differently. To produce the same sequence, for example, this sequence
H(head)HHTHHTHHHTHT
The probability that coin 1 produces this sequence is p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1 \cdot p_1 \cdot p_2 \cdot p_1 \cdot p_1
Coin 2’s probability is q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_1 \cdot q_1 \cdot q_2 \cdot q_1 \cdot q_2

Simplified, this is
Coin 1’s probability is p_1^{N_H} p_2^{N_T}
Coin 2’s is q_1^{N_H} q_2^{N_T}
Then calculating the ratio of the two probabilities gives the degree of difference.

Next, we begin transforming the formula, first changing the form for easier conversion later.
Continue transforming
Using the property of logarithms to expand the log. Here we use the power rule of logarithms log(a^b) = b \times log(a)
Replace \frac{1}{N} with \frac{N_H}{N} and \frac{N_T}{N}
Rearranging terms
Finally, this converts to
Isn’t this very similar to the discrete form of KL Divergence? The difference is that here there are two coins; for more cases we use \sum_{i}
Supplement
Discrete probability is used for discrete variables, such as blood pressure categories.
Continuous probability is used for continuous variables, such as blood pressure values.
References
This content is mainly from Intuitively Understanding the KL Divergence - YouTube