July 15, 2025
Welcome back to my blog! Today, we're diving into KL divergence (Kullback-Leibler divergence), a fascinating concept used to measure how much two probability distributions differ. Instead of jumping straight into formulas, we'll build our understanding from the ground up, using an intuitive example. Let's make this fun and insightful!
Imagine we're looking at the subjects chosen by students in two different years at a school. We'll represent these choices as probability distributions to see how preferences shift over time. Say we have two graphs, one per year, each showing the share of students who chose each subject.
In 2024, students chose subjects as follows: English 50%, Spanish 40%, Japanese 10%.
Let's call this distribution Q(x).
In 2025, the preferences shifted: English 50%, Spanish 10%, Japanese 40%.
We'll call this distribution P(x). Notice that English stays the same, but Spanish and Japanese swap their proportions!
We want to quantify how different these two distributions (P(x) and Q(x)) are. You might think, "Let's compare the distributions by dividing their probabilities and summing the ratios."
For each subject, the ratio P(x)/Q(x) is: English 0.5/0.5 = 1, Spanish 0.1/0.4 = 0.25, Japanese 0.4/0.1 = 4.
But is this the correct way to measure the difference between two distributions? The graphs for both years show that Spanish and Japanese simply swapped in popularity, yet the raw ratios don't capture that symmetry: one is 0.25, the other is 4. There's no sense of balance in these numbers.
In fact, they're misleading. They suggest one subject changed drastically (Japanese), while the other barely did (Spanish), which isn't true. They both changed by the same amount, just in opposite directions.
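To make the imbalance concrete, here is a minimal Python sketch that computes the raw ratios. The dictionary names `q_2024` and `p_2025` are illustrative choices, and the probabilities are the ones listed in the table later in this post.

```python
# Subject shares per year (values taken from the table in this post).
q_2024 = {"English": 0.5, "Spanish": 0.4, "Japanese": 0.1}  # Q(x)
p_2025 = {"English": 0.5, "Spanish": 0.1, "Japanese": 0.4}  # P(x)

# Raw ratio P(x) / Q(x) for each subject.
ratios = {subject: p_2025[subject] / q_2024[subject] for subject in p_2025}

for subject, ratio in ratios.items():
    print(f"{subject}: P/Q = {ratio:.2f}")
```

Spanish and Japanese each moved by 30 percentage points, yet their ratios land at very different distances from 1 (0.25 versus 4).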
So what we want is a method that reflects that balance: something that treats one value going up the same way as the other going down.
What we need is a function f(x) that we apply to the ratio. It should return a positive value when something goes up and a negative value of the same size when something goes down by the same factor. For our two ratios, that means f(4) = w and f(0.25) = -w.
This way, we treat up and down changes equally but with opposite signs, which makes sense when values just swap (like Spanish and Japanese did).
This gives us two important points we can imagine plotting on a graph: (4, w) and (0.25, -w).
Similarly, we can consider more points to draw the graph. For a ratio of 1, the value and its reciprocal are the same, so f(1) = w and f(1) = -w must both hold.
The only possible value is w = 0, which gives the point (1, 0). Plotting these points gives a curve that passes through (1, 0), rises to the right of it, and falls to the left.
Even without diving into the math yet, this shape should feel familiar. We're essentially looking for a curve that reflects change around the point x=1, with an equal and opposite response on both sides. It turns out there's a function that behaves exactly this way: the logarithmic function.
If you recall from math, the graph of the logarithmic function has a shape that is very similar to what we need. Here is a resource in case you want to refresh your memory on Logarithmic Fn
It looks like log is the function we were looking for. The logarithmic function gives us exactly what we want: log(1) = 0, log(x) is positive for x > 1 and negative for 0 < x < 1, and log(1/x) = -log(x).
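As a quick sanity check, the sketch below tests whether a candidate function satisfies f(1) = 0 and f(1/r) = -f(r). The helper `is_antisymmetric` and the contrast candidate `offset_from_one` are hypothetical names introduced only for this illustration.

```python
import math


def is_antisymmetric(f, ratios=(2.0, 4.0, 10.0), tol=1e-12):
    """Return True if f(1) is 0 and f(1/r) equals -f(r) for every test ratio r."""
    if abs(f(1.0)) > tol:
        return False
    return all(abs(f(1.0 / r) + f(r)) <= tol for r in ratios)


def offset_from_one(r):
    """A simpler candidate, used only for contrast: the distance of r from 1."""
    return r - 1.0


print(is_antisymmetric(offset_from_one))  # False: gives 3 for 4 but -0.75 for 0.25
print(is_antisymmetric(math.log10))       # True
```

The check only samples a few ratios, so it illustrates the property rather than proving it.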
So if we put our values 4 and 0.25 into the expression, we get:
For 4: log10(4) ≈ 0.602
For 0.25: log10(0.25) ≈ -0.602
So the two values are equal in magnitude but opposite in sign. This is exactly what we want to reflect in subject preference shifts. So instead of just using P(x)/Q(x), we use log(P(x)/Q(x)).
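The same calculation can be run for every subject at once. This is a small sketch using base-10 logs to match the table later in this post; the dictionary names are illustrative assumptions.

```python
import math

# Subject shares per year (values taken from the table in this post).
q_2024 = {"English": 0.5, "Spanish": 0.4, "Japanese": 0.1}  # Q(x)
p_2025 = {"English": 0.5, "Spanish": 0.1, "Japanese": 0.4}  # P(x)

# Log ratio log10(P(x) / Q(x)) for each subject.
log_ratios = {s: math.log10(p_2025[s] / q_2024[s]) for s in p_2025}

for subject, value in log_ratios.items():
    print(f"{subject}: log10(P/Q) = {value:+.3f}")
```

English comes out at 0, while Spanish and Japanese get the same magnitude with opposite signs.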
Now that we've identified the logarithm as the right way to quantify these changes, we can start building a formula that measures the overall difference between the two distributions. We still need one more step: turning the log ratios into a weighted average.
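Why? Because not every subject is equally important in the new distribution. For example, in 2025 Japanese accounts for 40% of choices while Spanish accounts for only 10%, so the shift in Japanese should count for more.
So we should weigh these changes according to how likely each outcome is in the new distribution P(x).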
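One way to see why the weighting matters is to add up the log ratios without any weights. The sketch below does that for the subject data; the dictionary names are illustrative assumptions, and the probabilities are the ones from the table in this post.

```python
import math

# Subject shares per year (values taken from the table in this post).
q_2024 = {"English": 0.5, "Spanish": 0.4, "Japanese": 0.1}  # Q(x)
p_2025 = {"English": 0.5, "Spanish": 0.1, "Japanese": 0.4}  # P(x)

# Plain, unweighted sum of log10(P(x) / Q(x)) over all subjects.
unweighted_sum = sum(math.log10(p_2025[s] / q_2024[s]) for s in p_2025)

# The Spanish and Japanese terms cancel, so the sum is zero up to rounding.
print(abs(unweighted_sum) < 1e-12)  # True
```

An unweighted sum would report no change at all, even though two subjects clearly moved. Weighting each term by P(x) keeps that cancellation from hiding the shift.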
That gives us the full formula for Kullback-Leibler divergence:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
This might look a bit mathematical, but it's really just a sum over every x (each subject) of the log ratio log(P(x)/Q(x)), weighted by P(x).
Let's plug in our actual subject distributions.
Let's compute for the subjects:
| Subject | P(x) 2025 | Q(x) 2024 | P/Q | log10(P/Q) | P * log10(P/Q) |
|---|---|---|---|---|---|
| English | 0.5 | 0.5 | 1 | 0 | 0 |
| Spanish | 0.1 | 0.4 | 0.25 | -0.602 | 0.1 × -0.602 = -0.0602 |
| Japanese | 0.4 | 0.1 | 4 | 0.602 | 0.4 × 0.602 = 0.2408 |
Now summing these:
KL Divergence ≈ 0 + (-0.0602) + 0.2408 = 0.1806 (base-10)
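The whole calculation fits in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `kl_divergence`, its `base` parameter, and the choice to skip terms where P(x) is 0 are assumptions made for this example, and it expects Q(x) > 0 wherever P(x) > 0.

```python
import math


def kl_divergence(p, q, base=10.0):
    """Sum of P(x) * log(P(x) / Q(x)) over all outcomes x.

    Assumes p and q share the same keys and q[x] > 0 wherever p[x] > 0.
    Terms with p[x] == 0 are skipped (the usual 0 * log 0 = 0 convention).
    """
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        total += px * math.log(px / q[x], base)
    return total


# Subject shares per year (values taken from the table above).
q_2024 = {"English": 0.5, "Spanish": 0.4, "Japanese": 0.1}  # Q(x)
p_2025 = {"English": 0.5, "Spanish": 0.1, "Japanese": 0.4}  # P(x)

print(f"base 10: {kl_divergence(p_2025, q_2024):.4f}")                    # 0.1806
print(f"base e:  {kl_divergence(p_2025, q_2024, base=math.e):.4f}")      # 0.4159
print(f"base 2:  {kl_divergence(p_2025, q_2024, base=2.0):.4f}")         # 0.6000
```

Changing the log base only rescales the result, so the base-10 figure above, the natural-log figure (nats), and the base-2 figure (bits) all describe the same divergence in different units.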
This value tells us that the preferences of 2024 and 2025 diverge. A value of 0 would mean the two distributions are identical; here the whole divergence comes from the shifts in Spanish and Japanese, since English contributes 0.
KL divergence is more than just a math formula taught in books, and I hope this blog helped you understand the fundamentals behind it. It's a way to measure change, surprise, and uncertainty between two sets of beliefs or realities.
In our case, it helped us capture how student preferences for subjects evolved between 2024 and 2025. And even though raw ratios were misleading, the log-weighted divergence gave us a fair and powerful measure of difference. If you found this explanation helpful, please share it with your friends and colleagues! If you have any questions or suggestions for future topics, feel free to ping me on Twitter / X or LinkedIn.