<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Pablo Ibieta</title>
<link>https://pibieta.github.io/</link>
<atom:link href="https://pibieta.github.io/index.xml" rel="self" type="application/rss+xml"/>
<description>Notes on data science, machine learning, physics, information theory,
econometrics, and causal inference.
</description>
<generator>quarto-1.5.57</generator>
<lastBuildDate>Sun, 03 May 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Conditional mutual information is just log-loss gain</title>
  <dc:creator>Pablo Ibieta</dc:creator>
  <link>https://pibieta.github.io/posts/cmi-is-log-loss-gain/</link>
  <description><![CDATA[ 




<p>A <a href="../mutual-information-vs-correlation/">previous post</a> used mutual information to detect dependence that correlation couldn’t see. The natural next question is conditional: <em>given</em> that I already know <img src="https://latex.codecogs.com/png.latex?Z">, does <img src="https://latex.codecogs.com/png.latex?S"> carry any further information about <img src="https://latex.codecogs.com/png.latex?Y">? In ML terms, this is <em>the</em> feature- or score-evaluation question — does adding this thing to a model that already uses everything else move the needle?</p>
<p>The right object is <strong>conditional mutual information</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AI(S;%20Y%20%5Cmid%20Z)%20%5C;=%5C;%20H(Y%20%5Cmid%20Z)%20-%20H(Y%20%5Cmid%20S,%20Z),%0A"></p>
<p>i.e.&nbsp;the reduction in residual uncertainty about <img src="https://latex.codecogs.com/png.latex?Y"> once we observe <img src="https://latex.codecogs.com/png.latex?S">, having already conditioned on <img src="https://latex.codecogs.com/png.latex?Z"> <span class="citation" data-cites="cover2006">(Cover &amp; Thomas, 2006)</span>. It is non-negative, zero iff <img src="https://latex.codecogs.com/png.latex?S%20%5Cperp%20Y%20%5Cmid%20Z">, and the magnitude measures <em>how much</em> that conditional independence fails. It’s also, in spite of how it looks, almost free to estimate — which is the point of this post.</p>
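<p>The "zero iff" claim is easy to check by brute force on a toy discrete joint (illustrative only, not part of the estimator developed below): build a table that satisfies the conditional independence by construction and the CMI vanishes; break it and the CMI turns strictly positive.</p>

```python
import numpy as np

def cmi(p):
    """I(S; Y | Z) in nats from a joint probability table p[s, y, z]."""
    p = p / p.sum()
    p_z = p.sum(axis=(0, 1))          # p(z)
    p_sz = p.sum(axis=1)              # p(s, z)
    p_yz = p.sum(axis=0)              # p(y, z)
    ratio = p * p_z[None, None, :] / (p_sz[:, None, :] * p_yz[None, :, :])
    return (p * np.log(ratio)).sum()

rng = np.random.default_rng(0)

# Conditionally independent construction: p(s, y | z) = p(s | z) p(y | z).
p_z = np.array([0.5, 0.5])
p_s_g_z = rng.dirichlet(np.ones(2), size=2).T     # p[s, z], columns sum to 1
p_y_g_z = rng.dirichlet(np.ones(2), size=2).T     # p[y, z]
p_ci = p_s_g_z[:, None, :] * p_y_g_z[None, :, :] * p_z
print(cmi(p_ci))        # 0 up to floating point

# A generic joint table breaks the independence: the CMI is strictly positive.
p_dep = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)
print(cmi(p_dep))       # > 0
```

The <code>cmi</code> helper here is just the definition written as a sum over the table; nothing about it is specific to binary variables.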
<section id="the-identity" class="level2">
<h2 class="anchored" data-anchor-id="the-identity">The identity</h2>
<p>When the output variable <img src="https://latex.codecogs.com/png.latex?Y"> is binary, conditional entropy has a particularly clean form. Writing <img src="https://latex.codecogs.com/png.latex?H_b(p)%20=%20-p%20%5Clog%20p%20-%20(1-p)%20%5Clog(1-p)"> for the entropy of a Bernoulli<img src="https://latex.codecogs.com/png.latex?(p)">,</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AH(Y%20%5Cmid%20X)%20=%20%5Cmathbb%7BE%7D_X%5C!%5Cleft%5BH_b%5C!%5Cleft(p(Y=1%20%5Cmid%20X)%5Cright)%5Cright%5D.%0A"></p>
<p>That is: average <img src="https://latex.codecogs.com/png.latex?H_b"> over the marginal distribution of <img src="https://latex.codecogs.com/png.latex?X">, evaluated at the conditional probability the data-generating process actually assigns at each <img src="https://latex.codecogs.com/png.latex?X">. Conditional entropy in the binary target setup is just the <em>expected coin-flip uncertainty</em>, where the bias of the coin depends on the example.</p>
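<p>A quick numeric sanity check of that statement, on a toy discrete <code>X</code> unrelated to the simulation below: computing conditional entropy as the expected coin-flip uncertainty agrees with the chain-rule definition <code>H(Y | X) = H(X, Y) - H(X)</code>.</p>

```python
import numpy as np

def H_b(p):
    """Binary (Bernoulli) entropy in nats."""
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

# Toy discrete X with two values and a per-value coin bias for Y.
p_x = np.array([0.4, 0.6])               # marginal of X
p_y1_g_x = np.array([0.2, 0.7])          # p(Y=1 | X=x)

# H(Y | X) as the expected coin-flip entropy, one H_b per value of X ...
H_cond = (p_x * H_b(p_y1_g_x)).sum()

# ... agrees with the chain rule H(Y | X) = H(X, Y) - H(X).
p_joint = p_x[:, None] * np.column_stack([1 - p_y1_g_x, p_y1_g_x])
H_xy = -(p_joint * np.log(p_joint)).sum()
H_x = -(p_x * np.log(p_x)).sum()
print(H_cond, H_xy - H_x)                # identical
```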
<p>The same <img src="https://latex.codecogs.com/png.latex?H_b"> shows up from the model training side. With the per-example log-loss</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cell(y,%20p)%20=%20-y%20%5Clog%20p%20-%20(1-y)%20%5Clog(1-p),%0A"></p>
<p>a one-line check shows that when a model outputs the <em>true</em> conditional probability <img src="https://latex.codecogs.com/png.latex?p(Y=1%20%5Cmid%20X)">, its expected loss equals exactly the binary entropy of that probability:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cmathbb%7BE%7D%5C!%5Cleft%5B%5Cell(Y,%20p(Y=1%20%5Cmid%20X))%20%5Cmid%20X%5Cright%5D%0A%5C;=%5C;%0AH_b%5C!%5Cleft(p(Y=1%20%5Cmid%20X)%5Cright).%0A"></p>
<p>This isn’t a coincidence — cross-entropy is a <em>proper</em> scoring rule, so its expected value is minimized exactly at the truth, and the minimum equals the entropy of the truth. Anything above this floor is a misspecification penalty.</p>
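<p>The properness claim can be seen numerically. This sketch (illustrative values, not from the post) scans reported probabilities <code>q</code> against a fixed truth <code>p_true = 0.3</code>: the expected log-loss bottoms out at the truth, and the value at the bottom is the binary entropy.</p>

```python
import numpy as np

p_true = 0.3                              # the true P(Y = 1 | X = x)
q = np.linspace(0.01, 0.99, 981)          # candidate reported probabilities

# Expected log-loss when Y ~ Bernoulli(p_true) and the model reports q.
exp_loss = -p_true * np.log(q) - (1 - p_true) * np.log(1 - q)

q_star = q[np.argmin(exp_loss)]           # minimizer: the truth itself
H_b = -p_true * np.log(p_true) - (1 - p_true) * np.log(1 - p_true)
print(q_star)                             # ~0.3
print(exp_loss.min() - H_b)               # ~0: the floor is the binary entropy
```

Anything above <code>H_b</code> in this scan is exactly the misspecification penalty mentioned above, incurred by reporting a probability other than the truth.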
<p>Plug both into the entropy-reduction definition and the result is a clean identity:</p>
<p><span id="eq-loglossgain"><img src="https://latex.codecogs.com/png.latex?%0AI(S;%20Y%20%5Cmid%20Z)%20%5C;=%5C;%20%5Cmathbb%7BE%7D%5C!%5Cleft%5B%5C,%0A%20%20%5Cell%5C!%5Cleft(Y,%20p(Y=1%20%5Cmid%20Z)%5Cright)%20-%20%5Cell%5C!%5Cleft(Y,%20p(Y=1%20%5Cmid%20S,%20Z)%5Cright)%0A%5C,%5Cright%5D.%0A%5Ctag%7B1%7D"></span></p>
<p>Read in English: <em>CMI is the expected drop in log-loss between a model that uses <img src="https://latex.codecogs.com/png.latex?Z"> alone and a model that uses <img src="https://latex.codecogs.com/png.latex?(S,%20Z)">, evaluated at the truth.</em> Every binary classifier you have ever trained already computes the right-hand side on its validation data. Two such classifiers and a subtraction give an information-theoretic estimate.</p>
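<p>Equation 1 can be verified exactly in a small discrete world, where both sides reduce to finite sums. The sketch below uses an illustrative three-by-three table (not the Gaussian setup of the demonstration): the left side comes from conditional entropies, the right side is the explicit expected log-loss drop at the true conditionals.</p>

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)

# Discrete world: Z and S each take 3 values, Y | Z, S is Bernoulli.
p_z = rng.dirichlet(np.ones(3))                  # p(z)
p_s_g_z = rng.dirichlet(np.ones(3), size=3)      # p(s | z), rows indexed by z
p_y1 = expit(rng.standard_normal((3, 3)))        # p(Y=1 | z, s), indexed [z, s]

def H_b(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

p_zs = p_z[:, None] * p_s_g_z                    # joint p(z, s)
p_y1_g_z = (p_s_g_z * p_y1).sum(axis=1)          # p(Y=1 | z), averaged over s | z

# Left side: I(S; Y | Z) = H(Y | Z) - H(Y | S, Z).
lhs = (p_z * H_b(p_y1_g_z)).sum() - (p_zs * H_b(p_y1)).sum()

# Right side: expected log-loss of the Z-only truth minus the full truth,
# taking the expectation over Y explicitly.
def xent(p, q):                                  # E_Y[l(Y, q)] for Y ~ Bern(p)
    return -p * np.log(q) - (1 - p) * np.log(1 - q)

rhs = (p_zs * (xent(p_y1, p_y1_g_z[:, None]) - xent(p_y1, p_y1))).sum()
print(lhs, rhs)                                  # equal
```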
</section>
<section id="the-catch-cross-fitting-in-one-paragraph" class="level2">
<h2 class="anchored" data-anchor-id="the-catch-cross-fitting-in-one-paragraph">The catch (cross-fitting in one paragraph)</h2>
<p>Equation&nbsp;1 holds at the true conditionals <img src="https://latex.codecogs.com/png.latex?p(Y=1%20%5Cmid%20Z)"> and <img src="https://latex.codecogs.com/png.latex?p(Y=1%20%5Cmid%20S,%20Z)">. We don’t have those — we have fitted models <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_0(Z)"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_1(S,%20Z)">. The trouble is what happens when we evaluate those fitted models’ log-losses on the same data we trained them on.</p>
<p>In-sample log-loss is biased downward. A fitted model has tuned its predictions to match the specific observations it saw, including their noise, so the loss on those observations is systematically lower than the loss the same model would incur on a fresh draw. That bias is not equal across the two models. The full model uses <img src="https://latex.codecogs.com/png.latex?(S,%20Z)"> and therefore has strictly more capacity to fit noise than the baseline that uses <img src="https://latex.codecogs.com/png.latex?Z"> alone, so its in-sample loss is <em>more</em> optimistic. The CMI estimator is the difference of those two losses, and subtracting two downward-biased quantities doesn’t cancel the bias — it preserves the asymmetry, inflating the estimate.</p>
<p>The cleanest way to see this: imagine <img src="https://latex.codecogs.com/png.latex?S"> carries zero information about <img src="https://latex.codecogs.com/png.latex?Y"> given <img src="https://latex.codecogs.com/png.latex?Z">, so the true CMI is exactly zero. The full model can still fit spurious correlations between <img src="https://latex.codecogs.com/png.latex?S"> and <img src="https://latex.codecogs.com/png.latex?Y"> in the training sample; the baseline can’t, because it never sees <img src="https://latex.codecogs.com/png.latex?S">. The naive in-sample estimator will report a positive value where the truth is zero. The bias points upward — toward more apparent CMI than there actually is.</p>
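<p>That upward bias is easy to reproduce. In this sketch (the same model class as the demonstration below, but with a pure-noise feature, so the true CMI is exactly zero by construction) the naive in-sample estimate still comes out positive.</p>

```python
import numpy as np
from scipy.special import expit
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 2000
Z = rng.standard_normal(n)
S = rng.standard_normal(n)              # independent noise: I(S; Y | Z) = 0
Y = rng.binomial(1, expit(Z))           # Y depends on Z only

base = GradientBoostingClassifier(max_depth=3, n_estimators=120, random_state=0)
full = GradientBoostingClassifier(max_depth=3, n_estimators=120, random_state=0)
base.fit(Z.reshape(-1, 1), Y)
full.fit(np.column_stack([Z, S]), Y)

# Naive estimate: evaluate both models on their own training data.
naive = (log_loss(Y, base.predict_proba(Z.reshape(-1, 1))[:, 1])
         - log_loss(Y, full.predict_proba(np.column_stack([Z, S]))[:, 1]))
print(naive)    # positive, even though the truth is zero
```

Swapping the in-sample evaluation for the cross-fitted procedure described next is what pulls this estimate back toward zero.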
<p>The fix is K-fold <strong>cross-fitting</strong>, which evaluates every loss on data the model hasn’t seen during training:</p>
<ol type="1">
<li>Partition the data into <img src="https://latex.codecogs.com/png.latex?K"> disjoint folds.</li>
<li>For each fold <img src="https://latex.codecogs.com/png.latex?k">: fit <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_0"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_1"> on the data <em>outside</em> fold <img src="https://latex.codecogs.com/png.latex?k">, then compute the per-example log-loss difference for every example <em>inside</em> fold <img src="https://latex.codecogs.com/png.latex?k">.</li>
<li>Average those per-example differences across all examples (equivalently, across folds).</li>
</ol>
<p>Same idea as Chernozhukov et al.’s debiased ML <span class="citation" data-cites="chernozhukov2018">(Chernozhukov et al., 2018)</span> — the auxiliaries are nuisances, and we want their contribution to the downstream estimator to come from out-of-sample predictions only.</p>
</section>
<section id="demonstration" class="level2">
<h2 class="anchored" data-anchor-id="demonstration">Demonstration</h2>
<p>The simulation has two jobs. First, give us a setup where the true <img src="https://latex.codecogs.com/png.latex?I(X_2;%20Y%20%5Cmid%20X_1)"> is computable to arbitrary precision, so there’s a reference curve the estimator can be checked against. Second, sweep a single parameter that moves the CMI smoothly from its maximum down to exactly zero — so we can see whether the cross-fitted estimator tracks that variation continuously, and whether it correctly hits zero in the limit. A jointly Gaussian setup with one correlation knob serves both jobs cleanly.</p>
<p>Concretely: take <img src="https://latex.codecogs.com/png.latex?X_1,%20%5Ceta%20%5Coverset%7B%5Ctext%7Biid%7D%7D%7B%5Csim%7D%20%5Cmathcal%7BN%7D(0,%201)">, set</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AX_2%20=%20%5Calpha%20X_1%20+%20%5Csqrt%7B1%20-%20%5Calpha%5E2%7D%5C,%5Ceta,%0A"></p>
<p>and generate <img src="https://latex.codecogs.com/png.latex?Y%20%5Csim%20%5Cmathrm%7BBernoulli%7D%5C!%5Cleft(%5Csigma(%5Cbeta_0%20+%20%5Cbeta_1%20X_1%20+%20%5Cbeta_2%20X_2)%5Cright)"> with <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cin%20%5B0,%201)">. Two facts about this construction matter for the test. The marginal of <img src="https://latex.codecogs.com/png.latex?X_2"> is a standard Gaussian for every value of <img src="https://latex.codecogs.com/png.latex?%5Calpha"> — only its dependence on <img src="https://latex.codecogs.com/png.latex?X_1"> changes. And the conditional <img src="https://latex.codecogs.com/png.latex?X_2%20%5Cmid%20X_1"> has variance <img src="https://latex.codecogs.com/png.latex?1%20-%20%5Calpha%5E2">, which collapses to zero as <img src="https://latex.codecogs.com/png.latex?%5Calpha%20%5Cto%201">. In that limit <img src="https://latex.codecogs.com/png.latex?X_2"> becomes a deterministic linear function of <img src="https://latex.codecogs.com/png.latex?X_1">, so anything <img src="https://latex.codecogs.com/png.latex?X_2"> tells us about <img src="https://latex.codecogs.com/png.latex?Y"> is already implicit in <img src="https://latex.codecogs.com/png.latex?X_1">, and <img src="https://latex.codecogs.com/png.latex?I(X_2;%20Y%20%5Cmid%20X_1)"> must equal zero. At the other end, <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200">, the predictors are independent and <img src="https://latex.codecogs.com/png.latex?X_2"> contributes its full conditional information. The estimator’s job is to trace that decay.</p>
<p>Computing the truth is direct. The full conditional <img src="https://latex.codecogs.com/png.latex?p(Y=1%20%5Cmid%20X_1,%20X_2)%20=%20%5Csigma(%5Cbeta_0%20+%20%5Cbeta_1%20X_1%20+%20%5Cbeta_2%20X_2)"> is closed-form, so <img src="https://latex.codecogs.com/png.latex?H(Y%20%5Cmid%20X_1,%20X_2)"> is one expectation. The marginal <img src="https://latex.codecogs.com/png.latex?p(Y=1%20%5Cmid%20X_1)%20=%20%5Cmathbb%7BE%7D_%7BX_2%20%5Cmid%20X_1%7D%5C!%5Cleft%5B%5Csigma(%5Ccdot)%5Cright%5D"> has no closed form, so we average <img src="https://latex.codecogs.com/png.latex?%5Csigma"> over many draws of <img src="https://latex.codecogs.com/png.latex?X_2%20%5Cmid%20X_1"> for each <img src="https://latex.codecogs.com/png.latex?X_1"> and plug into <img src="https://latex.codecogs.com/png.latex?H_b">. The empirical log-loss-gain estimator uses two gradient-boosted classifiers — one trained on <img src="https://latex.codecogs.com/png.latex?X_1"> alone, one on <img src="https://latex.codecogs.com/png.latex?(X_1,%20X_2)"> — with 5-fold cross-fitting. If equation&nbsp;1 is right and cross-fitting is doing its job, the cross-fitted curve should sit on top of the Monte-Carlo curve across the entire range of <img src="https://latex.codecogs.com/png.latex?%5Calpha">.</p>
<div id="setup" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.special <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> expit</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.ensemble <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> GradientBoostingClassifier</span>
<span id="cb1-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> KFold</span>
<span id="cb1-6"></span>
<span id="cb1-7">beta0, beta1, beta2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span></span>
<span id="cb1-8">LOSS <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> y, p: <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>y<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>np.log(p) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>y)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>np.log(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p)</span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> H_b(p, eps<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-12</span>):</span>
<span id="cb1-11">    p <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.clip(p, eps, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>eps)</span>
<span id="cb1-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>np.log(p) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>np.log(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>p)</span>
<span id="cb1-13"></span>
<span id="cb1-14"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> make_data(alpha, n, seed):</span>
<span id="cb1-15">    r  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.default_rng(seed)</span>
<span id="cb1-16">    X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r.standard_normal(n)</span>
<span id="cb1-17">    X2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> np.sqrt(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>r.standard_normal(n)</span>
<span id="cb1-18">    Y  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r.binomial(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, expit(beta0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X2))</span>
<span id="cb1-19">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> X1, X2, Y</span>
<span id="cb1-20"></span>
<span id="cb1-21"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> truth_cmi(alpha, n_outer<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">20_000</span>, n_inner<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span>, seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>):</span>
<span id="cb1-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Closed-form-up-to-MC ground truth I(X2; Y | X1)."""</span></span>
<span id="cb1-23">    r   <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.default_rng(seed)</span>
<span id="cb1-24">    X1  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r.standard_normal(n_outer)</span>
<span id="cb1-25">    X2  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> np.sqrt(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>r.standard_normal(n_outer)</span>
<span id="cb1-26">    H_full <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> H_b(expit(beta0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X2)).mean()</span>
<span id="cb1-27">    eta_in <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r.standard_normal((n_outer, n_inner))</span>
<span id="cb1-28">    X2_in  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X1[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> np.sqrt(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>eta_in</span>
<span id="cb1-29">    p_marg <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> expit(beta0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta1<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X1[:, <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> beta2<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>X2_in).mean(axis<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-30">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> H_b(p_marg).mean() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> H_full</span>
<span id="cb1-31"></span>
<span id="cb1-32"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> cross_fitted_cmi(X1, X2, Y, K<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>):</span>
<span id="cb1-33">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""Plug in @eq-loglossgain with K-fold cross-fitted auxiliaries."""</span></span>
<span id="cb1-34">    Xfull <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.column_stack([X1, X2])</span>
<span id="cb1-35">    fold_means <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-36">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> tr, te <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> KFold(K, shuffle<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>seed).split(X1):</span>
<span id="cb1-37">        m_r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GradientBoostingClassifier(max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span>,</span>
<span id="cb1-38">                                         random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>seed)</span>
<span id="cb1-39">        m_f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> GradientBoostingClassifier(max_depth<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>, n_estimators<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span>,</span>
<span id="cb1-40">                                         random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>seed)</span>
<span id="cb1-41">        m_r.fit(X1[tr].reshape(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), Y[tr])</span>
<span id="cb1-42">        m_f.fit(Xfull[tr],            Y[tr])</span>
<span id="cb1-43">        p_r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.clip(m_r.predict_proba(X1[te].reshape(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-12</span>)</span>
<span id="cb1-44">        p_f <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.clip(m_f.predict_proba(Xfull[te])[:,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>],            <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-12</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-12</span>)</span>
<span id="cb1-45">        fold_means.append((LOSS(Y[te], p_r) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> LOSS(Y[te], p_f)).mean())</span>
<span id="cb1-46">    fm <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(fold_means)</span>
<span id="cb1-47">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> fm.mean(), fm.std(ddof<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> np.sqrt(K)</span>
<span id="cb1-48"></span>
<span id="cb1-49">alphas <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb1-50">truth  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array([truth_cmi(a) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> alphas])</span>
<span id="cb1-51">emp, se <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>[cross_fitted_cmi(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>make_data(a, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8000</span>, seed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>a)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>))</span>
<span id="cb1-52">                <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> a <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> alphas])</span>
<span id="cb1-53">emp, se <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.array(emp), np.array(se)</span></code></pre></div>
</div>
<div id="cell-fig-cmi-truth-vs-estimate" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.6</span>), constrained_layout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb2-2">ax.plot(alphas, truth, lw<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"truth (Monte-Carlo)"</span>)</span>
<span id="cb2-3">ax.errorbar(alphas, emp, yerr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>se, fmt<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"o"</span>, capsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb2-4">            label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cross-fitted log-loss gain"</span>)</span>
<span id="cb2-5">ax.set_xlabel(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"$\alpha$  (redundancy of $X_2$ given $X_1$)"</span>)</span>
<span id="cb2-6">ax.set_ylabel(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"$I(X_2; Y \mid X_1)$  (nats)"</span>)</span>
<span id="cb2-7">ax.legend(frameon<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb2-8">ax.grid(alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>)</span>
<span id="cb2-9">plt.show()</span></code></pre></div>
<div class="cell-output cell-output-display">
<div id="fig-cmi-truth-vs-estimate" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cmi-truth-vs-estimate-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://pibieta.github.io/posts/cmi-is-log-loss-gain/index_files/figure-html/fig-cmi-truth-vs-estimate-output-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cmi-truth-vs-estimate-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Closed-form <img src="https://latex.codecogs.com/png.latex?I(X_2;%20Y%20%5Cmid%20X_1)"> (line) vs.&nbsp;cross-fitted log-loss-gain estimate (points <img src="https://latex.codecogs.com/png.latex?%5Cpm"> fold-level standard error) as <img src="https://latex.codecogs.com/png.latex?X_2"> becomes redundant given <img src="https://latex.codecogs.com/png.latex?X_1">. Two GBDTs and a subtraction recover the information-theoretic curve across the entire range, including the collapse to zero in the fully redundant regime.
</figcaption>
</figure>
</div>
</div>
</div>
<p>Two things to notice in Figure&nbsp;1. The first is that the estimator works in the only way that matters: it agrees with the truth. At <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200">, a Monte-Carlo evaluation of the closed-form entropy gap gives <img src="https://latex.codecogs.com/png.latex?I(X_2;%20Y%20%5Cmid%20X_1)%20%5Capprox%200.084"> nats, and the cross-fitted log-loss-gain estimator returns the same value to three decimals. The agreement persists across the entire sweep — the cross-fitted points sit within a fold-level standard error of the Monte-Carlo curve everywhere, including the hardest regime, where the signal is small. It’s not the magnitude of <img src="https://latex.codecogs.com/png.latex?0.084"> that’s the evidence; it’s that two routes to it — one through the population entropy gap, one through finite-sample held-out log-losses — land at the same place.</p>
<p>The second thing is what the curve’s <em>shape</em> implies for metrics that don’t condition. As <img src="https://latex.codecogs.com/png.latex?%5Calpha"> grows, the marginal AUC of <img src="https://latex.codecogs.com/png.latex?X_2"> against <img src="https://latex.codecogs.com/png.latex?Y"> in this DGP actually <em>increases</em> (from <img src="https://latex.codecogs.com/png.latex?%5Capprox%200.71"> at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200"> to <img src="https://latex.codecogs.com/png.latex?%5Capprox%200.85"> at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200.95">), because <img src="https://latex.codecogs.com/png.latex?X_2"> inherits more of <img src="https://latex.codecogs.com/png.latex?X_1">’s predictive content as the two predictors merge. A marginal-AUC screen would therefore rate <img src="https://latex.codecogs.com/png.latex?X_2"> as <em>more</em> important at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200.95"> than at <img src="https://latex.codecogs.com/png.latex?%5Calpha%20=%200"> — the opposite verdict from CMI, which says <img src="https://latex.codecogs.com/png.latex?X_2">’s conditional contribution given <img src="https://latex.codecogs.com/png.latex?X_1"> has collapsed to zero. Same <img src="https://latex.codecogs.com/png.latex?X_2">, opposite directions. Standalone strength and complementary signal are independent axes, and the conditional form of mutual information is what separates them.</p>
</section>
<section id="why-this-matters" class="level2">
<h2 class="anchored" data-anchor-id="why-this-matters">Why this matters</h2>
<p>Once equation&nbsp;1 is in hand, several decisions that look unrelated turn out to be the same calculation with different conditioning sets. <em>Does this score generalize to a different target?</em> — pick <img src="https://latex.codecogs.com/png.latex?Z%20=%20Y%5E%7B(1)%7D"> and read <img src="https://latex.codecogs.com/png.latex?I(S%5E%7B(1)%7D;%20Y%5E%7B(2)%7D%20%5Cmid%20Y%5E%7B(1)%7D)">. <em>Should I ensemble two models?</em> — pick <img src="https://latex.codecogs.com/png.latex?Z%20=%20S%5E%7B(2)%7D"> and read <img src="https://latex.codecogs.com/png.latex?I(S%5E%7B(1)%7D;%20Y%5E%7B(2)%7D%20%5Cmid%20S%5E%7B(2)%7D)">. <em>Where in a representation’s interaction hierarchy does signal live?</em> — pick <img src="https://latex.codecogs.com/png.latex?Z%20=%20(R,%20%5CPhi_%7B%3Ck%7D)"> and read <img src="https://latex.codecogs.com/png.latex?%5CDelta_k%20=%20I(Y;%20%5CPhi_k%20%5Cmid%20R,%20%5CPhi_%7B%3Ck%7D)">. Each of these is a forthcoming post in this series; the machinery is identical.</p>
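<p>Concretely, the “should I ensemble?” reading is a few lines once a log-loss-gain helper exists. The sketch below uses a single train/test split and a logistic model to keep it short (in practice you would cross-fit over folds); the helper’s signature, the variable names, and the toy data-generating process are all illustrative, not from the post.</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def log_loss_gain(Z, S, y, seed=0):
    """Held-out log-loss gain (nats) of adding columns S to columns Z.

    Single-split simplification of the cross-fitted estimator.
    """
    ZS = np.column_stack([Z, S])
    Z_tr, Z_te, ZS_tr, ZS_te, y_tr, y_te = train_test_split(
        Z, ZS, y, test_size=0.5, random_state=seed)
    m0 = LogisticRegression().fit(Z_tr, y_tr)
    m1 = LogisticRegression().fit(ZS_tr, y_tr)
    return (log_loss(y_te, m0.predict_proba(Z_te))
            - log_loss(y_te, m1.predict_proba(ZS_te)))

# Toy stand-ins: s1, s2 are two model scores driven by a shared latent signal,
# y2 is the target of interest. s1 and s2 are largely redundant by construction.
rng = np.random.default_rng(1)
n = 5000
latent = rng.normal(size=n)
s1 = latent + rng.normal(scale=0.5, size=n)
s2 = latent + rng.normal(scale=0.5, size=n)
y2 = (latent + rng.normal(scale=1.0, size=n) > 0).astype(int)

# "Should I ensemble?" reads I(S1; Y2 | S2): small here, because s1 adds
# little on top of s2 — even though s1 is strongly predictive on its own.
ensemble_gain = log_loss_gain(s2.reshape(-1, 1), s1.reshape(-1, 1), y2)
```

<p>Swapping what plays <em>Z</em> is the entire difference between the three questions in the paragraph above; the helper never changes.</p>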
<div class="callout callout-style-simple callout-tip callout-titled" title="One-liner">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
One-liner
</div>
</div>
<div class="callout-body-container callout-body">
<p>Conditional mutual information <img src="https://latex.codecogs.com/png.latex?%5C;=%5C;"> expected log-loss reduction. Train two models, subtract, cross-fit. The information-theoretic accounting runs on a loss your training pipeline already computes.</p>
</div>
</div>
</section>
<section id="references" class="level2 unnumbered">
<h2 class="unnumbered anchored" data-anchor-id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" data-line-spacing="2">
<div id="ref-chernozhukov2018" class="csl-entry">
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., &amp; Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. <em>The Econometrics Journal</em>, <em>21</em>(1), C1–C68. <a href="https://doi.org/10.1111/ectj.12097">https://doi.org/10.1111/ectj.12097</a>
</div>
<div id="ref-cover2006" class="csl-entry">
Cover, T. M., &amp; Thomas, J. A. (2006). <em>Elements of information theory</em> (2nd ed.). Wiley-Interscience.
</div>
</div>



</section>

 ]]></description>
  <category>information-theory</category>
  <category>machine-learning</category>
  <category>simulation</category>
  <category>cmi-framework</category>
  <guid>https://pibieta.github.io/posts/cmi-is-log-loss-gain/</guid>
  <pubDate>Sun, 03 May 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Welcome</title>
  <dc:creator>Pablo Ibieta</dc:creator>
  <link>https://pibieta.github.io/posts/welcome/</link>
  <description><![CDATA[ 




<p>This is the first post, so it’s mostly housekeeping.</p>
<section id="what-i-want-to-write-about" class="level2">
<h2 class="anchored" data-anchor-id="what-i-want-to-write-about">What I want to write about</h2>
<p>The short list:</p>
<ul>
<li><strong>Information theory</strong> — entropy, mutual information, channels, and what they actually buy you in inference and ML.</li>
<li><strong>Causal inference and econometrics</strong> — identification, estimators, the gap between “I have a coefficient” and “I have a causal effect.”</li>
<li><strong>Machine learning</strong> — methods I’m trying to understand well enough to explain, plus the occasional grumble about benchmarks.</li>
<li><strong>Physics</strong> — mostly statistical mechanics and the bits that overlap with information theory and probability.</li>
</ul>
</section>
<section id="how-posts-are-built" class="level2">
<h2 class="anchored" data-anchor-id="how-posts-are-built">How posts are built</h2>
<p>Every post is a <code>.qmd</code> file — markdown plus executable code chunks. Math is written in LaTeX and rendered with KaTeX. Citations come from a single <code>references.bib</code> and are formatted in APA. The rendered HTML is built by GitHub Actions and served from GitHub Pages.</p>
<p>If a post claims something computational, the simulation that backs it up is in the same file and you can see the source via the <em>&lt;/&gt; Code</em> button in the top right.</p>
</section>
<section id="comments-corrections" class="level2">
<h2 class="anchored" data-anchor-id="comments-corrections">Comments / corrections</h2>
<p>There aren’t comments yet (and I’m not sure I want them). For now, if you spot a mistake, open an <a href="https://github.com/pibieta/pibieta.github.io/issues">issue on the repo</a>.</p>



</section>

 ]]></description>
  <category>meta</category>
  <guid>https://pibieta.github.io/posts/welcome/</guid>
  <pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Mutual information sees what correlation can’t</title>
  <dc:creator>Pablo Ibieta</dc:creator>
  <link>https://pibieta.github.io/posts/mutual-information-vs-correlation/</link>
  <description><![CDATA[ 




<p>Pearson correlation is so familiar it’s easy to forget it only sees <em>linear</em> dependence. Two variables can be deterministically related and have correlation exactly zero. Mutual information doesn’t have that blind spot, and looking at why is a clean way into information theory.</p>
<section id="definitions" class="level2">
<h2 class="anchored" data-anchor-id="definitions">Definitions</h2>
<p>For continuous random variables <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> with joint density <img src="https://latex.codecogs.com/png.latex?p(x,%20y)"> and marginals <img src="https://latex.codecogs.com/png.latex?p(x),%20p(y)">, the <strong>mutual information</strong> is</p>
<p><span id="eq-mi"><img src="https://latex.codecogs.com/png.latex?%0AI(X;%20Y)%20%5C;=%5C;%20%5Ciint%20p(x,%20y)%20%5C,%20%5Clog%20%5Cfrac%7Bp(x,%20y)%7D%7Bp(x)%5C,%20p(y)%7D%20%5C,%20dx%5C,%20dy.%0A%5Ctag%7B1%7D"></span></p>
<p>Equivalently, <img src="https://latex.codecogs.com/png.latex?I(X;%20Y)%20=%20H(X)%20+%20H(Y)%20-%20H(X,%20Y)">, where <img src="https://latex.codecogs.com/png.latex?H"> is differential entropy. Two facts make this the natural quantity to reach for:</p>
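<p>A quick way to build trust in any estimator of equation&nbsp;1 is the bivariate-normal case, where the integral collapses to the closed form <img src="https://latex.codecogs.com/png.latex?I(X;%20Y)%20=%20-%5Ctfrac%7B1%7D%7B2%7D%5Clog(1%20-%20%5Crho%5E2)">. A minimal sanity check (sample size and seed are arbitrary choices, not from the post):</p>

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Bivariate normal with correlation rho_true: closed form is -0.5*ln(1 - rho^2).
rng = np.random.default_rng(0)
rho_true = 0.8
n = 5000
x = rng.normal(size=n)
y = rho_true * x + np.sqrt(1 - rho_true**2) * rng.normal(size=n)

closed_form = -0.5 * np.log(1 - rho_true**2)   # about 0.51 nats
knn_estimate = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
```

<p>The kNN estimate should land close to the closed form; if it doesn’t, something is wrong with the estimator settings, not the theory.</p>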
<div class="callout callout-style-simple callout-note callout-titled" title="Properties of mutual information">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Properties of mutual information
</div>
</div>
<div class="callout-body-container callout-body">
<ol type="1">
<li><strong>Non-negativity.</strong> <img src="https://latex.codecogs.com/png.latex?I(X;%20Y)%20%5Cgeq%200">, with equality iff <img src="https://latex.codecogs.com/png.latex?X%20%5Cperp%20Y">.</li>
<li><strong>Invariance.</strong> <img src="https://latex.codecogs.com/png.latex?I"> is invariant under any invertible transformation of <img src="https://latex.codecogs.com/png.latex?X"> or <img src="https://latex.codecogs.com/png.latex?Y"> separately. Pearson’s <img src="https://latex.codecogs.com/png.latex?%5Crho"> is not.</li>
</ol>
</div>
</div>
<p>Property (2) is the punchline. If you stretch, log-transform, or reorder the support of <img src="https://latex.codecogs.com/png.latex?X"> — anything reversible — the dependence structure with <img src="https://latex.codecogs.com/png.latex?Y"> doesn’t actually change, and <img src="https://latex.codecogs.com/png.latex?I(X;%20Y)"> agrees. Pearson reports a different number every time.</p>
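<p>The invariance claim is cheap to check numerically. The sketch below pushes <em>X</em> through one invertible map (<code>exp</code>) and compares: Pearson’s <img src="https://latex.codecogs.com/png.latex?%5Crho"> moves, while the kNN mutual-information estimate barely does (the kNN estimator is only approximately invariant in finite samples). The data-generating choices are illustrative.</p>

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=3000)
y = x + rng.normal(scale=0.5, size=3000)

def mi(a, b):
    # kNN (Kraskov-style) mutual information estimate, in nats
    return mutual_info_regression(a.reshape(-1, 1), b, random_state=0)[0]

# Invertible transform of X: rho changes, the MI estimate stays close.
rho_raw, _ = pearsonr(x, y)
rho_exp, _ = pearsonr(np.exp(x), y)
mi_raw = mi(x, y)
mi_exp = mi(np.exp(x), y)
```

<p>The dependence between <code>exp(x)</code> and <code>y</code> is exactly the dependence between <code>x</code> and <code>y</code> — nothing about the joint structure changed — yet only the MI estimate reports that.</p>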
</section>
<section id="the-simulation" class="level2">
<h2 class="anchored" data-anchor-id="the-simulation">The simulation</h2>
<p>Let <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Cmathcal%7BU%7D(-1,%201)"> and <img src="https://latex.codecogs.com/png.latex?Y%20=%20X%5E2%20+%20%5Cvarepsilon">, with <img src="https://latex.codecogs.com/png.latex?%5Cvarepsilon%20%5Csim%20%5Cmathcal%7BN%7D(0,%20%5Csigma%5E2)">. The relationship is deterministic up to noise, but it’s symmetric around zero, so any linear summary of dependence cancels out. We’ll compute Pearson’s <img src="https://latex.codecogs.com/png.latex?%5Crho"> and a kNN estimator <span class="citation" data-cites="kraskov2004">(Kraskov et al., 2004)</span> of <img src="https://latex.codecogs.com/png.latex?I(X;%20Y)"> across noise levels.</p>
<div id="setup" class="cell" data-execution_count="1">
<div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy.stats <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pearsonr</span>
<span id="cb1-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.feature_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> mutual_info_regression</span>
<span id="cb1-5"></span>
<span id="cb1-6">rng <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.default_rng(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7</span>)</span>
<span id="cb1-7">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4_000</span></span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> sample(sigma):</span>
<span id="cb1-10">    x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rng.uniform(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n)</span>
<span id="cb1-11">    y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> x<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> rng.normal(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, sigma, size<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>n)</span>
<span id="cb1-12">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> x, y</span>
<span id="cb1-13"></span>
<span id="cb1-14">sigmas <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.linspace(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.8</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">9</span>)</span>
<span id="cb1-15">rho, mi <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], []</span>
<span id="cb1-16"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> s <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> sigmas:</span>
<span id="cb1-17">    x, y <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sample(s)</span>
<span id="cb1-18">    rho.append(pearsonr(x, y).statistic)</span>
<span id="cb1-19">    mi.append(</span>
<span id="cb1-20">        mutual_info_regression(x.reshape(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>), y, random_state<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb1-21">    )</span>
<span id="cb1-22">rho <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.asarray(rho)</span>
<span id="cb1-23">mi  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.asarray(mi)</span></code></pre></div>
</div>
<div id="cell-fig-mi-vs-rho" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">fig, ax <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> plt.subplots(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>, figsize<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">3.4</span>), constrained_layout<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb2-2"></span>
<span id="cb2-3">x0, y0 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sample(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>)</span>
<span id="cb2-4">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].scatter(x0, y0, s<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>, alpha<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>)</span>
<span id="cb2-5">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_title(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"$Y = X^2 + \varepsilon$, $\sigma = 0.05$"</span>)</span>
<span id="cb2-6">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_xlabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"X"</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>].set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Y"</span>)</span>
<span id="cb2-7"></span>
<span id="cb2-8">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].plot(sigmas, np.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">abs</span>(rho), marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"o"</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"$|\rho|$ (Pearson)"</span>)</span>
<span id="cb2-9">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].plot(sigmas, mi,         marker<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"s"</span>, label<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"$I(X;Y)$ (kNN, nats)"</span>)</span>
<span id="cb2-10">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_xlabel(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"noise $\sigma$"</span>)<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">;</span> ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].set_ylabel(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dependence"</span>)</span>
<span id="cb2-11">ax[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>].legend(frameon<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb2-12"></span>
<span id="cb2-13">plt.show()</span></code></pre></div>
<div class="cell-output cell-output-display">
<div id="fig-mi-vs-rho" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-mi-vs-rho-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://pibieta.github.io/posts/mutual-information-vs-correlation/index_files/figure-html/fig-mi-vs-rho-output-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-mi-vs-rho-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Pearson correlation stays near zero across all noise levels, while mutual information correctly identifies the strong nonlinear dependence at low <img src="https://latex.codecogs.com/png.latex?%5Csigma"> and decays as noise grows.
</figcaption>
</figure>
</div>
</div>
</div>
<p>Figure&nbsp;1 is the whole story. The left panel shows the parabolic relationship — about as obviously <em>dependent</em> as two variables get. On the right, <img src="https://latex.codecogs.com/png.latex?%7C%5Crho%7C"> hovers near zero across the entire noise range, while the kNN mutual-information estimator shows exactly the curve you’d hope for: large at low noise, decaying smoothly as <img src="https://latex.codecogs.com/png.latex?%5Csigma"> grows.</p>
</section>
<section id="why-this-matters-in-practice" class="level2">
<h2 class="anchored" data-anchor-id="why-this-matters-in-practice">Why this matters in practice</h2>
<p>Two takeaways I keep coming back to:</p>
<ol type="1">
<li><em>Feature screening with correlation can silently miss nonlinear predictors.</em> If your screening step is “drop features with <img src="https://latex.codecogs.com/png.latex?%7C%5Crho%7C%20%3C%20%5Ctau">,” you’ve thrown away the parabola. Mutual-information screening is the right default for nonlinear models.</li>
<li><em>Independence testing is not the same as decorrelation.</em> Useful reminder when you’re checking residuals or testing instrument exclusion: zero correlation does not buy you independence, but <img src="https://latex.codecogs.com/png.latex?I(X;%20Y)%20=%200"> does.</li>
</ol>
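<p>Takeaway (1) is easy to make concrete. In the sketch below, a <img src="https://latex.codecogs.com/png.latex?%7C%5Crho%7C"> threshold discards the parabolic feature while an MI threshold keeps it; the threshold values themselves are illustrative, not recommendations.</p>

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(-1, 1, size=(n, 2))
y = X[:, 0]**2 + 0.1 * rng.normal(size=n)   # only feature 0 matters, nonlinearly

# Correlation screen vs MI screen on the same two features.
rho = np.array([abs(pearsonr(X[:, j], y)[0]) for j in range(2)])
mi = mutual_info_regression(X, y, random_state=0)

keep_by_rho = rho > 0.1   # the parabola fails a |rho| threshold...
keep_by_mi = mi > 0.1     # ...but easily passes an MI threshold (nats)
```

<p>Feature 1 is pure noise and is rejected by both screens; the disagreement is entirely about the nonlinear feature.</p>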
<p>A sharper treatment of why <img src="https://latex.codecogs.com/png.latex?I"> is the <em>unique</em> measure satisfying a short list of natural axioms is in chapter 2 of <span class="citation" data-cites="cover2006">Cover &amp; Thomas (2006)</span>; for the estimator used above, see <span class="citation" data-cites="kraskov2004">Kraskov et al. (2004)</span>.</p>
<div class="callout callout-style-simple callout-tip callout-titled" title="One-liner">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
One-liner
</div>
</div>
<div class="callout-body-container callout-body">
<p><img src="https://latex.codecogs.com/png.latex?%5Crho"> measures <em>linear</em> alignment; <img src="https://latex.codecogs.com/png.latex?I"> measures <em>any</em> statistical dependence at all.</p>
</div>
</div>
</section>
<section id="references" class="level2 unnumbered">
<h2 class="unnumbered anchored" data-anchor-id="references">References</h2>
<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" data-line-spacing="2">
<div id="ref-cover2006" class="csl-entry">
Cover, T. M., &amp; Thomas, J. A. (2006). <em>Elements of information theory</em> (2nd ed.). Wiley-Interscience.
</div>
<div id="ref-kraskov2004" class="csl-entry">
Kraskov, A., Stögbauer, H., &amp; Grassberger, P. (2004). Estimating mutual information. <em>Physical Review E</em>, <em>69</em>(6), 066138. <a href="https://doi.org/10.1103/PhysRevE.69.066138">https://doi.org/10.1103/PhysRevE.69.066138</a>
</div>
</div>



</section>

 ]]></description>
  <category>information-theory</category>
  <category>statistics</category>
  <category>simulation</category>
  <guid>https://pibieta.github.io/posts/mutual-information-vs-correlation/</guid>
  <pubDate>Sat, 02 May 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
