Universal Inference: Review II

In Part I of this series, we covered some of the prerequisites for understanding the new paper Universal Inference by Wasserman et al. In particular, we reviewed the classical likelihood ratio test and worked some specific examples. In this post, we’ll look at the split likelihood ratio test which is the key idea behind universal inference and apply it to a Gaussian mixture example.

The Split Likelihood Ratio Test

Consider the random sample $\mathcal{D}= \left\{x_1,…,x_{2n} \right\}$ drawn from the distribution $f(x|\theta)$. Split $\mathcal{D}$ evenly into $\mathcal{D}_0$ and $\mathcal{D}_1$. Let $L_0$ be the likelihood function over the data $\mathcal{D_0}$, $$L_0(\theta|\mathcal{D}_0) = \prod_{i=1}^n f(x_i|\theta).$$

Define the Split Likelihood Ratio Statistic as: $$ U_n = \frac{L_0(\hat{\theta}_1)}{L_0(\hat{\theta}_0)}$$ where $\hat{\theta}_1$ is any estimator based on the data $\mathcal{D}_1$ and $\hat{\theta}_0$ is the MLE under $H_0$ using only the data in $\mathcal{D}_0$: $$\hat{\theta}_0 = \mathrm{arg\;max}_{\theta \in \Theta_0 } L_0(\theta).$$

A key result of universal inference is that for $\alpha$ the probability of a Type I error, when $H_0$ is true: $$p(U_n > \ \alpha) \leq \alpha.$$ That is, $U_n$ provably controls the probability of making a type I error. What’s more, the split likelihood ratio test is tractable in a far broader range of situations than the classical LRT and, where LRT it is computable, the split LRT vs the LRT does not seem to give up much statistical power.

Example: A Gaussian Mixture

Consider the simple two component Gaussian mixture: $$\psi(y; \mu) = \frac{1}{2}\phi(y; -\mu, 1) + \frac{1}{2}\phi(y; \mu, 1)$$ where $\phi(\cdot ; \mu, \sigma)$ is the normal density function with mean $\mu$ and standard deviation $\sigma$.

We want to test the hypothesis that $\mu>0$, i.e. that the true population distribution is a proper mixture of two components vs. $mu=0$ which corresponds to a standard normal distribution.

How can we test this hypothesis? Here are some facts collected in the Wasserman paper about this problem:

  • The classical likelihood ratio test does not have tractable limiting distribution.
  • There is an EM-based test found here for this hypothesis. By choosing appropriate penalty terms and other tuning parameters, this test does yield a tractable distribution. But the procedure is complicated and limited to 1D problems
  • The bootstrap test developed here originally is simple however there’s no proof that this test probably controls type I error

Universal inference on the other hand provides an approach that is simple and provably controls the type I error.

To construct the test, we need the following ingredients:

  • An estimator for the alternative hypothesis $\mu>0$. This can be any estimator. For our example, we’ll use an EM based estimation for a mixture with two components and common variance. We’ll fit this estimator on $\mathcal{D}_1$.
  • The maximum likelihood estimator under the null hypothesis. We’ll fit this estimator on $\mathcal{D}_0$.

To keep things as simple as possible, we’ll use an off-the-shelf implementation of the EM algorithm for fitting Gaussian Mixture Models. As in the Wasserman paper, we’ll just assume that these models converge to the MLE on simple problems but, though almost certainly true, it’s not an easy to prove fact. Here’s then a simple implementation of the universal inference test statistic:

from sklearn.mixture import GaussianMixture

def GMM(n_components, data):
    gmm = GaussianMixture(n_components=n_components, covariance_type='tied', tol=1e-5, max_iter=1000)
    return gmm

def split_lrt(data):
    D0 = data[0:len(data)//2]
    D1 = data[len(data)//2:]

    # likelihood over D0 under alternate model fitted to D1 -- numerator
    alt_model = GMM(2, D1)
    l01 = alt_model.score_samples(D0).sum()  # computes log-likelihood over D0
    # MLE likelihood over D0 under null model fitted to D0 -- denominator
    null_model = GMM(1, D0)
    l00 = null_model.score_samples(D0).sum() 

    return np.exp(l01 - l00)

To verify that this test behaves as we expect we can run several simulations. First, let’s check that the test statistic controls type I error for a fixed $\alpha$. For this, we need to show that when the null is true, the probability of rejecting the null hypothesis is less than $\alpha$.

import numpy as np
from scipy.stats import norm


alpha = 0.10
type_i_error = 0
n_trials = 10000
for i in range(n_trials):
    mu = 1
    x = sp.random.normal(size=200, loc=0.0, scale=1.0).reshape((-1,1))
    if split_lrt(x) > 1/alpha:

print(type_i_error / n_trials)

From this simulation, we estimated the probability of committing a type I error is approximately $0.0024$ which is less than the $\alpha$ value of $0.1$. We can verify similarly for other common $\alpha$ values.

Next we can verify that this test is reasonably good at discriminating between the null and alternative hypothesis. We do this by simulating the test’s power. For a fixed $\mu>0$, we need to estimate the probability that we reject the null hypothesis. For reference, we also compare the power of the universal test to the power of the bootstrap test. We explain briefly how the bootstrap test works after the code sample.

Power distribution for the universal test vs. bootstrap. Bootstrap test is slightly more powerful. Wasserman et al. point out that this is expected because the universal test uses only half the data as the bootstrap.
import matplotlib.pyplot as plt
power_u = []
power_b = []

def mixture_sample(n, mus, sigmas, weights=None):
    Sample from a mixture 
    k = len(mus)
    if k != len(sigmas) or (weights and len(weights)!=k):
        raise RuntimeError("mus and sigmas (and weights if provided) must have same length.")
    if weights is None:
        weights = [1.0/k for _ in range(k)]

    labels = sp.random.choice([i for i in range(k)], size=n, p=weights)
    y = [sp.random.normal(loc=mus[label], scale=sigmas[label]) for label in labels]
    return np.array(y).reshape((-1,1))

def compute_lr(x, gmm1, gmm2):
    ll1 = gmm1.score_samples(x).sum()
    ll2 = gmm2.score_samples(x).sum()
    return -2*(ll1 - ll2)

def lr_bootstrap(x, K, alpha):
    n = x.shape[0]
    j = min(int(np.ceil((1-alpha)*(K+1))), K)
    gmm1_base = GMM(1, x)
    gmm2 = GMM(2, x)
    lr = compute_lr(x, gmm1_base, gmm2)
    LRS_boot = []
    for i in range(K):
        B = gmm1_base.sample(n)[0].reshape((-1,1))
        # fit gmm's to the sample
        gmm1 = GMM(1, B)
        gmm2 = GMM(2, B)
        lrb = compute_lr(B, gmm1, gmm2)

    return lr, LRS_boot[j-1], sorted(LRS_boot)

#  Power simulation
mus = np.linspace(0, 3, 30)
alpha = 0.10
n_trials = 100

for mu in mus:
    reject_null_universal = 0
    reject_null_bootstrap = 0
    for i in range(n_trials):
        # draw a sample from the mixture
        x = mixture_sample(200, [-mu,mu], [1,1], [0.5, 0.5])
        b, crit, _ = lr_bootstrap(x, 19, alpha)
        if b > crit:
            reject_null_bootstrap += 1
        # splitlikelihood
        u = split_lrt(x)
        if u > 1/alpha:
            reject_null_universal += 1
# Plot power
plt.plot(mus, power_u, c='r', label='universal')
plt.plot(mus, power_b, c='b', label='bootstrap')
plt.legend(loc='upper left');

The bootstrap test for this problem is an elegant and simple procedure but must be used with some caution since there is no guarantee that it can control type I error in general. Following this paper, the procedure for the bootstrap is:

  • Compute the likelihood ratio $\lambda$ for the null vs. alternative hypothesis
  • Fit the null model to the original data $\mathcal{D}$. Using the values of this fit (e.g. $\mu = \hat{\mu}$) draw a sample of size $n = |\mathcal{D}|$ from the estimated distribution. Using this sample, compute a new likelihood ratio $\hat{\lambda}_i$
  • Repeat the above step $K$ times to get a bootstrapped sample $\{\hat{\lambda}_1, … \hat{\lambda}_K \}$ for the likelihood ratio under the null regime. Order this sample from smallest to largest.
  • The $j$-th largest value in the bootstrapped sample estimates the $j/(K+1)$-th quantile. So for $K=19$, $\hat{\lambda}_K$ estimates the $\alpha = 0.05$ critical value for the hypothesis test. When $\lambda > \hat{\lambda}_K$, we reject the null hypothesis.