Universal Inference: Review Part I

There’s an important new paper out by Larry Wasserman et al. that describes a very general technique, called Universal Inference, for constructing statistical hypothesis tests and confidence intervals. In the traditional theory of statistics, such as would be taught in an undergraduate mathematical statistics course, a standard way hypothesis tests are constructed and analyzed is through the Likelihood Ratio Statistic. These derivations require strict asymptotic regularity assumptions (e.g. that the estimator is asymptotically Gaussian distributed) and thus the approach is quite limiting.

What Wasserman et al. show is that by making an elegant modification to the likelihood ratio, the strict regularity assumptions can be discarded and we can still obtain powerful hypothesis tests and confidence intervals. In fact, all that is needed is the existence of a Maximum Liklihood Estimator (MLE).

I learned about this paper from this blog post via R-bloggers.

In this two part series:

  • We’ll review the classical theory of the likelihood ratio test and illustrate with some simple examples
  • We’ll summarize parts of Universal Inference and work some numerical examples

The Classical Theory: The Likelihood Ratio Test

Let’s briefly review the classical Likelihood Ratio Test. We’ll follow closely Chapter 8 of Casella and Berger but the treatment should be similar in any mathematical statistics text.

Suppose that we have collected the data $\vec{x} = (x_1,…,x_n)$ and we have a statistical model $f(x | \theta)$ parameterized by $\theta$. We would like to develop a statistic that allows us test a null hypothesis $$H_0: \theta \in \Theta_0$$ against an alternative $$H_a: \theta \in \Theta_0^c.$$ A traditional construction of a test statistic for this hypothesis begins with the likelihood ratio: $$\lambda(\vec{x}) = \frac{L(\hat{\theta}_0 | \vec{x}) } {L(\hat{\theta}| \vec{x})}$$ where

  • $L$ is the likelihood function $$L(\theta|\vec{x}) = \prod_{i=1}^n f(x_i|\theta)$$
  • $\hat{\theta}$ denotes the maximum likelihood estimator (MLE) of $\theta$, i.e. $$\hat{\theta} = \mathrm{arg\;max}_{\theta} L(\theta | \vec{x})$$
  • $\hat{\theta}_0$ denotes the MLE where the maximization is restricted to $\Theta_0$, i.e.$$\hat{\theta}_0 = \mathrm{arg\;max}_{\theta \in \Theta_0} L(\theta | \vec{x})$$

A test of the hypothesis in terms of $\lambda$ will then take the form:

Reject $H_0$ when $\lambda(x) \leq c$ where $0 \leq c \leq 1$.

The usefulness of statistical tests is often characterized by the probabilities of making Type I and Type II errors. The probability of making a Type I error (rejecting the null hypothesis when it’s true) is usually denoted by $\alpha$ and in applications we’ll often want to hold it a fixed value e.g. $\alpha=0.01$. The probability of committing a Type II error (not rejecting the null when the alternative is true) is usually denoted by $\beta$. For a given a true value $\theta$, the power of a test is $$\mathrm{Power}(\theta) = 1 – \beta(\theta)$$ which is the probability that we reject $H_0$ when $\theta \in \Theta_{0}^c$ is the true population parameter value.

Example: A Single Gaussian Distribution

This example is adapted from Example 8.2.1 of Casella and Berger. Let $\vec{x} = (x_1,…,x_n)$ be drawn from the normal distribution $N(\theta, 1)$. Suppose we want to test the null hypothesis that $\theta = 0$ against the alternative $\theta \neq 0$.

The MLE for $\theta$ is, not surprisingly, the sample mean $$\hat{\theta} = \bar{\vec{x}} = \frac{1}{n} \sum_{i=0}^n x_i$$ and $$L(\hat{\theta}|\vec{x}) =\frac{1}{(2 \pi)^{n/2}}\prod_{i=0}^n e^{-(x_i – \bar{x})^2/2} = \frac{1}{(2 \pi)^{n/2}} e^{-\sum_{i=0}^n (x_i – \bar{x})^2/2}$$

Since $0$ is the only element in $\Theta_0$, the maximum likelihood estimator over $\Theta_0$ is just $0$. So $$L(\hat{\theta}_0 | \vec{x}) = \frac{1}{(2 \pi)^{n/2}} e^{-\sum_{i=0}^n x_i^2/2}.$$ Then we simplify to obtain: $$\lambda(x) = e^{-n \bar{\vec{x}}^2/2}.$$ For fixed $0 \leq c \leq 1$, we reject $H_0$ when: $$|\bar{\vec{x}}| \leq \sqrt{-2 \mathrm{ln}(c)/n}.$$

Suppose we want to fix the probability of making a Type I error. In this case, we assume $H_0$ is true and choose $c$ so that $$P(\lambda(x) \leq c) = \alpha$$ But this is easy to do since when $H_0$ is true, $\sqrt{n} \bar{\vec{x}} \sim N(0,1)$. So by taking $$c = e^{-z_{\alpha/2}^2 / 2}$$ where $z_{\alpha/2}$ is the critical value for the 2-sided z-test at level $\alpha$.

For a given $\theta$, the power function of this test is $$\mathrm{Power}(\theta) = p( \lambda(x) \leq c) = p(|\bar{\vec{x}}| \geq c) = p(x \geq c) + p(-x \geq c).$$ Because $x \sim N(\theta, 1/\sqrt{n})$, we have $$z_1=\sqrt{n}( \bar{\vec{x}} – \theta) \sim N(0, 1)$$ and $$z_2 = -\sqrt{n}(\theta + \bar{\vec{x}}) \sim N(0,1).$$ We can then write the power function as: $$\mathrm{Power}(\theta) = p(z_1 \geq \sqrt{n}(c – \theta)) + p(z_2 \geq -\sqrt{n}(c + \theta)) = 2 – \Phi(\sqrt{n}(c – \theta)) – \Phi(\sqrt{n}(c + \theta))$$ where $\Phi$ is the cumulative density function of the standard normal.

We can plot the power function for this example:

import matplotlib.pyplot as plt
from scipy.stats import norm
from numpy import exp, sqrt, linspace

def power(n, c, theta):
    t1 = sqrt(n)*(c + theta)
    t2 = sqrt(n)*(c - theta)
    return 2 - norm.cdf(t1) - norm.cdf(t2)

# constants
n = 200
z_a = norm.ppf(0.01)
c = exp(-z_a**2/2)

x = linspace(-1, 1, 1000)
y = [power(n, c, theta) for theta in x ]

# plot
plt.rcParams['figure.dpi'] = 150
plt.plot(x, y, c='r', linewidth=4);
plt.title(r'Power of LRT with $\alpha=0.01$, $H_0: \theta =0$ and $H_a: \theta \neq 0$');

Example: A Gaussian Mixture

A Gaussian mixture distribution is a deceptively simple statistical model. We can construct one as follows:

  • Let $N(x | \mu_i, \sigma_i)$ for $i=1,…m$ be a normal distribution with mean $\mu_i$ and $\sigma_i$
  • Let $w_i$ be a set of weights such that $w_i > 0$ and $\sum_{i=0}^m w_i = 1$

Define: $$P(x | \vec{\mu}, \vec{\sigma}) = \sum_{i=0}^m w_i N(x | \mu_i, \sigma_i).$$

It’s easy verify that $P$ is a probability distribution and that it is typically multi-modal. We can also easily draw a random sample from $P$ by repeating the following algorithm:

  1. First choose a random component distribution by sampling the labels $i \in \{1,…m\}$ according to the weights $w_i$.
  2. From the selected distribution (which is just a regular Gaussian distribution), draw a single sample.
An example of a Gaussian mixture distribution
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp

def mixture_sample(n, mus, sigmas, weights=None):
    Sample from a mixture 
    k = len(mus)
    if k != len(sigmas) or (weights and len(weights)!=k):
        raise RuntimeError("mus and sigmas (and weights if provided) must have same length.")
    if weights is None:
        weights = [1.0/k for _ in range(k)]

    labels = sp.random.choice([i for i in range(k)], size=n, p=weights)

    y = [sp.random.normal(loc=mus[label], scale=sigmas[label]) for label in labels]
    return y

# draw a large sample from the mixture
y = mixture_sample(100000, [-2,2], [1,1])
plt.hist(y, bins=100, density=True, alpha=0.75);

# overlay with a plot of the density function for the mixture model
x = np.linspace(-8, 8, 10000)
yy = (0.5)*norm.pdf(x, loc=-2, scale=1) + (0.5)*norm.pdf(x, loc=2, scale=1)

# plot
plt.rcParams['figure.dpi'] = 150
plt.plot(x,yy, c='r', linewidth=4);
plt.title(r'Gaussian mixture model with $\vec{\mu} = (-2,2)$, $\vec{\sigma}=(1,1)$');

Let $$\psi(y; \mu) = \frac{1}{2}\phi(y; -\mu, 1) + \frac{1}{2}\phi(y; \mu, 1)$$ where $\phi(\cdot ; \mu, \sigma)$ is the normal density function with mean $\mu$ and standard deviation $\sigma$.

Suppose we observe $Y_1,…,Y_{2n} \sim P$ where $P$ is distribution. Consider the following statistical hypothesis test: $$H_0: P = \psi(\cdot; 0, 1) \qquad H_{a}: P = \psi(\cdot; \mu, 1).$$ This test corresponds to testing the hypothesis that our data was drawn from a mixture vs a standard normal distribution. How can we construct a test for this situation? Given the relative ease with which we constructed the test for the normal distribution in the previous example, it seems natural that we could appeal to the likelihood ratio test again.

Surprisingly, however, computing the likelihood ratio is not tractable in this situation. For example, computing the MLE in the denominator of the likelihood ratio is a highly non-trivial matter. And this is where the new technique of Universal Inference can come to save the day…