Confidence intervals: What they are (not)

15 minute read


Recently at my work, I was asked to assist in determining the correct formula for determining a confidence interval of an unknown quantity from a bunch of random samples. In particular, given random samples drawn from two different normal distributions, the task was to determine the confidence interval for the ratio of the means of those two distributions. My coworker (whose job it was to present this information) had found a formula somewhere online by googling, but wasn’t sure it was correct and asked me to check it.

It is my job as the resident mathematician to help with these sorts of things. However, it has been well over a decade since I had done any meaningful amount of statistical analysis! I couldn’t make heads or tails of the formula he found. Not only that, I couldn’t even recall exactly what a confidence interval was!

Basic notions

The first thing I did was to google “confidence interval” and find a few primers on statistical analysis. On the most basic level, the concept of a confidence interval comes from the following.

Consider a collection of independently and identically distributed random variables $X_1,\dots,X_n$, each distributed according to a Gaussian distribution with some unknown mean $\mu$. After observing the values of some random samples for these variables, the task is to determine a real interval $[L,U]$ that represents a possible range of values for the true value of the mean $\mu$. The width of the interval should depend on the observed values but also on how confidence you want to be that this interval really does contain the true mean. This interval is the confidence interval.

Problems with the basic notions

However, being the pure mathematician that I am, I wanted to delve deep and get a true understanding for what a confidence interval “means.” I was a bit concerned by the standard interpretation that rudimentary statistics bookes give. A post of Statology gives an example where one wants to estimate the mean weight of a population of turtles given that the weights of a bunch of turtles are measured. After applying some formulas to the random samples, they state that the confidence interval has the following interpretation:

There is a 90% chance that the confidence interval of [293.91, 306.09] contains the true population mean weight of turtles.

As a mathematician, I found this interpretation appalling! Two reasons:

  1. Firstly, it is implied that there a conditional probability of 90% that the mean $\mu$ lies in the interval [293.91, 306.09], conditioned on the event that we measured these specific values for the weights of a random sample of turtles. But this is a conditional probability that is conditioned on an event of probability zero! (The event that we measured exactly these weights occurs with probability zero.) You can’t condition on events with zero probability. How should we interpret conditioning on an event of probability zero in this case?

  2. The true value of the mean $\mu$ is assumed to be some fixed (but unknown) value. Thus, given an interval $[L,U]$, either $\mu$ is in that interval or it is not. For example, if the true mean of the distribution is 300 then there is a 100% probability that the mean is contained in the interval [293.9,306.09]. But if the true mean were 310, then the interval in question has a 0% probability of containing the mean. What could possibly be meant by saying that “there is a 90% chance” that the interval contains the mean?

I concluded that this interpretation from Statology is surely not the correct interpretation of a confidence interval. So I started doing a deep dive…

A more rigorous definition of confidence intervals

I started looking at a few basic statistics textbooks, but the interpretations in those books either had the same pitfalls as the one described above, or were too complicated for me to understand. The best deep dive always begins with Wikipedia, which turned out to be surprisingly useful.

Based on the definition provided in Wikipedia, I devised my own definition of a confidence interval.

Statistical model

First, we need a concept of a family of distributions. (A probability distribution is another term for a probability measure.)

Definition. A statistical model for a measure space $(\mathcal{X},\mathcal{E})$ is a family \(\{P_\theta\,:\,\theta\in\Theta\}\) of probability distributions $P_\theta$ on $(\mathcal{X},\mathcal{E})$ for each $\theta$ in some index set $\Theta$.

The most common example of a statistical model is the normal family \(\{\mathcal{N}_{\mu,\sigma}\,:\,(\mu,\sigma)\in\mathbb{R}\times(0,\infty]\},\) which is the family of normal distributions on $\mathbb{R}$, where for each choice of parameters $\theta$ and $\sigma$, the distribution is defined by \(\mathcal{N}_{\theta,\sigma}(A) =\frac{1}{\sigma\sqrt{2\pi}}\int_A\exp\left(\frac{(x-\theta)^2}{2\sigma^2}\right)\,\mathrm{d}x\) for every Borel subset $A\subset\mathbb{R}$.

Measurable set-valued functions

We next need a notion of measurable set-valued functions. For a set $\mathcal{Y}$, we let $\mathcal{P}(\mathcal{Y})$ denote the power set of $\mathcal{Y}$ (i.e., the set of all subsets).

Definition. Let $(\mathcal{X},\mathcal{E})$ be a measurable space and let $\mathcal{Y}$ be a nonempty subset. A set-valued mapping $f:\mathcal{X}\to\mathcal{P}(\mathcal{Y})$ is said to be $\mathcal{E}$-measurable if, for each choice of $y\in\mathcal{Y}$, the set \(\{x\in\mathcal{X}\,:\, y\in f(x)\}\subset \mathcal{X}\) is $\mathcal{E}$-measurable.

As an example, let $g_1:\mathcal{X}\to \mathbb{R}$ and $g_2:\mathcal{X}\to \mathbb{R}$ be Borel-measurable functions and define $f:\mathbb{R}\to\mathcal{P}(\mathbb{R})$ as \(f(x) = \{y\in\mathbb{R}:g_1(x)<y<g_2(x)\}\) for every $x\in\mathbb{R}$. (That is, for each choice of real number $x$, the set $f(x)$ is the open interval $(g_1(x),g_2(x))$). Because $g_1$ and $g_2$ are Borel-measureable, the sets \(g_1^{-1}\big((-\infty,y)\big) = \{x\in\mathcal{X}:g_1(x)<y \}\) and \(g_2^{-1}\big((y,+\infty)\big) = \{x\in\mathcal{X}:y<g_2(x) \}\) are measurable for each $y\in\mathbb{R}$, and thus so is the set \(g_1^{-1}\big((-\infty,y)\big)\cap g_2^{-1}\big((y,+\infty)\big)=\{x\in\mathcal{X}:y\in f(x)\}.\) If $X$ is an $(\mathcal{X},\mathcal{E})$-valued random variable, then $f(X)$ is a random interval. This is how one constructs confidence intervals for parameters of distributions.

Confidence sets

We can now define a general concept of a confidence set for a statistical model. Let ${P_\theta\,:\,\theta\in\Theta}$ be a statistical model for a measure space $(\mathcal{X},\mathcal{E})$. A mapping $g:\Theta\to\mathcal{Y}$ from the parameter set $\Theta$ to some non-empty set $\mathcal{Y}$ is an estimand for the parameter space. Namely, we are typically only interested in estimating certain properties of the parameter $\theta\in\Theta$ rather than the entire value of the parameter itself.

Definition. Let $g:\Theta\to\mathcal{Y}$ be an estimand of a statistical model ${P_\theta\,:\,\theta\in\Theta}$ for some measure space $(\mathcal{X},\mathcal{E})$. For a real value $\alpha\in[0,1]$, an $\alpha$-level confidence set for the estimand $g$ is a $\mathcal{E}$-measurable set-valued mapping $f:\mathcal{X}\to\mathcal{P}(\mathcal{Y})$ such that \(P_\theta\big(\{x\in\mathcal{X}\,:\, g(\theta)\in f(x)\}\big) \geq 1-\alpha\) holds for every choice of parameter $\theta\in\Theta$.

Because $f$ is $\mathcal{E}$-measurable, the sets ${x\in\mathcal{X}\,:\, y\in f(x)}$ are $\mathcal{E}$-measurable for each $\theta\in\Theta$ and thus these sets can be assigned a probability for each probability measure $P_\theta$.

Note that $P_\theta\big({x\in\mathcal{X}\,:\, g(\theta)\in f(x)}\big)$ represents the probability that the estimand $g(\theta)$ is contained within the estimating set $f(x)$. If $X$ is $(\mathcal{X},\mathcal{E})$-valued random variable then $f(X)$ is a random set and \(\Pr(g(\theta)\in f(X))=P_\theta\big(\{x\in\mathcal{X}\,:\, g(\theta)\in f(x)\}\big),\) supposing that $X$ is distributed according to $P_\theta$.

Typically onw desires an $\alpha$-level confidence set such that the probabilities are equal to $1-\alpha$ for every $\theta\in\Theta$. This is ho we will construct confidence intervals later on.

Confidence intervals

Let ${P_\theta\,:\,\theta\in\Theta}$ be a statistical model for some measurable space $(\mathcal{X},\mathcal{E})$ and let $g:\Theta\to\mathbb{R}$ be an estimand for the parameter space. Confidence sets for such real-valued estimands typically take the form of intervals (as shown above). A confidence interval $f:\mathcal{X}\to\mathcal{P}(\mathbb{R})$ for this estimand can be constructed by taking two Borel-measurable functions $g_1:\mathcal{X}\to \mathbb{R}$ and $g_2:\mathcal{X}\to \mathbb{R}$ satisfying $g_1(x)<g_2(x)$ for each $x\in\mathcal{X}$ and defining $f(x)$ to be the interval $(g_1(x),g_2(x))$ for each $x\in\mathcal{X}$.

Interpretation of confidence sets

Let us now consider the proper interpretation of a confidence set. Let ${P_\theta:\theta\in\Theta}$ be a statistical model of candidate probability distributions for some $(\mathcal{X},\mathcal{E})$-valued random variable $X$. That is, the true probability distribution of $X$ is given by $P_{\theta_0}$ for some unknown (but fixed) choice of parameter $\theta_0\in\Theta$. Let $g:\Theta\to\mathcal{Y}$ be an estimand such that we wish to estimate the true value of $g(\theta_0)$. For a choice of confidence level $\alpha\in[0,1]$, we choose a confidence set $f:\mathcal{X}\to\mathcal{P}(\mathcal{Y})$ such that $f(X)$ is a random set.

Perform now the following experiment. Randomly choose a value of $x\in\mathcal{X}$ according to the distribution $P_{\theta_0}$, and from this randomly chosen value construct the corresponding set $f(x)\subset\mathcal{Y}$. By construction, the probability that this experiment will produce a set containing $g(\theta_0)$ is at least $1-\alpha$. That is, if this experiment were performed repeatedly, the proportion of times that the constructed set will contain $g(\theta_0)$ is at least $1-\alpha$. This is the proper interpretation of a confidence set!

Note that, after performing the experiment, we obtain a single value of $x$ for $X$. Either the set $f(x)$ contains $g(\theta_0)$ or it does not. Recall the wrong interpretation of confidence sets that was mentioned at the start of this blog post. It is not correct to say that $f(x)$ contains $g(\theta_0)$ with probability $1-\alpha$. Indeed, the post-facto “probability” that the constructed $f(x)$ contains $g(\theta_0)$ at the end of the experiment is either 1 ot 0 (depending on whether the true value is contained in the set or not).

It is better to say that “the probability that this procedure constructed a set containing the true value of the estimand is at least $1-\alpha$.”

Confidence intervals for normally distributed random samples

We now go through the standard process of constructing confidence intervals for the mean of a normal distributed random variable from a random sample of independent and identical normally distributed random variables.

For the family of normal distributions${\mathcal{N}_{\mu,\sigma}:(\mu,\sigma)\in\mathbb{R}\times(0,\infty)}$ with parameter space $\Theta = \mathbb{R}\times(0,\infty)$, suppose we wish to estimate the mean. Our estimand is the real-valued function $g:\mathbb{R}\times(0,\infty)\to\mathbb{R}$ defined as $g(\mu,\sigma)=\mu$.

Let $X=(X_1,X_2,\dots,X_n)$ for some i.i.d. normally distributed random variables $X_1,X_2,\dots,X_n$, each distributed according to $\mathcal{N}_{\mu_0,\sigma_0}$ for some fixed (but unknown) mean $\mu_0$ and varialce $\sigma_0{}^2$. We wish to construct a confidence interval for $\mu_0$. Define the random variables \(\overline{X}=\frac{1}{n}\sum_{i=1}^nX_i\) and \(S = \sqrt{\frac{\sum_{i-1}^n(\overline{X}-X_i)^2}{n-1}},\) which are the sample mean and sample standard devition, respectively. Note that the expected values of these random variables are $\operatorname{E}(\overline{X})=\mu_0$ and $\operatorname{E}(S)=\sigma$. Moreover, it holds that $\overline{X}$ is normally distributed with mean $\mu_0$ and variance $\sigma_0/\sqrt{n}$, while $S$ is distributed according to the $\chi$-distribution with $n-1$ degrees of freedom. Define a further random variable \(T=\frac{\overline{X}-\mu_0}{S/\sqrt{n}},\) which is distributed according to Student’s $t$-distribution with $n-1$ degrees of freedom. That is, for every Borel subset $A\subset\mathbb{R}$, one has that \(\Pr(T\in A) = \int_A f_{n-1}(x)\,\mathrm{d}x\) where for each $k\in\mathbb{N}$, the probability density function $f_k:\mathbb{R}\to\mathbb{R}$ for Student’s $t$-distribution is defined by \(f_k(x) = \frac{\Gamma(\frac{k+1}{2})}{\sqrt{k\pi}\Gamma(\frac{k}{2})}\left(1+\frac{x^2}{k}\right)^{-\frac{k+1}{2}}\) for each $x\in\mathbb{R}$. Note that the corresponding cumulative density function $F_k:\mathbb{R}\to(0,1)$ defined by \(F_k(x)=\int_{-\infty}^x f_k(t)\,\mathrm{d}t\) for each $x\in\mathbb{R}$ is strictly increasing and is therefore invertible on the interval $(0,1)$.

For each $k\in\mathbb{N}$ and $\gamma\in\mathbb{R}$, define $\tau_{k,\gamma}=F_{k}^{-1}(\gamma)$ such that \(\begin{split}\Pr\left(\overline{X}-\frac{\tau_{n-1,\gamma}S}{\sqrt{n}}< \mu_0\right) & = \Pr\left(\frac{\overline{X}-\mu_0}{S/\sqrt{n}}\leq \tau_{n-1,\gamma}\right)\\ & = \Pr\left(T\leq \tau_{n-1,\gamma}\right)\\&= F_{n-1}(\tau_{n-1,\gamma})\\&=\gamma.\end{split}\) Similarly, note that \(\begin{split}\Pr\left(\overline{X}-\frac{\tau_{n-1,\gamma}S}{\sqrt{n}}> \mu_0\right) & = 1- \Pr\left(T\leq\tau_{n-1,\gamma}\right)\\ & = 1-\gamma.\end{split}\)

We may now define the confidence intervals. Let $\alpha\in(0,1)$ be a desired confidence level and define real-valued functions $g_1:\mathbb{R}^n\to\mathbb{R}$ and $g_2:\mathbb{R}^n\to\mathbb{R}$ as \(g_1(X) = \overline{X} - \frac{\tau_{n-1,\alpha/2}S}{\sqrt{n}}\qquad\text{and}\qquad g_1(X) = \overline{X} - \frac{\tau_{n-1,1-\frac{\alpha}{2}}S}{\sqrt{n}},\) each of which is clearly Borel. Then $(g_1(X),g_2(X))$ is the desired random interval, as \(\begin{split}\Pr\big(g_1(X)< \mu_0 <g_2(X)\big) &= \Pr\big( \mu_0 <g_2(X)\big) - \Pr\big(g_1(X)\geq\mu_0 \big)\\ & =1-\frac{\alpha}{2} - \frac{\alpha}{2}\\ & = 1-\alpha.\end{split}\)

Confidence intervals for ratio of means

We now arrive at the confidence interval that I was asked to derive for my work.

Suppose $X_{0,1},X_{0,2},\dots,X_{0,n_0}$ and $X_{1,1},X_{1,2},\dots,X_{1,n_1}$ are independent random variables such that $X_{0,i}$ is normally distributed with mean $\mu_0$ and variance $\sigma_0{}^2$ for each $i\in{1,\dots,n_0}$ and $X_{1,j}$ is normally distributed with mean $\mu_1$ and variance $\sigma_1{}^2$ for each $j\in{1,\dots,n_1}$. The estimand that we wish to estimate in this case is the ratio of the means $\rho=\frac{\mu_0}{\mu_1}$.

As before, define the sample mean random variables \(\overline{X}_0 = \frac{1}{n_0}\sum_{i=1}^{n_0}X_{1,i}\qquad \text{and}\qquad \overline{X}_1=\frac{1}{n_1}\sum_{j=1}^{n_1}\) and sample standard deviation random variables \(S_0 = \sqrt{\frac{\sum_{i-1}^n(\overline{X}_0-X_{0,j})^2}{n_0-1}}\qquad\text{and}\qquad S_1 = \sqrt{\frac{\sum_{j=1}^{n_1}(\overline{X}_1-X_{1,j})^2}{n_1-1}}.\) Define further the $t$-statistic random variables as \(T_0=\frac{\overline{X}_0-\mu_0}{S_0/\sqrt{n_0}}\qquad\text{and}\qquad T_1=\frac{\overline{X}_1-\mu_1}{S_1/\sqrt{n_1}},\) which are independent and each distributed according the Student’s $t$-distribution with $n_0-1$ and $n_1-1$ degrees of freedom, respectively.

Define now a real-valued function $h$ as \(\begin{split}h(\gamma,x_0,s_0,x_1,s_1) &= \int_{-\infty}^\frac{x_1}{s_1/\sqrt{n_1}}\left(1-F_{n_0-1}\left(\frac{x_0-\gamma(x_1-s_1t/\sqrt{n_1})}{s_0/\sqrt{n_0}}\right)\right)f_{n-1}(t)\mathrm{d}t\\ & \qquad + \int_{\frac{x_1}{s_1/\sqrt{n_1}}}^\infty F_{n_0-1}\left(\frac{x_0-\gamma(x_1-s_1t/\sqrt{n_1})}{s_0/\sqrt{n_0}}\right)f_{n_1-1}(t)\mathrm{d}t \end{split}\) for each $\gamma,x_0,x_1,s_0,s_1\in\mathbb{R}$ where $s_0$ and $s_1$ are positive. For fixed choices of values for $x_0$, $x_1$, $s_0$, and $s_1$, this function is strictly increasing in $\gamma$ and satisfies \(\lim_{\gamma\to-\infty}h(\gamma,x_0,s_0,x_1,s_1)=0\qquad\text{and}\qquad \lim_{\gamma\to+\infty}h(\gamma,x_0,s_0,x_1,s_1)=0.\) One may therefore define a real-valued function $g$ as follows. For each $\alpha\in(0,1)$ and every choice of values $\gamma,x_0,x_1,s_0,s_1\in\mathbb{R}$ such that $s_0$ and $s_1$ are positive, the value of \(g(\alpha,x_0,s_0,x_1,s_1)\) is equal to the unique value $\gamma\in\mathbb{R}$ such that 1 $h(\gamma,x_0,s_0,x_1,s_1)=\alpha$, and define the random variable $C_\alpha$ as \(C_\alpha = g(\alpha,\overline{X}_0,S_0,\overline{X}_1,S_1).\) By construction, it holds that \(h(C_\alpha,\overline{X}_0,S_0,\overline{X}_1,S_1) = \alpha\) and thus \(\Pr\left(\frac{\mu_0}{\mu_1}\leq C_\alpha\right) = \operatorname{E}\left(h(C_\alpha,\overline{X}_0,S_0,\overline{X}_1,S_1)\right) = \alpha.\) Now, for each $\alpha\in(0,1)$, one has that \(\begin{split}\Pr\left(C_{\frac{\alpha}{2}}<\frac{\mu_0}{\mu_1}<C_{1-\frac{\alpha}{2}}\right) &= \Pr\left(\frac{\mu_0}{\mu_1}<C_{1-\frac{\alpha}{2}}\right) - \Pr\left(\frac{\mu_0}{\mu_1}<C_{\frac{\alpha}{2}}\right) \\ &=1-\frac{\alpha}{2} - \frac{\alpha}{2}\\&=1-\alpha.\end{split}\) That is, for a given $\alpha\in(0,1)$, we have a confidence interval $(C_{\frac{\alpha}{2}},C_{1-\frac{\alpha}{2}})$ for $\frac{\mu_0}{\mu_1}$ with confidence level $\alpha$.

Code for constructing confidence intervals

Here is some code for implementing this process to compute the confidence intervals from sample data.

You can find the code in my colab notebook (with an application of it applied to a randomly generated set of sample data) here.

from scipy.stats import t
import scipy.integrate
import scipy.optimize
import numpy as np
import matplotlib.pyplot as plt
import statistics
import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
# Compute confidence interval for ratio of means

def integrand(t_, rho, mx, sx, nx, my, sy, ny):
  # t_ is the variable of integration
  # The other terms (rho, mx, sx, nx, my, sy, ny) are the parameter arguments
  #  - nx and ny are the number of samples of X and Y
  #  - mx and my are the sample meams (i.e., x = (X1+X2+...+Xn)/n)
  #  - sx and sy are the unbiased sample standard deviations
  #     (i.e., sx = 1/(n-1) * [sum of (Xi - x)^2]
  #  - rho is the estimate for the ratio of the true means of X and Y
  #     (i.e., rho = E(X)/E(Y))
  A = mx - rho * (my - t_ * sy / np.sqrt(ny))
  B = sx/np.sqrt(nx)

  il = A/B                  # inner limit of integration
  ol = np.sqrt(ny) * my / sy # outer limit of integration

  if t_ <= ol:
    inner_prob = 1 - scipy.stats.t.cdf(il,nx-1)
  elif t_ > ol:
    inner_prob = scipy.stats.t.cdf(il,nx-1)

  return inner_prob * scipy.stats.t.pdf(t_,ny-1)

def pvalue(rho, mx, sx, nx, my, sy, ny):
  # Returns the probability that the true mean is less than rho
  return scipy.integrate.quad(integrand, -np.inf, np.inf, args=(rho, mx, sx, nx, my, sy, ny))[0]

def pvalue_offset(rho, mx, sx, nx, my, sy, ny, gamma):
  return pvalue(rho, mx, sx, nx, my, sy, ny) - gamma

def get_confidence_interval(alpha, mx, sx, nx, my, sy, ny):
  # alpha is the confidence level (i.e., should be .9 or .95)
  confint = [scipy.optimize.root(pvalue_offset, mx/my, \
                                 args=(mx, sx, nx, my, sy, ny, gamma)\
                                 ).x[0] for gamma in [(1-alpha)/2, (1+alpha)/2]]
  return confint
  1. That is, it holds that $h(g(\alpha,x_0,s_0,x_1,s_1),x_0,s_0,x_1,s_1) = \alpha.$