
A Short Introduction to Bayesian Inference

Oregon State University

Bayesian Statistics

The process of Bayesian data analysis can be idealized by dividing it into the following three steps (from Gelman et al. (2013)):

  1. Setting up a full probability model---a joint probability distribution for all observable and unobservable quantities in a problem. The model should be consistent with knowledge about the underlying scientific problem and the data collection process.

  2. Conditioning on observed data: calculating and interpreting the appropriate posterior distribution—the conditional probability distribution of the unobserved quantities of ultimate interest, given the observed data.

  3. Evaluating the fit of the model and the implications of the resulting posterior distribution: how well does the model fit the data, are the substantive conclusions reasonable, and how sensitive are the results to the modeling assumptions in step 1? In response, one can alter or expand the model and repeat the three steps.

Notation

  • $\theta$ : unobservable parameters that we want to infer
  • $y$ : observed data
  • $\tilde{y}$ : unobserved but potentially observable quantity
  • $\mathbf{x}$ : vector quantity, always assumed to be a column vector
  • $p(\cdot)$ : probability density function or discrete probability (from context)
  • $p(\cdot \mid \cdot)$ : conditional probability

Bayes’ Theorem

In order to make probability statements about $\theta$ given $y$, we must build a model that provides a joint probability distribution for $\theta$ and $y$. We can write the joint distribution as a product of two distributions:

  • the prior $p(\theta)$ and
  • the sampling distribution $p(y \mid \theta)$; for fixed $y$ we call this the likelihood:

$$p(\theta, y) = p(y \mid \theta)\, p(\theta).$$

Factoring the other way, in terms of $p(y)$, we could instead write

$$p(\theta, y) = p(\theta \mid y)\, p(y).$$

Putting these together, we arrive at Bayes’ Theorem:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}.$$

Typically, computing $p(y) = \int d\theta\, p(\theta)\, p(y \mid \theta)$, called the evidence, is intractable. However, it is a constant with respect to $\theta$. Consequently, we often write Bayes’ Theorem as a proportional relation:

$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta).$$
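Because the evidence is just a normalizing constant, a low-dimensional posterior can be computed numerically by evaluating the unnormalized product $p(y \mid \theta)\, p(\theta)$ on a grid and dividing by its integral. A minimal sketch (my own toy Gaussian-mean example, not from the text):

```python
import numpy as np

# Toy example: infer a Gaussian mean (known sigma = 1) from three observations,
# using an N(0, 2^2) prior evaluated on a grid.
theta = np.linspace(-5, 5, 2001)                   # parameter grid
dtheta = theta[1] - theta[0]
prior = np.exp(-theta**2 / (2 * 2.0**2))           # prior, up to a constant
y = np.array([1.2, 0.8, 1.5])                      # toy observations
log_like = -0.5 * ((y[:, None] - theta[None, :]) ** 2).sum(axis=0)
unnorm = prior * np.exp(log_like)                  # p(y | theta) p(theta), unnormalized
evidence = unnorm.sum() * dtheta                   # p(y) by numerical integration
posterior = unnorm / evidence                      # now integrates to 1
```

The posterior peak lands between the prior mean (0) and the sample mean, exactly as conjugate-Gaussian algebra predicts; grid evaluation like this only works because $\theta$ is one-dimensional here.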

Example from Genetics

This example is also from Gelman et al. (2013).

Biological human males have one X-chromosome and one Y-chromosome, whereas females have two X-chromosomes, each chromosome being inherited from one parent. Hemophilia is a disease that exhibits X-chromosome-linked recessive inheritance, meaning that a male who inherits the gene that causes the disease on the X-chromosome is affected, whereas a female carrying the gene on only one of her two X-chromosomes is not affected. The disease is generally fatal for women who inherit two such genes, and this is rare, since the frequency of occurrence of the gene is low in human populations.

Prior distribution: Consider a woman who has an affected brother, which implies that her mother must be a carrier of the hemophilia gene, with one “good” and one “bad” hemophilia gene. Her father is not affected; thus the woman herself has a fifty-fifty chance of having the gene. Therefore, the woman is either a carrier of the gene $(\theta = 1)$ or not $(\theta = 0)$. The prior distribution for the unknown $\theta$ is then

$$p(\theta = 1) = p(\theta = 0) = 0.5.$$

Data model and likelihood: The data will be the status of the woman’s sons. Let $y_i = 1$ or $0$ denote an affected or unaffected son, respectively. Suppose she has two sons, neither of whom is affected. Assume that the sons’ conditions are independent given $\theta$. The likelihood is then

$$p(y_1 = 0, y_2 = 0 \mid \theta = 1) = 0.5 \cdot 0.5 = 0.25$$
$$p(y_1 = 0, y_2 = 0 \mid \theta = 0) = 1 \cdot 1 = 1$$

Posterior distribution: We can now use Bayes’ Theorem to find the posterior probability that the woman has the gene. Let $y = (y_1, y_2)$.

Solution to Exercise 1
$$\begin{align} p(\theta = 1 \mid y) &= \frac{p(y \mid \theta = 1)\, p(\theta = 1)}{p(y \mid \theta = 1)\, p(\theta = 1) + p(y \mid \theta = 0)\, p(\theta = 0)} \\ &= \frac{0.25 \cdot 0.5}{0.25 \cdot 0.5 + 1 \cdot 0.5} \\ &= 0.20. \end{align}$$

Updating Our Knowledge: Assume we find a long-lost third son who is also unaffected. We can use the posterior from the previous analysis as the prior for a new analysis that includes the third son.

Solution to Exercise 2

Using the same formula with the updated prior,

$$\begin{align} p(\theta = 1 \mid y) &= \frac{0.5 \cdot 0.20}{0.5 \cdot 0.20 + 1 \cdot 0.8} \\ &= 0.111. \end{align}$$
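The two analyses above take only a few lines to reproduce, and doing so verifies a useful fact: updating son-by-son gives the same answer as conditioning on all three sons at once. A quick sketch:

```python
def update(prior_carrier, n_unaffected_sons):
    """Posterior probability that the woman is a carrier,
    given n unaffected sons and a prior carrier probability."""
    like_carrier = 0.5 ** n_unaffected_sons  # each son unaffected w.p. 1/2 if carrier
    like_not = 1.0                           # unaffected for certain if not a carrier
    numerator = like_carrier * prior_carrier
    return numerator / (numerator + like_not * (1 - prior_carrier))

p_two_sons = update(0.5, 2)                 # two unaffected sons -> 0.2
p_three_sequential = update(p_two_sons, 1)  # third son, posterior as prior -> 0.111
p_three_at_once = update(0.5, 3)            # all three sons in one step -> same answer
```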

Building Bayesian Models with Probabilistic Programming Languages

Let’s switch over from genetics to astronomy and learn how to infer the value of the Hubble constant. I’ve prepared some data at h_z_measurements.feather.

First, let’s import the packages we’ll need for this section.

import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt
import arviz as az
import numpyro
import pandas as pd
from numpyro import distributions as dist
from numpyro import infer
df = pd.read_feather("h_z_measurements.feather")
df
_ = plt.errorbar(
    df["z"], df["H_z"], yerr=df["H_z_err"], fmt="o", markersize=4
)
plt.xlabel("z")
plt.ylabel(r"$H(z)\ \left[km / s/Mpc\right]$")
plt.tight_layout()

This is inspired by the recent DESI release. We could of course just look at $z = 0$, so this data is overkill for $H_0$ alone. We will expand on this model later to infer other parameters.

To build our model, we need to gather all the information we have about the Hubble parameter as a function of redshift. The Friedmann equations (with some extra terms we’ll talk about later) lead to

$$H(z) = H_0 \left(\Omega_m (1+z)^3 + \Omega_r (1+z)^4 + \Omega_{\Lambda} (1+z)^{3(1 + w_0 + w_a)}\, e^{-3 w_a \left(\frac{z}{1+z}\right)}\right)^{1/2}.$$
  • To start with, we’ll assume all parameters take their $\Lambda$CDM values and infer just $H_0$.

  • Our data are the $H(z)$ values observed at the corresponding redshifts. Call them $H_{obs}$.

  • The data look to have some random noise scattered around a trend. We’ll choose a Gaussian likelihood to account for this.

    • We need a forward model for our expected $H(z)$ value that is a function of redshift and $H_0$. Let’s create that.
def hz_forward(z, h0=67.0, omega_matter=0.3, omega_de=0.7, omega_rad=5.5e-5, w0=-1.0, wa=0.0):
    """Hubble parameter as function of redshift. Default parameters are LCDM."""
    a_inv = 1+z
    return h0 * jnp.sqrt(
        omega_matter * (a_inv) ** 3
        + omega_rad * (a_inv) ** 4
        + omega_de * (a_inv) ** (3 * (1 + w0 + wa)) * jnp.exp(-3 * wa * (z / (a_inv)))
    )
  • We also need to quantify our prior beliefs about $H_0$. Let’s assume a very wide prior and say that we believe with equal probability that $H_0$ can be anywhere from 50 to 100.
  • Putting everything together in our notation from earlier,

$$\begin{align} \theta &= H_0 \\ y_i &= H_{obs,\, i} \\ y_{err,\, i} &= H_{obs,\, i} \textrm{ uncertainty} \\ y_{pred,\, i} &= H_{model}(\theta, z_i) \\ p(\theta) &= U(50, 100) \\ p(y_i \mid \theta) &= \mathcal{N}(y_i \mid y_{pred,\, i},\, y_{err,\, i}) \end{align}$$

And finally,

$$p(\theta \mid y) \propto \prod_i \mathcal{N}(y_i \mid y_{pred,\, i},\, y_{err,\, i}) \cdot U(50, 100).$$

Note that the product comes about because each data point is independent.
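For concreteness, the unnormalized log-posterior density itself is only a few lines (a sketch of my own, using a simplified flat-$\Lambda$CDM stand-in for the full forward model above; evaluating it is easy, sampling from it is the hard part):

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def hz_lcdm(z, h0):
    # Simplified flat-LCDM forward model (Omega_m = 0.3, Omega_L = 0.7)
    return h0 * jnp.sqrt(0.3 * (1 + z) ** 3 + 0.7)

def log_posterior(h0, z, hz_obs, hz_err):
    # log p(theta): U(50, 100) prior -> constant inside the bounds, -inf outside
    log_prior = jnp.where((h0 > 50.0) & (h0 < 100.0), 0.0, -jnp.inf)
    # log p(y | theta): the product of independent Gaussians becomes a sum of logs
    log_like = jnp.sum(norm.logpdf(hz_obs, loc=hz_lcdm(z, h0), scale=hz_err))
    return log_prior + log_like
```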

How would we implement this in code?

This is where a probabilistic programming language (PPL) comes in. A PPL usually has the following capabilities:

  • Stochastic primitives and distributions as first-class citizens
  • Model building frameworks
  • Inference routines that can consume the models and automatically adjust/tune themselves

What PPLs are out there?

There are many! Several are popular in the Python ecosystem, each with its own benefits and drawbacks. Today, we’ll focus on NumPyro. Let’s see how to implement our Hubble parameter model in NumPyro.

NumPyro Basics

NumPyro is a PPL built on JAX. It comes with a large selection of distributions, inference routines, and model building tools.

  • NumPyro models are simple Python functions.
  • Within the function, we can call numpyro.distributions.<some-distribution> for each parameter we need to sample.
  • For our likelihood, we can also use a built-in distribution. If we tell NumPyro we have observed data, it will automatically calculate the probability for that data given the likelihood.

The easiest way to see this is with an example. Let’s fit a simple Gaussian.

Start by creating some data.

key = jax.random.key(117)
data = jax.random.multivariate_normal(
    key, jnp.array([2, 3]), cov=jnp.array([[1, 0], [0, 1]]), shape=(500,)
)
_ = plt.scatter(*data.T)
plt.tight_layout()

Now, we’ll build the NumPyro model. As arguments, it needs to take the input data.

def simple_2d_gaussian(data=None):
    # Get the mean vector
    mu = numpyro.sample("mu", dist.Uniform(-5, 5), sample_shape=(2,))

    # We'll assume a diagonal covariance; the diagonal entries must be positive
    cov_diag = numpyro.sample("cov_diag", dist.Uniform(0, 10), sample_shape=(2,))

    # Arbitrary transformations are fine!
    # "Deterministic" stores them in the trace
    cov = numpyro.deterministic("cov_mat", jnp.diag(cov_diag))

    # The data are independent
    with numpyro.plate("data", data.shape[0]):
        # Get the likelihood
        numpyro.sample(
            "loglike", dist.MultivariateNormal(loc=mu, covariance_matrix=cov), obs=data
        )

This creates a model that we can visualize as a probabilistic graphical model or more specifically a Bayesian network.

numpyro.render_model(simple_2d_gaussian, model_args=(data,), render_distributions=True)

Now, we can pass off our model to a routine that will approximate the posterior. Let’s first, however, take a detour and let you build your own NumPyro model.

Solution to Exercise 3
def hz_model(z, hz, hz_err):
    h0 = numpyro.sample("h0", dist.Uniform(50,100))
    with numpyro.plate("data", hz.shape[0]):
        numpyro.sample("loglike", dist.Normal(hz_forward(z, h0=h0), hz_err), obs=hz)

numpyro.render_model(hz_model, model_args=(df['z'].to_numpy(), df['H_z'].to_numpy(), df['H_z_err'].to_numpy()), render_distributions=True)

Approximating Posterior Distributions

Now that we have our model built, we need a way to find the posterior distribution of the parameters of interest. In very rare circumstances, our posterior has a closed form expression (see conjugate priors). Most of the time, we need numerical approximations to learn about the posterior.
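As an aside, here is what one of those rare closed forms looks like (a toy Beta-Bernoulli example of my own, not from the text): a Beta prior with a Bernoulli likelihood yields a Beta posterior, so no sampler is needed.

```python
import numpy as np

a, b = 2.0, 2.0                              # Beta(2, 2) prior on a success probability
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 6 successes, 2 failures
a_post = a + data.sum()                      # conjugate update: just add the counts
b_post = b + len(data) - data.sum()
post_mean = a_post / (a_post + b_post)       # analytic posterior mean
```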

One of the most ubiquitous algorithms for this purpose is Markov Chain Monte Carlo.

Markov Chain Monte Carlo (MCMC)

This algorithm allows us to draw samples from a posterior when we do not know or cannot calculate the posterior distribution from first principles. The name comes from the merging of two statistical ideas: Markov Chains and Monte Carlo processes.

  • Markov Chains: a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
  • Monte Carlo methods: computational algorithms that rely on repeated random sampling to obtain numerical results

MCMC works in the following way:

  1. The chain of posterior samples is currently at $x$.
  2. A new location $x'$ is drawn from a proposal distribution $q(x' \mid x)$, for which we’ll use $q(x' \mid x) = \mathcal{N}(x, \sigma)$.
  3. The ratio of the target probability density at the proposed location to the current location is calculated: $\alpha = \frac{p(x')}{p(x)}$.
  4. If $\alpha \geq 1$ the jump is accepted; if $\alpha < 1$, it’s accepted with probability $\alpha$. If a jump is rejected, the current sample is repeated in the chain.

See https://chi-feng.github.io/mcmc-demo/

We’ll create a simple distribution by hand and then draw samples from it.

def pdf(x, mu=10, sigma=2.0):
    "Gaussian PDF"
    return (
        1 / jnp.sqrt(2 * jnp.pi * sigma**2) * jnp.exp(-((x - mu) ** 2) / (2 * sigma**2))
    )


def p(x):
    return 0.5 * (pdf(x, mu=10, sigma=2.0) + pdf(x, mu=15, sigma=1))
from scipy.stats import norm
from tqdm import tqdm
key, _ = jax.random.split(key)
x0 = -7.0
x = x0
p_current = p(x0)
chain = [x]
probs = [p_current]

niter = 10000
sigma_jump = 5.

for i in tqdm(range(niter)):
    xp = norm.rvs(loc=x, scale=sigma_jump)
    p_p = p(xp)
    
    α = p_p/p_current
    u = jax.random.uniform(key)
    key, _ = jax.random.split(key)
    accepted = u < α
    
    if accepted:
        x = xp
        p_current = p_p
    chain.append(x)
    probs.append(p_current)
plt.hist(chain, bins=100, density=True)
_ = plt.plot(jnp.linspace(-5,20), p(jnp.linspace(-5,20)))
plt.tight_layout()

We can use ArviZ to explore our posterior more. Here’s an example.

simple_inf_data = az.from_dict({"x":chain})
_ = az.plot_trace(simple_inf_data)
_ = az.plot_autocorr(simple_inf_data)
_ = az.plot_ess(simple_inf_data, kind="evolution")
plt.tight_layout()

Other Algorithms Beyond MCMC

MCMC works well, and often it is all you need. However, there are a plethora of sampling algorithms to explore. Here are a few you may encounter.

Parallel Tempering (Replica Exchange Monte Carlo)

Our test distribution above was bimodal, but there was enough overlap for the sampler to explore both modalities. What would happen as we increase the bimodality?

def p_big_bimodal(x):
    return 0.5 * (pdf(x, mu=0, sigma=2.0) + pdf(x, mu=30, sigma=1))


key, _ = jax.random.split(key)
x0 = -7.0
x = x0
p_current = p_big_bimodal(x0)
chain = [x]
probs = [p_current]

niter = 10000
sigma_jump = 5.

for i in tqdm(range(niter)):
    xp = norm.rvs(loc=x, scale=sigma_jump)
    p_p = p_big_bimodal(xp)
    
    α = p_p/p_current
    u = jax.random.uniform(key)
    key, _ = jax.random.split(key)
    accepted = u < α
    
    if accepted:
        x = xp
        p_current = p_p
    chain.append(x)
    probs.append(p_current)
plt.hist(chain, bins=100, density=True)
_ = plt.plot(jnp.linspace(-5,35), p_big_bimodal(jnp.linspace(-5,35)))
plt.tight_layout()

The sampler never explores the other modality. How can parallel tempering fix that?

  1. “Temper” the posterior density: $p(y \mid \theta)^{\beta}\, p(\theta)$, with $\beta = 1/T$
    • High temperatures (small $\beta$) flatten the posterior surface, allowing our chains to explore more space
  2. Run many temperatures in parallel and propose swaps between them
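The two steps above can be sketched in a few dozen lines (a minimal implementation of my own, using the same well-separated bimodal target as `p_big_bimodal`; the temperature ladder and jump scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    # Well-separated bimodal target: 0.5 * (N(0, 2^2) + N(30, 1))
    return 0.5 * (np.exp(-x**2 / 8.0) / np.sqrt(8.0 * np.pi)
                  + np.exp(-(x - 30.0) ** 2 / 2.0) / np.sqrt(2.0 * np.pi))

betas = np.array([1.0, 0.3, 0.05])   # inverse temperatures, beta = 1/T
x = np.full(len(betas), -5.0)        # one walker per temperature
cold_chain = []

for step in range(20000):
    # Metropolis update within each replica, targeting p(x)^beta
    prop = x + rng.normal(scale=5.0, size=x.size)
    accept = rng.random(x.size) < (p(prop) / p(x)) ** betas
    x = np.where(accept, prop, x)
    # Propose a swap between a random adjacent pair of temperatures
    i = rng.integers(len(betas) - 1)
    swap_ratio = (p(x[i + 1]) / p(x[i])) ** (betas[i] - betas[i + 1])
    if rng.random() < swap_ratio:
        x[i], x[i + 1] = x[i + 1], x[i]
    cold_chain.append(x[0])

cold_chain = np.array(cold_chain)
```

The cold ($\beta = 1$) chain now visits both modes: states discovered by the nearly-flat hot chain percolate down through the swap moves.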

Using Gradient Information

What if instead of a random walk, we wanted to guide our sampler? This is where Hamiltonian Monte Carlo (HMC) shines.

HMC essentially sets up a Hamiltonian for our probabilistic model and solves Hamilton’s equations. As such, we now have equations of motion for our sampler.

This helps tremendously with posteriors that have strange geometries or high dimension.

The general algorithm is:

  1. Define the potential energy to be $U(\theta) = -\log\left(p(y \mid \theta)\, p(\theta)\right)$
  2. Sample kinetic energy from a proposal distribution
  3. Solve Hamilton’s equations numerically with a symplectic integrator for LL steps of stepsize ε
  4. Accept or reject the final position with a Metropolis step as in standard MCMC
  5. Repeat 2--4
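Steps 1-5 can be sketched for a one-dimensional standard-normal target, where $U(\theta) = \theta^2/2$ (a minimal hand-rolled example of my own; real implementations like NumPyro's tune $\epsilon$ and $L$ automatically):

```python
import numpy as np

rng = np.random.default_rng(42)

# Target: standard normal, so U(q) = q^2 / 2 and grad U(q) = q.
def grad_U(q):
    return q

def leapfrog(q, p, eps, L):
    # Symplectic (leapfrog) integration of Hamilton's equations
    p = p - 0.5 * eps * grad_U(q)      # initial half step for momentum
    for _ in range(L - 1):
        q = q + eps * p                # full step for position
        p = p - eps * grad_U(q)        # full step for momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)      # final half step for momentum
    return q, p

q = 0.0
samples = []
for _ in range(5000):
    p0 = rng.normal()                              # step 2: sample momentum
    q_new, p_new = leapfrog(q, p0, eps=0.2, L=10)  # step 3: integrate
    # Step 4: Metropolis accept/reject on the total energy H = U + K
    dH = (q_new**2 + p_new**2) / 2 - (q**2 + p0**2) / 2
    if rng.random() < np.exp(-dH):
        q = q_new
    samples.append(q)

samples = np.array(samples)
```

Because the leapfrog integrator nearly conserves $H$, almost every proposal is accepted even though each one travels far from the current point.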

An alternative to HMC is the No-U-Turn Sampler (NUTS; Hoffman & Gelman, 2011). NUTS aims to fix some issues with hyperparameter tuning in HMC, and it’s extremely prevalent in Bayesian inference.

NUTS removes the need to carefully tune the number of steps LL: it is an adaptive algorithm that chooses the best LL to properly explore the posterior. It follows the same algorithm as HMC, but for each iteration solves Hamilton’s equations forwards and backwards in time (reverses momentum sign) until the sampler makes a U-Turn (positions get closer together instead of further apart).

See https://chi-feng.github.io/mcmc-demo/

Let’s sample our simple 2D Gaussian model from earlier with the NUTS sampler from NumPyro.

key, _ = jax.random.split(key)
nuts_mcmc = infer.MCMC(infer.NUTS(simple_2d_gaussian), num_warmup=500, num_samples=1000, num_chains=3, chain_method="sequential")
nuts_mcmc.run(key, data)
gauss_inf_data = az.from_numpyro(nuts_mcmc)
az.summary(gauss_inf_data, var_names="~cov_mat")
az.rhat(gauss_inf_data,var_names="~cov_mat")
_ = az.plot_trace(gauss_inf_data, var_names="~cov_mat")
_ = az.plot_ess(gauss_inf_data, var_names="~cov_mat", kind="evolution")
_ = az.plot_pair(gauss_inf_data, var_names="~cov_mat", kind="kde", marginals=True, figsize=(8,8))
Solution to Exercise 4
key, _ = jax.random.split(key)
hz_nuts = infer.MCMC(infer.NUTS(hz_model), num_warmup=500, num_samples=1000, num_chains=3)
hz_nuts.run(key, df['z'].to_numpy(), df['H_z'].to_numpy(), df['H_z_err'].to_numpy())

hz_inf_data = az.from_numpyro(hz_nuts)

Bayesian Model Comparison

Now we know how to build models and infer their unobserved parameters, but how do we know which model to use for our data? This is where model comparison comes in.

There are many ways to compare Bayesian models and quantify how well each explains your data. Most of these work by finding the Bayes factor for one model over another. Remember the Bayesian evidence $p(y) = \int d\theta\, p(y \mid \theta)\, p(\theta)$ that we mentioned earlier? The Bayes factor for one model over another is the ratio of their evidences.

For models $M_1$ and $M_2$, the Bayes factor is

$$B_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)}.$$
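To make the definition concrete, here is a toy calculation of my own (not the Hubble example): Gaussian data with known $\sigma = 1$, comparing a fixed-mean model against one with a free mean, with each evidence computed by brute-force integration on a grid:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=0.5, scale=1.0, size=20)   # toy data

def log_like(theta):
    return -0.5 * np.sum((y - theta) ** 2) - 0.5 * len(y) * np.log(2 * np.pi)

# M1: mean fixed at 0 -> no free parameters, evidence is just the likelihood
evidence_m1 = np.exp(log_like(0.0))

# M2: mean theta with a U(-5, 5) prior (density 1/10)
theta = np.linspace(-5, 5, 4001)
dtheta = theta[1] - theta[0]
like = np.exp(np.array([log_like(t) for t in theta]))
evidence_m2 = np.sum(like * (1 / 10)) * dtheta   # p(y | M2) = integral of p(y|theta) p(theta)
bf_21 = evidence_m2 / evidence_m1
```

Note the built-in Occam penalty: the wide U(-5, 5) prior dilutes M2's evidence by the prior density 1/10, so the extra parameter only wins if the data pull the likelihood far enough from zero.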

We can find the evidence either directly, or indirectly. Directly computing the evidence is generally very hard, as mentioned previously. However, there is a class of samplers that can do this: Nested Sampling.

  • Nested sampling is essentially an integration algorithm for the evidence, with parameter inference as a side effect. However, it isn’t feasible in high dimensions.

If our models are nested, we can compute the Savage-Dickey density ratio to find the Bayes factor.

  • For models $M_1$ and $M_2$ with common parameters $\theta$, assume $M_1$ is obtained from $M_2$ by fixing the extra parameters $\eta$ of $M_2$ to the values $\eta_1$. Then, the Bayes factor is

$$BF_{12} = \frac{p(\eta_1 \mid y,\, M_2)}{p(\eta_1 \mid M_2)}.$$

For indirect methods, I will mention two: product-space sampling and thermodynamic integration.

  • Product-space sampling works by defining a product-space model: a model that contains at least two submodels and treats model selection itself as a Bayesian problem. For example, we could have a latent variable $n$ that indexes a list of model likelihoods and selects from them.
    • After running our chains, the ratio of the number of samples in one model over another gives us the Bayes factor.
  • Thermodynamic integration is made possible by parallel tempering: we define an integral over the different temperature posterior samples that approximates the Bayesian evidence.
$$\log Z = \int_0^1 d\beta\ E_\beta\left[\log p(y \mid \theta)\right],$$

where $E_\beta[\cdot]$ is an expectation over the tempered posterior at inverse temperature $\beta = 1/T$.
Solution to Exercise 5
from numpyro.contrib.nested_sampling import NestedSampler

key, _ = jax.random.split(key)
ns = NestedSampler(hz_model) 
ns.run(key, df['z'].to_numpy(), df['H_z'].to_numpy(), df['H_z_err'].to_numpy())
ns.print_summary()  # Evidence will be log Z

def hz_model_w0_wa(z, hz, hz_err):
    h0 = numpyro.sample("h0", dist.Uniform(50,100))
    w0 = numpyro.sample("w0", dist.Uniform(-2, 0))
    wa = numpyro.sample("wa", dist.Uniform(-2, 0))
    with numpyro.plate("data", hz.shape[0]):
        numpyro.sample("loglike", dist.Normal(hz_forward(z, h0=h0, w0=w0, wa=wa), hz_err), obs=hz)

# Check the model
numpyro.render_model(hz_model_w0_wa, model_args=(df['z'].to_numpy(), df['H_z'].to_numpy(), df['H_z_err'].to_numpy()), render_distributions=True)


key, _ = jax.random.split(key)
ns_w0_wa = NestedSampler(hz_model_w0_wa) 
ns_w0_wa.run(key, df['z'].to_numpy(), df['H_z'].to_numpy(), df['H_z_err'].to_numpy())
ns_w0_wa.print_summary()  # Evidence will be log Z

Expected result: log-evidences of roughly $-249$ for the $w_0 w_a$ model versus $-29271$ for $\Lambda$CDM (note `jnp.log10(-249 - -29271)` $\approx$ 4.46), so the evidence overwhelmingly favors the $w_0 w_a$ model! The $H(z)$ data you used were actually not $\Lambda$CDM: they were generated with $w_0 = -0.2,\, w_a = -1.5,\, H_0 = 67 \textrm{ km/s/Mpc}$.

References
  1. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis (Third). CRC. https://stat.columbia.edu/~gelman/book/
  2. Hoffman, M. D., & Gelman, A. (2011). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. arXiv. 10.48550/ARXIV.1111.4246