Markov Chain Monte Carlo

Lauren Talluto

26.11.2024

The German Tank Problem

  • During WW2, the Allies wanted to know the rate of tank production in German factories
  • Known: factories stamped serial numbers on the tanks in order, starting with 1
  • I have captured a single tank, with (for example) the serial number \(s = 200\).
  • How many tanks have been produced? \((N)\)

The German Tank Problem

Knowns:

  • Captured serial number \(s = 200\)
  • Number of tanks \(N \ge s\)
  • Assume tanks are captured completely at random, so all tanks have the same probability of capture
  • This sounds like a uniform distribution!
  • Minimum of the distribution is 1; the maximum will be \(N\)
  • Hypothesis 1: We got the biggest tank; \(s = N\)

The German Tank Problem

Hypothesis 2: \(N = 400\)

  • The data \(s\) have not changed, just the model
  • All numbers are still equally likely to be captured
    • But now there are twice as many possibilities
    • Each individual tank is half as likely to be captured

Hypothesis 3: \(N = 800\)

  • The same logic applies!

… and so on.

What is the MLE? Is there a logical problem with that?

\[ s \sim Uniform(min = 1, max = N)\]
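
We can see the answer by plotting the likelihood of \(s = 200\) as a function of \(N\) (a quick sketch; the upper limit of 1000 is arbitrary):

s = 200
N = s:1000
lik = dunif(s, min = 1, max = N)  ## pr(s|N): zero until N >= s, then 1/(N-1)
plot(N, lik, type = "l", ylab = "pr(s|N)")
## the likelihood is maximized at N = s: the MLE implies we happened
## to capture the very last tank produced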

Rejection sampling

  • There are general algorithms for sampling from unknown distributions
    • Here, the target distribution, \(t(N)\), is difficult to sample from

\[ t(N) = pr(s | N) = \mathrm{Uniform}(s | \mathrm{min} = 1, \mathrm{max} = N) \]

  • However, a uniform distribution with a fixed minimum and maximum is easy to sample from
    • We will call this the proposal distribution \(p(x)\)
    • We will rescale it so that it is always taller than \(t(x)\)

Rejection sampling

Algorithm

  1. Define the target distribution \(t(x)\)
  2. Define the proposal distribution \(p(x)\), rescaled by a constant \(M\) so that \(M p(x) \ge t(x)\) everywhere
  3. Draw a random sample \(y\) from \(p(x)\)
  4. Compute the acceptance probability \(r = \frac{t(y)}{M p(y)}\)
  5. Accept the sample with probability \(r\), reject with probability \(1-r\)
  6. Repeat until desired number of samples obtained

This is incredibly inefficient

s = 200

## target: likelihood of the captured serial s as a function of N
## dunif(s, 1, x) is 1/(x - 1) when x >= s, and 0 otherwise
target = function(x) dunif(s, 1, x)
proposal = function(x) dunif(x, 1, 1e6)

## this scale will make sure scale * p(x) >= t(x)
scale = target(s) / proposal(s)
## start at 2, because dunif(s, 1, 1) is undefined (NaN)
candidates = as.integer(runif(1e6, 2, 1e6))
r = target(candidates) / (proposal(candidates) * scale)

## uniform numbers between 0 and 1
test = runif(length(r))

## TRUE if r > test; WHY?
accept = r > test
samples = candidates[accept]
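
A quick check of the inefficiency (a sketch, reusing the objects above):

mean(accept)     ## acceptance rate: a tiny fraction of the 1e6 candidates
length(samples)  ## only this many usable draws survive
hist(samples)    ## approximate posterior for N, concentrated just above s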

Markov Chains

  • Markov chains are defined by a state vector \(S\)
    • In this case, the value of \(S\) represents some parameter
    • For a model with \(k\) parameters, \(S\) is a matrix with \(k\) columns
  • The model is stochastic and has a memory
    • Moves via random walk: \(S_{t+1} = f(S_t)\)
    • Chain must be irreducible: it must be possible to (eventually) reach any possible value from any other possible value
  • Markov models are commonly used to model stochastic processes happening in discrete time steps (e.g., population growth)

\[ S_t = S_{t-1} + \text{fecundity} \times S_{t-1} - \text{mortality} \times S_{t-1} \]
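
For example, a stochastic version of this growth model takes a few lines of R (the parameter values here are invented for illustration):

n_steps = 50
S = numeric(n_steps)
S[1] = 10                                  ## initial population size
fecundity = 0.5
mortality = 0.4
for (t in 2:n_steps) {
    births = rpois(1, fecundity * S[t-1])  ## stochastic recruitment
    deaths = rbinom(1, S[t-1], mortality)  ## stochastic mortality
    ## the next state depends only on the current state: a Markov chain
    S[t] = S[t-1] + births - deaths
}
plot(S, type = "l", xlab = "time step", ylab = "population size")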

Markov Chain Monte Carlo

  • MCMC applies Markov chains to increase the acceptance rate of rejection sampling
    • With rejection sampling, we jump randomly anywhere in the state space
    • With MCMC, we take small steps through the space, centred around the most recently accepted value (see the sketch after this list)
    • Target acceptance rates are around 20–50%
  • Individual samples are not independent
  • If run for long enough, the chain approximates the shape of the posterior
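
The difference in the proposal step, in code (a sketch; the scale of 10 is an arbitrary choice):

current = 250                                 ## most recently accepted value
candidate_rejection = runif(1, 1, 1e6)        ## rejection sampling: jump anywhere
candidate_mcmc = rnorm(1, mean = current, sd = 10)  ## MCMC: a small local step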

Metropolis-Hastings

  • For an unknown (unnormalized) target distribution \(t(x)\) where we can compute the (proportional) height
    • For example, a posterior distribution

\[ pr(\theta | X) \propto pr(X | \theta)pr(\theta) \]

  1. Choose a starting value \(S_0\)
  2. Propose a candidate \(S_{cand}\) by sampling from a proposal distribution centred around the current state \(S_t\)
    • Frequently, \(S_{cand} \sim \mathcal{N}(S_t, \sigma_p)\)
    • \(\sigma_p\) is the proposal scale (more on this later)
  3. Compute an acceptance probability \(r = \frac{t(S_{cand})}{t(S_t)}\)
    • Accept or reject as in rejection sampling
    • If the candidate is better (\(r > 1\)), always accept
  4. Continue until the state of the chain converges on the target distribution

Metropolis algorithm summary

Algorithm

Define t(x): the log unnormalized posterior (i.e., "target") distribution
Define p(x): the proposal distribution
    common: rnorm(n = 1, mean = x, sd = proposal_scale)
Choose state[0] (the starting value)
for i in 1:n_samples
   candidate = p(state[i-1])
   r = exp(t(candidate) - t(state[i-1]))  ## acceptance probability
   if r > runif(1)  ## coin flip to see if we accept or not
      state[i] = candidate
   else
      state[i] = state[i-1]
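
A runnable R version of this pseudocode, applied to the tank problem (a minimal sketch: the bounded flat prior on \(N\), the proposal scale of 100, and treating \(N\) as continuous are all simplifying assumptions):

s = 200
N_max = 1e6   ## same upper bound used by the rejection sampler

## log unnormalized posterior: flat prior on [s, N_max] plus log-likelihood
log_target = function(N) {
    if (N < s || N > N_max) return(-Inf)
    dunif(s, min = 1, max = N, log = TRUE)  ## log pr(s|N)
}

n_samples = 5000
proposal_scale = 100
state = numeric(n_samples)
state[1] = s  ## starting value

for (i in 2:n_samples) {
    ## random walk proposal centred on the current state
    candidate = rnorm(1, mean = state[i-1], sd = proposal_scale)
    ## acceptance probability on the log scale; impossible candidates get r = 0
    r = exp(log_target(candidate) - log_target(state[i-1]))
    if (r > runif(1)) {
        state[i] = candidate      ## accept
    } else {
        state[i] = state[i-1]     ## reject: repeat the current value
    }
}

hist(state[-(1:1000)])  ## discard burn-in, inspect the posterior for N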

Multivariate Metropolis-Hastings
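
The slides above describe the state as a matrix with \(k\) columns; a sketch of the simplest multivariate extension (all names here are illustrative) proposes a step in every dimension at once:

k = 2
state = c(0, 0)               ## current value of all k parameters
proposal_scale = c(0.5, 0.5)  ## one proposal scale per parameter
candidate = rnorm(k, mean = state, sd = proposal_scale)
## r is then computed from the joint (log) posterior, exactly as before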

Tuning Metropolis-Hastings
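
One common tuning heuristic (a sketch of an assumed approach, not the only one): during warmup, adjust \(\sigma_p\) until the acceptance rate falls in the 20–50% target range mentioned earlier.

tune = function(proposal_scale, accept_rate) {
    if (accept_rate < 0.2)
        proposal_scale * 0.8  ## rejecting too often: take smaller steps
    else if (accept_rate > 0.5)
        proposal_scale * 1.2  ## accepting too often: take bigger steps
    else
        proposal_scale        ## acceptance rate is in the target range
}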

Traceplots

library(bayesplot)
## load previously saved posterior samples
samples = readRDS("../assets/misc/traceplt_ex.rds")
## plot the sampled value of each parameter at every iteration
mcmc_trace(samples)
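
For a well-tuned, converged sampler, each trace should look like stationary noise (a "fuzzy caterpillar"), with all chains overlapping.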

MCMC in Stan