Maximum Likelihood & Optimisation
Lauren Talluto
26.11.2024
The problem of parameter estimation
- In the Zombie example, we knew the probability of being a zombie:
\(p_z = 0.3\)
- We wanted to know the probability of observing some number of
zombies \(k\) in a sample of \(n=10\)
- More typically, we would observe \(k\) via taking a sample of size
\(n\), and then try to estimate \(p_z\).
Sampling a population
- We sampled a large population to determine the rate of zombism.
- Assume samples were random, iid
- Given a sample of \(n=25\), we
observed \(k=7\) zombies.
- Estimate \(p_z\), the proportion of
the population that is a zombie.
Sampling a population
- We sampled a large population to determine the rate of zombism.
- Assume samples were random, iid
- Given a sample of \(n=25\), we
observed \(k=7\) zombies.
- Estimate \(p_z\), the proportion of
the population that is a zombie.
- Intuition: We know the best estimate is \(p_z = 7/25 = 0.28\). Why?
Sampling a population
- We sampled a large population to determine the rate of zombism.
- Assume samples were random, iid
- Given a sample of \(n=25\), we
observed \(k=7\) zombies.
- Estimate \(p_z\), the proportion of
the population that is a zombie.
- Intuition: We know the best estimate is \(p_z = 7/25 = 0.28\). Why?
- However: Many other values of \(p_z\) are nearly as likely.
- Here, even if \(p_z = 0.4\), there
is an 8% chance of getting \(k =
7\).
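A quick check of this figure (not on the original slide, but using the same dbinom call introduced a few slides later):
# probability of observing exactly k = 7 zombies in n = 25 draws if p_z = 0.4
dbinom(x = 7, size = 25, prob = 0.4)
## [1] 0.07998648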
Parameter estimation
- We need a general method for parameter
estimation
- In this case, we want to estimate \(p_z\)
- Understanding the uncertainty in our estimate would
also be nice
The likelihood model
- If I have a model \(\theta\)…
- And a dataset \(X\)…
- Can I compute the probability that the data came from this model?
\(pr(X | \theta)\)
The likelihood model
- Example: \(\theta = p_z\), assume
for the moment that \(p_z = 0.4\)
- What is the probability of observing the data (\(k = 7\) with \(n = 25\)) if this model is true?
The likelihood model
- Example: \(\theta = p_z\), assume
for the moment that \(p_z = 0.4\)
- What is the probability of observing the data (\(k = 7\) with \(n = 25\)) if this model is true?
\[
pr(k=7| p_z = 0.4, n = 25) = ?
\]
Any guesses how?
The likelihood model
- We already described this system with a binomial
process
- This is a generative model: we describe the
statistical process (a binomial process with p = 0.4) that produces the
observed data.
- We can evaluate it with built-in functions in R
The likelihood model
- We already described this system with a binomial
process
- This is a generative model: we describe the
statistical process (a binomial process with p = 0.4) that produces the
observed data.
- We can evaluate it with built-in functions in R
# x is observation, in this case, k = 7
# size is n, the number of trials
# prob is the model parameter
dbinom(x = 7, size = 25, prob = 0.4)
## [1] 0.07998648
The likelihood model
# x is observation, in this case, k = 7
# size is n, the number of trials
# prob is the model parameter
dbinom(x = 7, size = 25, prob = 0.4)
## [1] 0.07998648
We can try another model to see if it’s better:
dbinom(x = 7, size = 25, prob = 0.2)
## [1] 0.1108419
The likelihood model
# x is observation, in this case, k = 7
# size is n, the number of trials
# prob is the model parameter
dbinom(x = 7, size = 25, prob = 0.4)
## [1] 0.07998648
We can try another model to see if it’s better:
dbinom(x = 7, size = 25, prob = 0.2)
## [1] 0.1108419
And we can plot it against many different potential models
# different potential values for p, which must be between 0 and 1
library(ggplot2) # for plotting
models = seq(0, 1, 0.005)
probs = dbinom(x = 7, size = 25, prob = models)
binom_models = data.frame(models, probs)
binom_plot = ggplot(binom_models, aes(x = models, y = probs)) + geom_line()
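The grid also gives a quick numerical answer (not in the original code, but using only the objects defined above): the value of p with the highest probability of producing the data is the intuitive estimate.
# grid value of p that maximises the probability of observing k = 7
models[which.max(probs)]
## [1] 0.28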
Likelihood functions
- The likelihood function is a function f of
the data \(X\) and model/parameters
\(\theta\)
- Here \(X = \{k\}\) and \(\theta = \{p_z\}\)
- \(n\) is a constant defined by our
study design.
- \(f(X,\theta)\) returns the
probability of observing \(X\), given a particular model \(\theta\): \(pr(X|\theta)\)
- Here, the binomial PMF is a useful likelihood function
Likelihood functions
- Intuition is fine, but how do we estimate or solve for the maximum
likelihood? Guessing might take a long time!
Solving for the MLE
\[
\begin{aligned}
\mathcal{L}(k,n|p) & = {n \choose k} p^k(1-p)^{(n-k)}
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\mathcal{L}(k,n|p) & = {n \choose k} p^k(1-p)^{(n-k)} \\
\frac{d \mathcal{L}(k,n|p)}{dp} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)}
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\mathcal{L}(k,n|p) & = {n \choose k} p^k(1-p)^{(n-k)} \\
\frac{d \mathcal{L}(k,n|p)}{dp} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
& = 0
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{\xcancel{n \choose k}} p^k (n-k)(1-p)^{(n-k-1)} & = {\xcancel{n
\choose k}}kp^{k-1}(1-p)^{(n-k)}
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p(p^{k-1}) (n-k)(1-p)^{(n-k-1)} & = kp^{k-1}(1-p)(1-p)^{(n-k-1)} \\
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p{\xcancel{(p^{k-1})}} (n-k){\xcancel{(1-p)^{(n-k-1)}}} & =
k{\xcancel{p^{k-1}}}(1-p){\xcancel{(1-p)^{(n-k-1)}}} \\
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p(p^{k-1}) (n-k)(1-p)^{(n-k-1)} & = kp^{k-1}(1-p)(1-p)^{(n-k-1)} \\
p (n-k) & = k(1-p) \\
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p(p^{k-1}) (n-k)(1-p)^{(n-k-1)} & = kp^{k-1}(1-p)(1-p)^{(n-k-1)} \\
p (n-k) & = k(1-p) \\
pn -pk & = k-pk \\
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p(p^{k-1}) (n-k)(1-p)^{(n-k-1)} & = kp^{k-1}(1-p)(1-p)^{(n-k-1)} \\
p (n-k) & = k(1-p) \\
pn - {\xcancel{pk}} & = k-{\xcancel{pk}} \\
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p(p^{k-1}) (n-k)(1-p)^{(n-k-1)} & = kp^{k-1}(1-p)(1-p)^{(n-k-1)} \\
p (n-k) & = k(1-p) \\
pn -pk & = k-pk \\
pn & = k \\
\end{aligned}
\]
Solving for the MLE
\[
\begin{aligned}
\frac{d \mathcal{L}(k,n|p)}{dp} = 0 & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)} - {n \choose k} p^k (n-k)(1-p)^{(n-k-1)} \\
{n \choose k} p^k (n-k)(1-p)^{(n-k-1)} & = {n \choose
k}kp^{k-1}(1-p)^{(n-k)}
\\
p(p^{k-1}) (n-k)(1-p)^{(n-k-1)} & = kp^{k-1}(1-p)(1-p)^{(n-k-1)} \\
p (n-k) & = k(1-p) \\
pn -pk & = k-pk \\
pn & = k \\
p & = \frac{k}{n} \\
\end{aligned}
\]
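As a quick numerical sanity check (not on the original slide), the stationary point at p = k/n is a maximum rather than a minimum: the likelihood is lower at neighbouring values of p.
# likelihood just below, at, and just above p = 7/25 = 0.28
dbinom(x = 7, size = 25, prob = c(0.27, 0.28, 0.29))
# the middle value (p = 0.28) is the largest of the three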
Optimisation
- In many (most) cases, analytical solutions are unavailable or
impractical.
- We turn to various algorithms for numerical optimisation
Optimisation
- In many (most) cases, analytical solutions are unavailable or
impractical.
- We turn to various algorithms for numerical optimisation
# define our data set
n = 25
k = 7
# write a custom function that returns the log likelihood
llfun = function(p, n, k) {
dbinom(k, n, p, log=TRUE) ## why log??
}
# we need an initial guess for the parameter we want to optimise
p_init = 0.5
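To answer the “why log??” comment above: with many observations the likelihood is a product of many small probabilities and underflows to zero, while the log-likelihood (a sum) stays well within floating-point range. A minimal illustration with purely hypothetical repeated data:
# hypothetical data: 1000 copies of the same observation (k = 7 of n = 25)
k_many = rep(7, 1000)
n_many = rep(25, 1000)
prod(dbinom(k_many, n_many, prob = 0.28))              # underflows to 0
sum(dbinom(k_many, n_many, prob = 0.28, log = TRUE))   # about -1741, still usable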
Optimisation
- In many (most) cases, analytical solutions are unavailable or
impractical.
- We turn to various algorithms for numerical optimisation
# optim will start at p_init, evaluate llfun using the data we pass
# and return the optimum (if found)
# the terms n = ... and k = ... must be named the way they are in llfun
optim(p_init, llfun, method = "Brent", n = n, k = k, control = list(fnscale = -1), lower=0, upper=1)
## $par
## [1] 0.28
##
## $value
## [1] -1.740834
##
## $counts
## function gradient
## NA NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Parameter estimation in Stan
- Stan is a modelling language for scientific computing
- We use probabilistic programming
- deterministic variables:
a = b * c
- Stochastic variables:
y ~ normal(mu, sigma)
Workflow
- Write a Stan model in a .stan file
- Prepare all data in R
- Use the rstan package to invoke the Stan compiler
- This translates your model into a C++ program, then compiles it for your computer
- Run the program from R using functions in the rstan package.
Parameter estimation in Stan
- All Stan programs need a data block where you define any input data for your model
- Variables must have a type
- Here, int indicates a variable that is an integer
- Constraints give rules about variables; here k must be non-negative, and n >= k
- The data block is also a great place for constants (like n)!
File: vu_advstats_students/stan/zomb_p.stan
data {
int <lower = 0> k;
int <lower = k> n;
}
Parameter estimation in Stan
- In the parameters block, we define unknowns that we want to estimate
- p is a real number between 0 and 1
File: vu_advstats_students/stan/zomb_p.stan
data {
int <lower = 0> k;
int <lower = k> n;
}
parameters {
real <lower = 0, upper = 1> p;
}
Parameter estimation in Stan
- In the model block we specify our likelihood function
- k comes from a binomial distribution with n trials and probability parameter p
File: vu_advstats_students/stan/zomb_p.stan
data {
int <lower = 0> k;
int <lower = k> n;
}
parameters {
real <lower = 0, upper = 1> p;
}
model {
k ~ binomial(n, p);
}
Parameter estimation in Stan
- In R, we load the rstan package
- Next we compile the model using stan_model
- Compiled models don't need to be re-compiled unless the code in the .stan file changes!
# assuming working directory is vu_advstats_students
library(rstan)
zomb_p = stan_model("stan/zomb_p.stan")
Parameter estimation in Stan
- We next create a data list that we will pass on to Stan
- Finally, we can use optimizing, which works similarly to optim but for Stan models
# assuming working directory is vu_advstats_students
library(rstan)
## reminder: only do this once unless zomb_p.stan changes!
zomb_p = stan_model("stan/zomb_p.stan")
# prepare data
# names must match the data block in Stan
zomb_data = list(n = 25, k = 7)
# estimate the parameter
optimizing(zomb_p, data = zomb_data)
## $par
## p
## 0.2800003
##
## $value
## [1] -14.82383
##
## $return_code
## [1] 0
##
## $theta_tilde
## p
## [1,] 0.2800003
Generalising to multiple observations
- Remember the product rule: for two independent events,
\[pr(A,B) = pr(A)pr(B)\]
- Likelihoods are probabilities, and we like to assume each data point
is independent. Thus:
\[
\begin{aligned}
\mathcal{L}(X_{1..n}|\theta) & = \prod_{i=1}^{n} \mathcal{L}(X_i|\theta) \\
& \mathrm{or} \\
\log \mathcal{L}(X_{1..n}|\theta) & = \sum_{i=1}^{n} \log \mathcal{L}(X_i|\theta)
\end{aligned}
\]
Generalising to multiple observations
- Remember the product rule: for two independent events,
\[pr(A,B) = pr(A)pr(B)\]
- Likelihoods are probabilities, and we like to assume each data point
is independent. Thus:
\[
\begin{aligned}
\mathcal{L}(X_{1..n}|\theta) & = \prod_{i=1}^{n} \mathcal{L}(X_i|\theta) \\
& \mathrm{or} \\
\log \mathcal{L}(X_{1..n}|\theta) & = \sum_{i=1}^{n} \log \mathcal{L}(X_i|\theta)
\end{aligned}
\]
# define our data set and initial guess
n_vec = c(25, 12, 134)
k_vec = c(7, 4, 27)
p_init = 0.5
# we take the sum of all the individual log likelihoods
llfun = function(p, n, k) {
sum(dbinom(k, n, p, log=TRUE))
}
# take only the part we want out of this, $par, the parameter estimate
optim(p_init, llfun, method = "Brent", n = n_vec, k = k_vec,
control = list(fnscale = -1), lower=0, upper=1)$par
## [1] 0.2222222
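Because all three samples share the same p, the result can also be checked analytically (not shown on the original slide): the pooled MLE is the total number of zombies over the total number of trials.
# pooled MLE: total successes / total trials
sum(k_vec) / sum(n_vec)
## [1] 0.2222222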
Multiple observations in Stan
- k and n are arrays of ints, each with a length of n_obs
- Comments are indicated with //
- The binomial function is vectorised, and can operate on multiple observations with no changes.
data {
int <lower = 1> n_obs; // number of data points
int <lower = 0> k [n_obs];
int <lower = 0> n [n_obs]; // number of trials for each data point
}
parameters {
real <lower = 0, upper = 1> p;
}
model {
k ~ binomial(n, p);
}
zomb_p2 = stan_model("stan/zomb_p2.stan")
Multiple observations in Stan
# prepare data
# names must match the data block in Stan
zomb_data = list(n = c(25, 12, 134), k = c(7, 4, 27), n_obs = 3)
# estimate the parameter
optimizing(zomb_p2, data = zomb_data)$par
## p
## 0.2222296
Why Bayes?
- I want to describe some phenomenon (“model”; \(\theta\))
- I have some general (“prior”) knowledge about the question: \(pr(\theta)\)
- I gather additional information (“data”; \(X\))
What is the probability that my model is correct given what I already
know about it and what I’ve learned?
\[ pr(\theta | X) \]
Applying Bayes’ Theorem
- We already know an expression for this:
\[
pr(\theta | X) = \frac{pr(X|\theta)pr(\theta)}{pr(X)}
\]
Applying Bayes’ Theorem
- We already know an expression for this:
\[
pr(\theta | X) = \frac{pr(X|\theta)pr(\theta)}{pr(X)}
\]
- The goal, \(pr(\theta | X)\), is
called the posterior probability of \(\theta\)
- We have already seen \(pr(X|\theta)\); this is the
likelihood of the data
- \(pr(\theta)\) is often called the
prior probability of \(\theta\). Could also be called
“other information about \(\theta\)”
- What about \(pr(X)\)? This is the
normalizing constant
When we do Bayesian inference, each of these terms is a full
probability distribution
The normalizing constant
- For the first zombie problem, we had a single observation (“I tested positive”), so we were able to add up all of the ways one could test positive.
- \(pr(T) = pr(T,Z) + pr(T,Z') =
pr(T|Z)pr(Z) + pr(T|Z')pr(Z')\)
The normalizing constant
- For the first zombie problem, we had a single observation (“I tested positive”), so we were able to add up all of the ways one could test positive.
- \(pr(T) = pr(T,Z) + pr(T,Z') =
pr(T|Z)pr(Z) + pr(T|Z')pr(Z')\)
More generally:
- \(pr(X) = \sum_{i=1}^n pr(X|\theta_i)pr(\theta_i)\), where the sum runs over all \(n\) possible models
- What about for continuous problems?
The normalizing constant
- For the first zombie problem, we had a single observation (“I tested positive”), so we were able to add up all of the ways one could test positive.
- \(pr(T) = pr(T,Z) + pr(T,Z') =
pr(T|Z)pr(Z) + pr(T|Z')pr(Z')\)
More generally:
- \(pr(X) = \sum_{i=1}^n pr(X|\theta_i)pr(\theta_i)\), where the sum runs over all \(n\) possible models
- What about for continuous problems?
- There are an infinite number of possible models if \(pr(\theta)\) is a continuous PDF.
- There are infinitely many possible datasets if X is real-valued
The normalizing constant
- For the first zombie problem, we had a single observation (“I tested positive”), so we were able to add up all of the ways one could test positive.
- \(pr(T) = pr(T,Z) + pr(T,Z') =
pr(T|Z)pr(Z) + pr(T|Z')pr(Z')\)
More generally:
- \(pr(X) = \sum_{i=1}^n pr(X|\theta_i)pr(\theta_i)\), where the sum runs over all \(n\) possible models
- What about for continuous problems?
- There are an infinite number of possible models if \(pr(\theta)\) is a continuous PDF.
- There are infinitely many possible datasets if X is real-valued
\[
pr(X) = \int_a^b pr(X|\theta)pr(\theta)d \theta
\]
- This integral can be challenging to compute
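For this one-parameter example the integral is in fact easy to evaluate numerically; a minimal sketch using base R's integrate and a flat Beta(1, 1) prior (choices made to match the running example, not part of the original slides):
# pr(X) for k = 7, n = 25 under a flat Beta(1, 1) prior
# the result is 1/26 (about 0.0385) for this conjugate case
integrate(function(p) dbinom(7, 25, p) * dbeta(p, 1, 1), lower = 0, upper = 1)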
Proportional Bayes’ Theorem
- \(pr(X)\) is a constant;
it adjusts the height of the distribution so that the posterior
integrates to 1
- If all we want to do is estimate the maximum value, we can safely
ignore it
Proportional Bayes’ Theorem
- \(pr(X)\) is a constant;
it adjusts the height of the distribution so that the posterior
integrates to 1
- If all we want to do is estimate the maximum value, we can safely
ignore it
\[
pr(\theta|X) \propto pr(X|\theta)pr(\theta)
\]
- For our example, we used the binomial PMF for the likelihood:
\[
pr(X|\theta) = pr(k,n|p) = {n \choose k} p^k(1-p)^{(n-k)}
\]
How to choose the prior, \(pr(\theta)\)?
What do we know about p?
- Must be between 0 and 1.
- We could assign equal probabilities using a uniform
distribution:
theta = seq(0, 1, 0.01)
pr_theta = dunif(theta, min = 0, max = 1)
plot(theta, pr_theta, type = 'l')
What do we know about p?
- Must be between 0 and 1.
- We could assign equal probabilities using a uniform
distribution.
- Not very flexible. Maybe we think central values are slightly more
likely?
- A beta distribution makes many shapes possible
theta = seq(0, 1, 0.01)
pr_theta_1 = dbeta(theta, shape1 = 1, shape2 = 1)
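The code above only evaluates the flat Beta(1, 1) case; a few more shapes (parameter values chosen purely for illustration) show how flexible the beta distribution is:
pr_theta_2 = dbeta(theta, shape1 = 2, shape2 = 2)  # gentle peak at 0.5
pr_theta_3 = dbeta(theta, shape1 = 5, shape2 = 2)  # skewed toward larger p
plot(theta, pr_theta_1, type = 'l', ylim = c(0, 3))
lines(theta, pr_theta_2, col = 'red')
lines(theta, pr_theta_3, col = 'blue')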
Maximum A Posteriori Estimation
- If we just want a Bayesian point estimate for \(\theta\), we can use the same algorithms
for MLE
- This is known as the maximum a posteriori (MAP) estimate, the Bayesian equivalent of the MLE
- We ignore the normalising constant and incorporate a prior into the
methods we used before
# log-likelihood, log-prior, and (unnormalised) log-posterior functions
log_likelihood = function(p, n, k)
  sum(dbinom(k, n, p, log = TRUE))
log_prior = function(p, a, b)
  dbeta(p, a, b, log = TRUE)
log_posterior = function(p, n, k, a, b)
  log_likelihood(p, n, k) + log_prior(p, a, b)
// Saved in vu_advstats_students/stan/zomb_p_bayes.stan
data {
int <lower = 1> n_obs;
int <lower = 0> k [n_obs];
int <lower = 0> n [n_obs];
real <lower = 0> a;
real <lower = 0> b;
}
parameters {
real <lower = 0, upper = 1> p;
}
model {
k ~ binomial(n, p);
p ~ beta(a, b);
}
Maximum A Posteriori Estimation
Now we can fit the model, either in R or in Stan
zomb_p_bayes = stan_model("stan/zomb_p_bayes.stan")
zomb_data = list(
n_obs = 1,
n = as.array(25), # force a single-value array, avoids an error
k = as.array(7),
a = 1, # totally flat prior to start
b = 1
)
## starting value for the optimisation
p_init = 0.5
optim(p_init, log_posterior, method = "Brent", n = zomb_data$n,
k = zomb_data$k, a = zomb_data$a, b = zomb_data$b,
control = list(fnscale = -1), lower=0, upper=1)$par
## [1] 0.28
optimizing(zomb_p_bayes, data = zomb_data)$par
## p
## 0.2800001
![](2_mle_files/figure-slidy/unnamed-chunk-38-1.png)
This prior has no influence on the posterior
Changing the prior
zomb_data$a = 2; zomb_data$b = 2
optimizing(zomb_p_bayes, data = zomb_data)$par
## p
## 0.2963022
zomb_data$a = 3; zomb_data$b = 1.5
optimizing(zomb_p_bayes, data = zomb_data)$par
## p
## 0.3272728
These priors are informative, but relatively weak (our data has weight equivalent to alpha=7, beta=18).
Changing the prior
What if we had already conducted an identical prior sample, with 20
zombies and 5 normals?
zomb_data$a = 20; zomb_data$b = 5
optimizing(zomb_p_bayes, data = zomb_data)$par
## p
## 0.5416636
Getting a normalized posterior
- We often want to know the full posterior distribution
- For some problems, we have analytical solutions to \(\int pr(X|\theta)pr(\theta)d \theta\)
Getting a normalized posterior
- We often want to know the full posterior distribution
- For some problems, we have analytical solutions to \(\int pr(X|\theta)pr(\theta)d \theta\)
- For the beta-binomial, if:
\[
\begin{aligned}
\overset{\small \color{blue}{prior}}{pr(p)} & = \mathrm{Beta}(a, b)
\\ \\
\overset{\small \color{blue}{likelihood}}{pr(k,n | p)} & =
\mathrm{Binomial}(k, n, p)
\end{aligned}
\]
Getting a normalized posterior
- We often want to know the full posterior distribution
- For some problems, we have analytical solutions to \(\int pr(X|\theta)pr(\theta)d \theta\)
- For the beta-binomial, if:
\[
\begin{aligned}
\overset{\small \color{blue}{prior}}{pr(p)} & = \mathrm{Beta}(a, b)
\\ \\
\overset{\small \color{blue}{likelihood}}{pr(k,n | p)} & =
\mathrm{Binomial}(k, n, p) \\ \\
\overset{\small \color{blue}{posterior}}{pr(p | k,n,a,b)} &=
\mathrm{Beta}(a + k, b + n - k)
\end{aligned}
\]
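This conjugate result gives a quick analytic check on the MAP estimates from the earlier slides; a small sketch using the mode of a Beta distribution, (α − 1)/(α + β − 2):
# posterior mode for a Beta(a, b) prior with k successes in n trials
post_mode = function(a, b, k, n) (a + k - 1) / (a + b + n - 2)
post_mode(1, 1, 7, 25)    # 0.28       (flat prior)
post_mode(2, 2, 7, 25)    # 0.2962963  (weak prior)
post_mode(20, 5, 7, 25)   # 0.5416667  (strong prior sample)
These values match the results returned by optimizing above.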
Getting a normalized posterior
- We often want to know the full posterior distribution
- For some problems, we have analytical solutions to \(\int pr(X|\theta)pr(\theta)d \theta\)
- In many cases, the shape of the log posterior is approximately quadratic (the posterior is approximately normal)
- Laplace approximation (or Quadratic approximation)
is a method for approximating the shape of this curve
Getting a normalized posterior
- We often want to know the full posterior distribution
- For some problems, we have analytical solutions to \(\int pr(X|\theta)pr(\theta)d \theta\)
- In many cases, the shape of the log posterior is approximately quadratic (the posterior is approximately normal)
- Laplace approximation (or Quadratic approximation)
is a method for approximating the shape of this curve
![](2_mle_files/figure-slidy/unnamed-chunk-44-1.png)
Getting a normalized posterior
- We often want to know the full posterior distribution
- For some problems, we have analytical solutions to \(\int pr(X|\theta)pr(\theta)d \theta\)
- In many cases, the shape of the log posterior is approximately quadratic (the posterior is approximately normal)
- Laplace approximation (or Quadratic approximation) is a method for approximating the shape of this curve; a minimal sketch follows at the end of this section
- When the above are unavailable, we can use simulations (e.g.,
MCMC)
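Not part of the original slides: a minimal sketch of a Laplace approximation for the zombie example, assuming the log_posterior() function from the MAP section is still defined (flat Beta(1, 1) prior, k = 7, n = 25). The curvature at the mode is estimated with a simple finite difference.
# find the posterior mode (MAP), exactly as before
map = optim(0.5, log_posterior, method = "Brent", n = 25, k = 7, a = 1, b = 1,
  control = list(fnscale = -1), lower = 0, upper = 1)$par
# second derivative (curvature) of the log posterior at the mode
h = 1e-4
curv = (log_posterior(map + h, 25, 7, 1, 1) - 2 * log_posterior(map, 25, 7, 1, 1) +
  log_posterior(map - h, 25, 7, 1, 1)) / h^2
# approximate the posterior with a normal centred on the mode
sigma = sqrt(-1 / curv)
curve(dnorm(x, map, sigma), from = 0, to = 1, xlab = "p", ylab = "approximate posterior density")
The approximation is centred at p = 0.28 with a standard deviation of roughly 0.09, close to the exact Beta(8, 19) posterior from the conjugate solution.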