Regression & the Generalised Linear Model

Lauren Talluto

29.11.2024

Regression problems

  • Many applied problems take the form of a set of observations \(y\) along with one or more covariates \(x\)
  • We want to know how \(y\) changes (on average) with \(x\), and how much variance there is in \(y\)
  • We can draw a line through the points that minimizes the sum of squared residuals \(\sum_i (y_i - \mathbb{E}[y|x_i])^2\)
  • \(\mathbb{E}[y|x_i]\) is the expectation of \(y\) conditional on \(x\)
    • what do we expect \(y\) to be, on average, for any value of \(x\)?
  • This line has two parameters, an intercept \(a\) and a slope \(b\)
    • \(\mathbb{E}(y|x_i) = a + bx_i\)
  • \(y \sim \mathcal{N}(a + bx, s)\)
    • the \(\sim\) symbol is read “\(y\) is distributed as”
  • Key assumptions:
    • the observations \(y\) are independent and identically distributed (iid)
    • \(y\) has constant variance with respect to \(x\)
    • the residuals are normally distributed (or, equivalently, \(y|x\) is normally distributed)
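
A quick way to see these pieces working together is to simulate data from this model and recover the line with least squares (a minimal sketch; all parameter values are illustrative):

# simulate from y ~ Normal(a + b*x, s) with known parameters
set.seed(123)
n = 100
a = 2; b = 0.5; s = 1   # illustrative "true" values
x = runif(n, 0, 10)
y = rnorm(n, a + b * x, s)

# least squares recovers a and b; the fit minimizes the sum of squared residuals
fit = lm(y ~ x)
coef(fit)
sum(residuals(fit)^2)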

Graphing the model

  • It is always a good idea to graph your model
  • The graph represents knowns and unknowns as nodes, with relationships as edges
  • Variables (nodes) may be either stochastic or deterministic
  • Your model must estimate all unknowns
    • stochastic unknowns are usually parameters
    • deterministic unknowns must be computed somewhere in your model
    • stochastic knowns usually appear as the first parameter in a probability statement like dnorm

Graphing the model

  • Bayesian models that can be fit in Stan are directed acyclic graphs
  • Can draw them by hand, or use igraph and ggnetwork
library("igraph")
library("ggnetwork")
library("ggplot2")  # ggnetwork extends ggplot2; attach it explicitly for the plotting code

# Here we define the edges
# -+ draws an arrow (+ is the arrowhead)
# Nodes are created implicitly; name a node here and it will be added
gr = graph_from_literal(a-+"E(y)", b-+"E(y)", s-+y, x-+"E(y)", "E(y)"-+y)

# Here we define two attributes, type and source
# These are assigned to nodes using the V (for Vertex) function
# The order is defined by the order the nodes appear when the graph is created
V(gr)$type = c("stochastic", "deterministic", "stochastic", 
               "stochastic", "stochastic", "deterministic")
V(gr)$source = c("unknown", "unknown", "unknown", 
                 "unknown", "known", "known")

# Here we define an optional layout, telling ggnetwork where to put each node
# The matrix should have one row per node, and columns are x and y coordinates
# If you skip this, ggnetwork will try to make a pretty graph by itself
layout = matrix(c(0.5,2,  0.5,1,  1,2,  1.5,2,  1.25,0, 0,2), 
                byrow=TRUE, ncol=2)


n = ggnetwork(gr, layout=layout)
(pl = ggplot(n, aes(x = x, y = y, xend = xend, yend = yend)) + 
   geom_edges(colour="gray50", arrow=arrow(length = unit(6, "pt"), 
                                type = "closed")) + 
   theme_blank() + geom_nodes(aes(color=type, shape = source), size=6) + 
   geom_nodelabel(aes(label = name), fontface = "bold", nudge_x=-0.1))

Using the graph

  • You can use the graph to help you design your Stan code!
  1. Start with an empty model
  2. Simple guideline: everything in your graph must do one of the following:
    • Appear on the left side of the ~ symbol (stochastic nodes)
    • Appear on the left side of the = symbol (deterministic unknowns)
    • Appear in the data block (everything else)
    • Stochastic variables that are also measured will appear both in data and on the left side of ~
data {

}
parameters {

}
transformed parameters {

}
model {

}

Using the graph

  • Start at the bottom: define the probabilistic model for y by placing it on the left side of ~.
    • In our graph, y has two dependencies (E(y), called eta in the code, and s); these should appear on the right side of the expression for y.
    • I also need to declare the variable (i.e., tell Stan what kind of variable y is: a vector with length n)
data {
    vector [n] y;
}
parameters {

}
transformed parameters {

}
model {
    y ~ normal(eta, s);
}

Using the graph

  • Everything I used to declare/define y now needs its own declaration and definition.
    • Here we do s and n, which have no dependencies.
data {
    int <lower = 1> n;
    vector [n] y;
}
parameters {
    real <lower = 0> s;
}
transformed parameters {

}
model {
    y ~ normal(eta, s);
    s ~ exponential(0.1);
}

Using the graph

  • Iterate, following each dependency until everything on the graph is in the model
data {
    int <lower = 1> n;
    vector [n] y;
}
parameters {
    real <lower = 0> s;
}
transformed parameters {
    vector [n] eta;
    eta = a + b * x;
}
model {
    y ~ normal(eta, s);
    s ~ exponential(0.1);
}

Using the graph

  • Iterate, following each dependency until everything on the graph is in the model
data {
    int <lower = 1> n;
    vector [n] y;
    vector [n] x;
}
parameters {
    real <lower = 0> s;
    real a;
    real b;
}
transformed parameters {
    vector [n] eta;
    eta = a + b * x;
}
model {
    y ~ normal(eta, s);
    s ~ exponential(0.1);
    a ~ normal(0, 10);
    b ~ normal(0, 5);
}

What to put where??

These are guidelines, not strict rules…

data {
    // declaration of all knowns
}
parameters {
    // declaration of all stochastic unknowns
}
transformed parameters {
    // declaration and definition of deterministic unknowns
}
model {
    // probabilistic definitions of all stochastic nodes (priors and likelihood)
}

Fitting to the Penguins

data {
    int <lower = 1> n;
    vector [n] bill_depth;
    vector [n] bill_len;
}
parameters {
    real <lower = 0> s;
    real a;
    real b;
}
transformed parameters {
    vector [n] eta;
    eta = a + b * bill_len;
}
model {
    bill_depth ~ normal(eta, s);
    s ~ exponential(0.1);
    a ~ normal(0, 10);
    b ~ normal(0, 5);
}

Fitting to the Penguins

library(rstan)
peng_lm = stan_model("stan/penguin_lm.stan")
penguins = as.data.frame(palmerpenguins::penguins)
penguins = subset(penguins, complete.cases(penguins) & species == "Gentoo")
peng_dat = with(penguins, list(
    bill_len = bill_length_mm,
    bill_depth = bill_depth_mm,
    n = length(bill_length_mm)
))
peng_fit = sampling(peng_lm, data = peng_dat, refresh = 0)
print(peng_fit, pars = c("a", "b", "s"))
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##   mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## a 5.02    0.03 1.07 2.91 4.28 5.02 5.76  7.17  1184    1
## b 0.21    0.00 0.02 0.16 0.19 0.21 0.23  0.25  1174    1
## s 0.76    0.00 0.05 0.67 0.72 0.76 0.79  0.87  1572    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Nov 29 12:29:40 2024.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
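
One way to inspect the fit (a sketch; assumes peng_fit and peng_dat from above) is to extract posterior draws with rstan::extract() and overlay sampled regression lines on the data:

samps = rstan::extract(peng_fit, pars = c("a", "b"))
plot(peng_dat$bill_len, peng_dat$bill_depth,
     xlab = "Bill length (mm)", ylab = "Bill depth (mm)")
# a few posterior draws visualise uncertainty in the regression line
for(i in sample(length(samps$a), 50))
    abline(samps$a[i], samps$b[i], col = "#77777744")
# posterior mean line
abline(mean(samps$a), mean(samps$b), col = "blue", lwd = 2)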

More linear models

  • Our linear model used a single x-variable: \(\mathbb{E}(y) = a + bx\)
  • It is trivial to add additional predictors to a model: \(\mathbb{E}(y) = a + b_1x_1 + b_2x_2 + \dots + b_nx_n\)

Adding a Curve

  • Our linear model used a single x-variable: \(\mathbb{E}(y) = a + bx\)
  • It is trivial to add additional predictors to a model: \(\mathbb{E}(y) = a + b_1x_1 + b_2x_2 + \dots + b_nx_n\)
  • One reason to do this is to add a curve: we could add a new variable called \(x^2\)

\[ \mathbb{E}(y) = a + b_1x + b_2x^2 \]

Dataset: Nancy Howell (Univ. Toronto) | Measurements of the Dobe !Kung people | Republished in McElreath 2020

library(data.table)
kung = fread("https://github.com/rmcelreath/rethinking/raw/master/data/Howell1.csv")
(pl = ggplot(data = kung, aes(x = weight, y = height)) + geom_point(col = "#555555aa", size = 0.9) +
    geom_smooth(method = "lm", col = "blue", linewidth = 1.5, se = FALSE) +
    geom_smooth(method = "lm", col = "red", linewidth = 0.8, lty = 2, formula = y~x+I(x^2), se = FALSE) +
    theme_minimal())

  • Use caution: curves can predict silly things
    • does it make sense that height decreases after a certain weight?
  • This is still a linear model: \(\mathbb{E}(y)\) is linear with respect to transformations of x
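
To see that the quadratic model is linear in the transformed predictors, build the columns yourself (a sketch using the kung data from above):

# E(height) = a + b1*weight + b2*weight^2 is linear in these two columns
X = cbind(weight = kung$weight, weight2 = kung$weight^2)
fit_quad = lm(kung$height ~ X)
coef(fit_quad)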

Interactions

  • If we multiply predictors, we produce interactions

Important: Always include the main effects when including an interaction; omitting a main effect forces the effect of that variable to be exactly zero when the other variable is zero, which is rarely justified.

\[ \mathbb{E}(height) = a + b_1 \times weight + b_2 \times age + b_3 \times weight \times age \]
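
In R’s formula interface (a sketch; the kung data include an age column), the * operator expands to exactly this: both main effects plus their interaction.

# height ~ weight * age expands to weight + age + weight:age
fit_int = lm(height ~ weight * age, data = kung)
coef(fit_int)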

Categorical Variables

\[ depth = a + b_1 \times length + b_2 \times species?? \]

Categorical Variables

  • Categorical variables also fit into this scheme
  • However, we need to consider how we want to model the categories!
  • We use dummy coding
  • First assign a reference category; here we will use Adelie
    • The intercept becomes the intercept only for this category
    • So Adelie penguins have an intercept of \(a\)
  • Then create binary columns indicating the other two categories
    • The “intercept” for each category will be \(a\) plus some offset
    • E.g., for Chinstrap, the intercept is \(a + b_2\)
penguins$chinstrap = ifelse(penguins$species == 'Chinstrap', 1, 0)
penguins$gentoo = ifelse(penguins$species == 'Gentoo', 1, 0)

\[ depth = a + b_1 \times length + b_2 \times chinstrap + b_3 \times gentoo \]
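
R can also build these dummy columns automatically; model.matrix() (a sketch, re-loading the full dataset since penguins was subset to Gentoo earlier) uses the first factor level, Adelie, as the reference:

# one column per non-reference species; Adelie is absorbed into the intercept
peng_all = as.data.frame(palmerpenguins::penguins)
peng_all = peng_all[complete.cases(peng_all), ]
head(model.matrix(~ bill_length_mm + species, data = peng_all))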

Categorical Variables

  • We can use interactions to give each category its own slope:

\[ depth = a + b_1 \times length + b_2 \times chinstrap + b_3 \times gentoo + b_4 \times chinstrap \times length + b_5 \times gentoo \times length \]

Matrix Notation

  • Since our model is a linear system, we can use linear algebra to describe it
  • This will greatly simplify the Stan code!
  • X is a matrix of predictors, one row per observation, one column per variable
    • Include transformations! If you have \(x\) and \(x^2\), these are separate columns!
    • Dummy-coded categories also get one column each, except the reference category
  • B is a vector of parameters, one parameter per variable
  • XB (using matrix multiplication) returns \(b_1x_1 + b_2x_2 + \dots\)
  • Now you can change the predictors in your model without changing the model code!

\[ \eta = \mathbb{E}(y) = a + \mathbf{XB} \]

data {
    int <lower = 1> k; // number of predictor variables
    int <lower = 1> n;
    matrix [n, k] X;
    vector [n] y;
}
parameters {
    real a;
    real <lower = 0> s;
    vector [k] B;
}
transformed parameters {
    vector [n] eta = a + X * B;
}
model {
    y ~ normal(eta, s);
    s ~ exponential(0.1);
    a ~ normal(0, 10);
    B ~ normal(0, 5); // prior here is vectorised, we use the same for each B!
}
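
On the R side (a sketch; variable names illustrative), model.matrix() can build X; drop its intercept column, because the Stan model declares the intercept a as a separate parameter:

peng_all = as.data.frame(palmerpenguins::penguins)
peng_all = peng_all[complete.cases(peng_all), ]
# [, -1] drops model.matrix's intercept column
X = model.matrix(~ bill_length_mm + species, data = peng_all)[, -1]
stan_dat = list(n = nrow(X), k = ncol(X), X = X, y = peng_all$bill_depth_mm)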

Further generalising our model

  • A motivating example: Tsuga canadensis mortality in eastern North America.
    • Hemlock seems to prefer moderate temperatures and high precipitation.

Research question: Does mortality decrease with increasing precipitation?

tsuga = readRDS("../vu_advstats_students/data/tsuga.rds")
head(tsuga[, 4:8])
##   born died n annual_mean_temp tot_annual_pp
## 1    1    0 0             3.41          1085
## 2    0    3 5             3.85          1003
## 3    3    0 6             3.45          1076
## 4    1    0 3             3.62          1099
## 5    0    4 4             4.60           989
## 6    4    0 3             4.24          1116

We can visualise this by computing the proportion of trees that died in each plot.
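
A minimal sketch of that visualisation (plots with no trees are dropped to avoid dividing by zero):

library(ggplot2)
ggplot(subset(tsuga, n > 0), aes(x = tot_annual_pp, y = died / n)) +
    geom_point(col = "#555555aa") +
    labs(x = "Total annual precipitation", y = "Proportion died") +
    theme_minimal()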

Mortality in Tsuga

Research question: Does mortality decrease with increasing precipitation?

  • We already used a binomial likelihood to model mortality. What if we modify our linear model this way?
    • Our response variable is \(d\), the number of trees dying, which must be conditioned on \(n\), the number of trees that could die.
    • For the binomial distribution, we already saw that the expected proportion \(\mathbb{E}(d/n) = p\), where \(p\) is the probability of an event
    • We can simply model this with a linear function as before, right?

\[ \begin{aligned} \eta_i & = \mathbb{E}(d_i/n_i) = a + b \times precip_i \\ d_i & \sim \mathrm{Binomial}(n_i, \eta_i) \end{aligned} \]

Mortality in Tsuga: Stan code

Research question: Does mortality decrease with increasing precipitation? \[ \begin{aligned} \eta_i & = \mathbb{E}(d_i/n_i) = a + b \times precip_i \\ d_i & \sim \mathrm{Binomial}(n_i, \eta_i) \end{aligned} \]

data {
    int <lower = 1> nobs;
    vector [nobs] precip;
    int <lower = 1> n [nobs];
    int <lower = 0> died [nobs];
}
parameters {
    real a;
    real b;
}
transformed parameters {
    vector [nobs] eta = a + b * precip;
}
model {
    died ~ binomial(n, eta);
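    // note: eta is unconstrained here, but a binomial probability must lie in [0, 1]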
    a ~ normal(0, 10);
    b ~ normal(0, 10);
}

Mortality in Tsuga: Running the model

Research question: Does mortality decrease with increasing precipitation? \[ \begin{aligned} \eta_i & = \mathbb{E}(d_i/n_i) = a + b \times precip_i \\ d_i & \sim \mathrm{Binomial}(n_i, \eta_i) \end{aligned} \]

# drop NAs from the dataset, drop plots with no trees
tsuga_stan_dat = with(tsuga[n > 0 & complete.cases(tsuga), ], list(
    died = died,
    n = n,
    precip = tot_annual_pp
))
tsuga_stan_dat$nobs = length(tsuga_stan_dat$n)

# compile the model above (file name illustrative)
mort_binom1 = stan_model("stan/tsuga_mort_binom1.stan")
tsuga_fit1 = sampling(mort_binom1, data = tsuga_stan_dat, refresh = 0)
## Error in eval(expr, envir, enclos) : Initialization failed.
## character(0)
## error occurred during calling the sampler; sampling not done

tsuga_fit1 = sampling(mort_binom1, data = tsuga_stan_dat, refresh = 0, 
                      init = list(list(a = 0.46, b = -2.5e-4)), chains=1)
## Warning: There were 871 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
## Warning: Examine the pairs() plot to diagnose sampling problems
## Warning: The largest R-hat is 1.51, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess

Improving the fit

Research question: Does mortality decrease with increasing precipitation?

Problem: This system cannot be linear! The expectation \(\eta\) is a probability, bounded between 0 and 1, but the linear predictor \(a + b \times precip\) is unbounded.

\[ \begin{aligned} \eta_i & = \mathbb{E}(d_i/n_i) = a + b \times precip_i \\ d_i & \sim \mathrm{Binomial}(n_i, \eta_i) \end{aligned} \]
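
The fix previewed on the next slide is the logit link; its inverse maps any real number into (0, 1). In R (a sketch), the inverse logit is available as plogis():

# plogis(eta) = exp(eta) / (1 + exp(eta)), the inverse logit
eta = seq(-6, 6, length.out = 200)
plot(eta, plogis(eta), type = "l", xlab = "linear predictor eta",
     ylab = "p = inverse logit(eta)")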

Improving the fit: Stan round 2

Research question: Does mortality decrease with increasing precipitation? \[ \begin{aligned} \ln \frac{p}{1-p} & = \eta = a + b \times precip \\ p & = \frac{e^{\eta}}{1 + e^{\eta}} \\ d & \sim \mathrm{Binomial}(n, p) \end{aligned} \]

data {
    int <lower = 1> nobs;
    vector [nobs] precip;
    int <lower = 1> n [nobs];
    int <lower = 0> died [nobs];
}
parameters {
    real a;
    real b;
}
transformed parameters {
    vector [nobs] eta = a + b * precip;
    vector <lower = 0, upper = 1> [nobs] p = inv_logit(eta);
}
model {
    died ~ binomial(n, p);
    a ~ normal(0, 10);
    b ~ normal(0, 10);
}

Improving the fit: Stan round 2

Research question: Does mortality decrease with increasing precipitation? \[ \begin{aligned} \ln \frac{p}{1-p} & = \eta = a + b \times precip \\ p & = \frac{e^{\eta}}{1 + e^{\eta}} \\ d & \sim \mathrm{Binomial}(n, p) \end{aligned} \]

# compile the updated model (file name illustrative)
mort_binom2 = stan_model("stan/tsuga_mort_binom2.stan")
tsuga_fit2 = sampling(mort_binom2, data = tsuga_stan_dat, refresh = 0)
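
The research question can then be answered directly from the posterior (a sketch; b is on the log-odds scale, so negative values mean mortality decreases with precipitation):

b_post = rstan::extract(tsuga_fit2, pars = "b")$b
# posterior probability that mortality decreases with precipitation
mean(b_post < 0)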

The Generalised Linear Model

For an observed variable \(y\) and a matrix of predictors \(\mathbf{X}\):

  1. We can define the conditional expectation of \(y\): \(\mathbb{E}(y | \mathbf{X}) = \xi\)
  2. This expectation is connected to a linear predictor \(\eta\) via a link function \(\mathcal{f}\)
    • With optional additional parameters \(\theta_{\mathcal{f}}\)
  3. Randomness in our observation of \(y\) is generated by some distribution function \(\mathcal{D}\)
    • \(\mathcal{D}\) has optional free parameters \(\theta_{\mathcal{D}}\)
    • Often \(\theta_{\mathcal{D}}\) will include dispersion parameters
\[ \begin{aligned} \mathbb{E}(y | \mathbf{X}) & = \xi \\ \mathcal{f}(\xi, \theta_{\mathcal{f}}) & = \eta \\ \eta & = \mathbf{XB}\\ \\ y & \sim \mathcal{D}(\xi, \theta_{\mathcal{D}}) \end{aligned} \]

Example: The Gaussian model

For an observed variable \(y\) and a matrix of predictors \(\mathbf{X}\):

  1. We can define the conditional expectation of \(y\).
  2. The link function is the identity function
    • Which has no additional parameters (\(\theta_{\mathcal{f}} = \varnothing\))
  3. Randomness in our observation of \(y\) is generated by the Gaussian (i.e., Normal) PDF
    • Which has the additional free standard deviation parameter \(s\): \(\theta_{\mathcal{D}} = \{s\}\)

This is identical to our linear model from earlier! Multiple regression is a subset of the GLM.

\[ \begin{aligned} \mathbb{E}(y | \mathbf{X}) & = \xi \\ \mathcal{I}(\xi) & = \eta \\ \eta & = \mathbf{XB} \\ & \therefore\\ \mathbb{E}(y | \mathbf{X}) & = \mathbf{XB}\\ \\ y & \sim \mathcal{N}(\mathbf{XB}, s) \end{aligned} \]

Count Data: Poisson

\[ \begin{aligned} \lambda & = \exp(a + \mathbf{XB}) \\ y & \sim \mathrm{Poisson}(\lambda) \end{aligned} \]

Assumption: \(\sigma^2_{y|\mathbf{X}} = \mathbb{E}(y|\mathbf{X}) = \lambda\)

If sampling effort or exposure \(u\) varies among observations, it can enter the model as a multiplicative offset:

\[ y \sim \mathrm{Poisson}(u\lambda) \]
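
A small simulation sketch of the mean-variance assumption (values illustrative):

# under the Poisson model, the mean and variance are both lambda
set.seed(42)
lambda = 4
y = rpois(1e5, lambda)
c(mean = mean(y), var = var(y))  # both approximately 4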

Overdispersed counts: Negative Binomial

\[ \begin{aligned} \xi & = \exp(a + \mathbf{XB}) \\ y & \sim \mathrm{NB}(\xi, \phi) \end{aligned} \]

\(\phi\) is called the dispersion parameter, and is related to the variance:

\[ \sigma^2_{y|\mathbf{X}} = \xi + \frac{\xi^2}{\phi} \]

Proportions: Beta

\[ \begin{aligned} \xi & = \mathrm{logit}^{-1}(a_\rho + \mathbf{X_\rho B_\rho}) \\ \phi & = \exp(a_\phi + \mathbf{X_\phi B_\phi}) & \mathrm{\small optionally} \\ \alpha & = \xi\phi \\ \beta & = (1 - \xi)\phi \\ y & \sim \mathrm{Beta}(\alpha, \beta) \end{aligned} \]

Caveats & Hints

Continuous, strictly positive: Gamma

\[ \begin{aligned} \xi & = \frac{1}{a + \mathbf{XB}} & \mathrm{OR} \\ \xi & = \exp(a + \mathbf{XB}) \\ \alpha & = \frac{\xi^2}{\phi} \\ \beta & = \frac{\xi}{\phi} \\ y & \sim \mathrm{Gamma}(\alpha, \beta) \end{aligned} \]
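
A quick numerical check of this parameterisation (a sketch; values illustrative): with \(\alpha = \xi^2/\phi\) and \(\beta = \xi/\phi\), draws have mean \(\xi\) and variance \(\phi\).

set.seed(1)
xi = 3; phi = 2                      # illustrative mean and variance
alpha = xi^2 / phi; beta = xi / phi  # shape and rate
y = rgamma(1e5, shape = alpha, rate = beta)
c(mean = mean(y), var = var(y))      # approximately xi and phi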