For these exercises, you will need the Howell1
dataset
from the rethinking
package. We don’t need the rest of the
package, so we will just download the data directly from the package’s
github repository
library(data.table)
Howell1 = fread("https://github.com/rmcelreath/rethinking/raw/master/data/Howell1.csv")
These data, collected by anthropologist Nancy Howell in the 1960s,
provide the age, height, weight, and sex sampled from a population of
!Kung in Dobe (Namibia & Botswana). Note that the male
variable is categorical, and is equal to 1 if the subject is male, and 0
if female.
- Do some graphical data exploration. The data are multivariate, so
multiple plots may be necessary.
- Design a regression model with
height
as the
response (i.e., outcome, y) variable. For your initial
model, use only weight
as a predictor.
- Build a graphical model as shown in lecture.
- Translate the graphical model into equations. Be sure that all
unknowns (including parameters) appear on the left side of either ~ or
=.
- Fill in a Stan model block with the likelihood and priors for all
stocahstic nodes
- Complete your Stan program, then get a MAP estimate using
optimizing
.
- Make a plot of the best-fit line; include the original data on the
plot as well.
- Plotting this model will likely reveal it to be inadequate, because
there is a substantial “curve” in the height-weight relationship. Repeat
the exercise for #2, but update your model to better predict height.
Possible approaches might include filtering the data (but I encourage
you to use it all if possible), writing a curvilinear equation, or
adding additional predictors. Compare the results of your attempts, and
choose a single model that you think works “best.” Repeat a-e above for
this model. Note any changes in the relationship between height and
weight caused by your changes to the model. Additionally:
- Estimate the entire posterior using
sampling
- What are the credible intervals for the parameters of your
model?
- What is the probability that the height of a 24-year-old female with
a weight of 41 kg is between 130 and 145 cm? What about for a male?
- Compute a posterior prediction interval for each case in the
original dataset. You should have a median or MAP estimate and an
interval for height along side the actual measured values for height
from the original data. Plot the original height observations on the
x-axis, and the predicted MAP/median on the y-axis. Experiment with the
segments
function (or geom_errorbars
in
ggplot2
) to see if you can draw the intervals as vertical
lines on the plot. Add a 1:1 line to the plot. What does this plot tell
you about your model fit? What is the expected relationship for a “good”
model?
Bonus
- Return to one of the plots you made showing height against weight.
It appears that the variance in height is not constant with respect to
weight. This is one of the key assumptions in linear regression, and you
probably made this implicitly in your model (the s parameter, the
standard deviation, is constant). However, Bayesian models need not be
so rigid.
- Can you design a model that allows the variance to increase as
weight increases? Use whatever predictors for height that you think are
best.
- Fit the model and compare the fit to your original model.
- Buried in the bottom of a field notebook, you find two cases that
were missing from the original dataset. The first is an individual with
a weight of 43.72; no height, age, or sex data is recorded. The second
is a 38-year-old female with a height of 135.
- Using the model from #3 above, can you predict a 90% credible
interval for height of the first missing case? Is it easier if you use
the model from #2?
- The second missing case is more interesting; we have the outcome of
our model, but we are missing the weight. How could you estimate a 90%
CI for weight, using the model from #3 (i.e., keeping height as the
response variable)?