For these exercises, you will need the Howell1 dataset from the rethinking package. We don’t need the rest of the package, so we will just download the data directly from the package’s github repository

library(data.table)
Howell1 = fread("https://github.com/rmcelreath/rethinking/raw/master/data/Howell1.csv")

These data, collected by anthropologist Nancy Howell in the 1960s, provide the age, height, weight, and sex sampled from a population of !Kung in Dobe (Namibia & Botswana). Note that the male variable is categorical, and is equal to 1 if the subject is male, and 0 if female.

  1. Do some graphical data exploration. The data are multivariate, so multiple plots may be necessary.
  2. Design a regression model with height as the response (i.e., outcome, y) variable. For your initial model, use only weight as a predictor.
    1. Build a graphical model as shown in lecture.
    2. Translate the graphical model into equations. Be sure that all unknowns (including parameters) appear on the left side of either ~ or =.
    3. Fill in a Stan model block with the likelihood and priors for all stocahstic nodes
    4. Complete your Stan program, then get a MAP estimate using optimizing.
    5. Make a plot of the best-fit line; include the original data on the plot as well.
  3. Plotting this model will likely reveal it to be inadequate, because there is a substantial “curve” in the height-weight relationship. Repeat the exercise for #2, but update your model to better predict height. Possible approaches might include filtering the data (but I encourage you to use it all if possible), writing a curvilinear equation, or adding additional predictors. Compare the results of your attempts, and choose a single model that you think works “best.” Repeat a-e above for this model. Note any changes in the relationship between height and weight caused by your changes to the model. Additionally:
    1. Estimate the entire posterior using sampling
    2. What are the credible intervals for the parameters of your model?
    3. What is the probability that the height of a 24-year-old female with a weight of 41 kg is between 130 and 145 cm? What about for a male?
    4. Compute a posterior prediction interval for each case in the original dataset. You should have a median or MAP estimate and an interval for height along side the actual measured values for height from the original data. Plot the original height observations on the x-axis, and the predicted MAP/median on the y-axis. Experiment with the segments function (or geom_errorbars in ggplot2) to see if you can draw the intervals as vertical lines on the plot. Add a 1:1 line to the plot. What does this plot tell you about your model fit? What is the expected relationship for a “good” model?

Bonus

  1. Return to one of the plots you made showing height against weight. It appears that the variance in height is not constant with respect to weight. This is one of the key assumptions in linear regression, and you probably made this implicitly in your model (the s parameter, the standard deviation, is constant). However, Bayesian models need not be so rigid.
    1. Can you design a model that allows the variance to increase as weight increases? Use whatever predictors for height that you think are best.
    2. Fit the model and compare the fit to your original model.
  2. Buried in the bottom of a field notebook, you find two cases that were missing from the original dataset. The first is an individual with a weight of 43.72; no height, age, or sex data is recorded. The second is a 38-year-old female with a height of 135.
    1. Using the model from #3 above, can you predict a 90% credible interval for height of the first missing case? Is it easier if you use the model from #2?
    2. The second missing case is more interesting; we have the outcome of our model, but we are missing the weight. How could you estimate a 90% CI for weight, using the model from #3 (i.e., keeping height as the response variable)?