1. Load the tsuga dataset from this repository.
library(data.table)
## working directory should be: vu_advstats_students
tsuga = readRDS("data/tsuga.rds")

## if you don't have the repository saved locally
# tsuga = readRDS(url("https://github.com/ltalluto/vu_advstats_students/raw/main/data/tsuga.rds"))

This dataset gives statistics about Tsuga canadensis observed over multiple years in forest plots in eastern Canada. Included are the number of trees born, the number observed to have died that year, the total number of trees (including dead ones) n, and the climate. Filter the dataset so that it contains only observations from the year 2005 with at least 1 individual (n > 0)

  1. Write a function taking three parameters:
    1. p: a single value, the probability that a randomly chosen individual is dead
    2. n: a vector, the number of trees in each plot
    3. k: a vector, the number of dead trees in each plot
    4. The function should return the log likelihood: \(\log pr(n,k|p)\)
# n and k are vectors
lfun = function(p, n, k) {
   ## function body
   return() ## a single value!
}
  1. Plot the log likelihood across various values of p
    • Hint: you will need to run lfun once for each value of p. This is most efficiently accomplished using sapply, but a for loop will also work.
  2. Use optim to find the MLE for p
  3. Is the answer different from mean(dat$died/dat$n)? If so, why?
  4. Write two more functions, one to estimate a prior for p, and one to compute the log posterior. You may choose the prior distribution and hyperparameters as you like, but they should respect the constraint that p must be between zero and one.
  5. Plot the prior and unnormalized posterior. Compare plots with different prior hyperparameters
  6. Compute the maximum a posteriori estimate for p using optim.
  7. Write a stan model to compute the maximum a posteriori estimate for p, and then fit it using optimizing.

Bonus

  1. Write a Stan program to estimate the mean number of trees per plot.
    1. What kind of statistical process could generate these data?
    2. What is an appropriate likelihood function? prior?
    3. Compute the MAP estimate.
    4. How does the MAP compare to mean(dat$n)?
  2. Use the MAP estimate from 2c and the r****() function corresponding to the likelihood function to simulate a new dataset, with as many observations as in the original data.
    1. Compare a histogram and summary statistics of your simulated data with the real data. Note any similarities and differences.
    2. Consider how you could improve this generative model to better describe the dataset. Do you need a new likelihood function?