Guidelines

Find the guidelines on the first exercise sheet.

Task 1: Univariate statistics and plots

For our first task, we will compute statistics on a single variable at a time, and make and customise histograms and boxplots.

Begin by reading the penguin dataset that you saved to your computer for exercise 1, task 3. This time we will use the full dataset, not the subset.

In the following subtasks, you will be asked for summary statistics. Unless otherwise specified, please provide:

  • The mean: mean()
  • The standard deviation: sd()
  • The sample size: there are many ways; try length(), nrow() or table()
  • The standard error: there is no built-in function, so you will need to compute it manually
  • The minimum and maximum: min() and max(), or range()
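As a starting point for the manual calculation, the standard error of the mean is the standard deviation divided by the square root of the sample size. The sketch below uses a made-up vector x (not the penguin data) and assumes that missing values should be ignored, which is why na.rm = TRUE appears and why n counts only the non-missing entries.

```r
# Hypothetical measurements; NA stands in for a missing value
x <- c(4.1, 3.8, NA, 4.4, 4.0)

n  <- sum(!is.na(x))                  # sample size, counting non-missing values only
se <- sd(x, na.rm = TRUE) / sqrt(n)   # standard error of the mean

c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = n, se = se)
```

If your variable has no missing values, length(x) gives the same n.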

1.1. Choose one of the quantitative variables to explore. Produce summary statistics and make a histogram for that variable. Be sure to review the help file for the histogram function to customize your plot. You should label the axes and choose an appropriate number of breaks/bins for the dataset.

1.2. The dataset covers three different species of penguin and two different sexes, all of which likely have different body sizes. For now we will work with a single subsetting variable, species. See if you can figure out how to compute summary statistics for each species separately using the tapply() function. How do the different species differ in whatever variable you chose?
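If tapply() is new to you, the following minimal sketch shows the pattern on a made-up data frame (toy, group and value are placeholder names, not columns from the penguin data). The first argument is the vector to summarise, the second is the grouping variable, and the third is the function to apply to each group.

```r
# Toy data: two groups with known means
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 10, 20, 30)
)

# Apply mean() to value separately for each level of group
group_means <- tapply(toy$value, toy$group, mean)
group_means
```

Any further arguments are passed on to the function, so tapply(toy$value, toy$group, mean, na.rm = TRUE) would ignore missing values.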

1.3. Produce a histogram for each species. This should be presented on a single figure with three panels in a column (one per species). You can do this by running par(mfrow=c(3,1)) before you make your three histograms. How would you change this code to make the figure wide (so one row with the histograms side-by-side)? Some other things to try:

  • The x-axis label only needs to be on the bottom histogram. Turn it off for the other two.
  • Make sure the x-axis limits are the same for all 3 histograms, so you can directly compare them.
  • Include the species name somewhere on each plot.
  • Look at the colour scheme on the official Palmer penguins website. Change the colours filling the bars to match the colours for each species on the website.
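One possible way to organise a multi-panel figure like this is sketched below on simulated data. The data frame toy, its columns, the grey fill colours and the axis label are all placeholders; you will need to substitute your own variable, species names and colours.

```r
set.seed(1)
# Simulated data: three groups of 30 values each
toy <- data.frame(
  group = rep(c("a", "b", "c"), each = 30),
  value = c(rnorm(30, 10), rnorm(30, 12), rnorm(30, 15))
)

groups <- unique(toy$group)
fill   <- c("grey30", "grey60", "grey90")  # placeholder colours
xlim   <- range(toy$value)                 # shared x-axis limits for all panels

old_par <- par(mfrow = c(3, 1))            # three panels in one column
for (i in seq_along(groups)) {
  vals <- toy$value[toy$group == groups[i]]
  hist(vals, xlim = xlim, col = fill[i], main = groups[i],
       # only the bottom panel gets an x-axis label
       xlab = if (i == length(groups)) "Value (units)" else "")
}
par(old_par)                               # restore the previous layout
```

Saving the result of par() and restoring it afterwards means later plots go back to a single panel.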

1.4. Boxplots are useful for visualising datasets with multiple categorical variables. You can create one with the following syntax:

boxplot(variable ~ category1 + category2, data = dataframe)

Be sure to substitute the names: dataframe is the name of the data frame you will use, and variable, category1 and category2 are the names of the variables inside the dataframe.

Use the same numeric variable as your variable here, and species and sex as the categorical variables.

Unfortunately, the default boxplot isn’t very nice to look at. Explore the help file or search online to see if you can make the plot better. Do not use ggplot for this exercise; we will come to it later. Some suggestions to help you improve the plot:

  • Change the font size of the labels
  • Change the text of the labels
  • Change the fill colour of the boxes
  • Change the spacing of the boxes so that the boxes are “grouped” by category1
  • Add a notch
  • What happens if you change the order of category1 and category2? Which do you prefer?

You don’t need to turn in every attempt you make here. Just tweak the figure until it looks good, and turn in the final version.
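To give a feel for the kinds of arguments boxplot() accepts, here is a small sketch on simulated data. The names toy, value, g1 and g2, the colours and the label text are placeholders, and the particular choices are only one option among many.

```r
set.seed(2)
# Simulated data: one numeric variable and two categorical variables
toy <- data.frame(
  value = rnorm(80, mean = 10),
  g1 = rep(c("a", "b"), each = 40),
  g2 = rep(c("x", "y"), times = 40)
)

bp <- boxplot(value ~ g2 + g1, data = toy,
              at = c(1, 2, 4, 5),            # leave a gap between the two g1 groups
              col = c("grey40", "grey85"),   # fill colours, recycled across boxes
              notch = TRUE,                  # draw notches around the medians
              xlab = "Group", ylab = "Value (units)",
              cex.lab = 1.3, cex.axis = 1.1) # larger label and axis text
bp$names                                     # order in which the boxes were drawn
```

With this formula the first term (g2) changes fastest, so neighbouring boxes share the same level of g1.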

Task 2: Bivariate plots

Choose a single species to work with, and subset the data to make a data frame with only that species. Then choose two variables: one will be your x variable and the other your y variable (it doesn’t matter which is which).

2.1. Compute summary statistics for both variables, separately for male and female. In addition to the usual statistics, use the cor() function to compute the correlation between the two variables.
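A short sketch of cor() on made-up vectors follows. By default cor() returns NA if either vector contains a missing value; the use = "complete.obs" argument tells it to use only the pairs where both values are present. Whether you need this depends on whether your chosen variables have missing values.

```r
# Hypothetical paired measurements; one x value is missing
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 8, 10)

cor(x, y)                             # NA, because of the missing value
r <- cor(x, y, use = "complete.obs")  # uses only the four complete pairs
r
```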

2.2. Produce a bivariate scatterplot showing the relationship between the two variables. You can choose to either show males and females on separate plots (but on a single figure, with identical x- and y-axis limits), or together on a single plot. If you choose a single plot, you should use colour to distinguish the sexes.

2.3. You can use the lm() function to calculate a best-fit line for the data. The syntax looks like this:

mod = lm(y ~ x, data = dataframe)

You need to substitute the variable names for y and x, and the name of the data frame for dataframe. Make a linear model separately for males and females (use subset() to create separate data frames for the two sexes, and remember to save them in different variables, such as mod_male and mod_female). You can use the print() function to view the intercept and slope. Are the two slopes very different?
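The workflow of subsetting and then fitting can be sketched as follows, using a made-up data frame whose lines are known exactly (toy, group, mod_a and mod_b are placeholder names). Besides print(), the coef() function returns the intercept and slope as a named vector, which makes them easy to compare.

```r
# Toy data: group "a" follows y = 1 + 2x, group "b" follows y = 2 + 0.5x
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  x = rep(1:4, times = 2),
  y = c(3, 5, 7, 9, 2.5, 3, 3.5, 4)
)

# One model per group, each fitted to its own subset
mod_a <- lm(y ~ x, data = subset(toy, group == "a"))
mod_b <- lm(y ~ x, data = subset(toy, group == "b"))

coef(mod_a)   # intercept and slope for group "a"
coef(mod_b)   # intercept and slope for group "b"
```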

2.4. Add the best-fit line to your plot(s) from 2.2. To do this, you need to make a plot annotation. This is a function you run after you make a plot to add something to the plot. Repeat the plot() command to make the first plot from 2.2. Then, on the next line, run abline(model) (substitute the model name for model). You can also add colour to match the colours from the plot.

Hint: If you made two separate plots for each sex, you will need to run the first plot(), then abline(), then the second plot(), then the second abline(), in that order.
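The ordering described in the hint can be sketched like this, again on made-up data with placeholder names and colours. Each abline() call draws on whichever panel was created most recently, which is why it must come directly after its own plot().

```r
toy <- data.frame(
  group = rep(c("a", "b"), each = 4),
  x = rep(1:4, times = 2),
  y = c(3.1, 4.9, 7.2, 8.8, 2.4, 3.1, 3.4, 4.1)
)
toy_a <- subset(toy, group == "a")
toy_b <- subset(toy, group == "b")
mod_a <- lm(y ~ x, data = toy_a)
mod_b <- lm(y ~ x, data = toy_b)

xlim <- range(toy$x)   # shared axis limits so the panels are comparable
ylim <- range(toy$y)

old_par <- par(mfrow = c(2, 1))
plot(y ~ x, data = toy_a, xlim = xlim, ylim = ylim, col = "grey30", main = "a")
abline(mod_a, col = "grey30")   # line for the first panel
plot(y ~ x, data = toy_b, xlim = xlim, ylim = ylim, col = "grey60", main = "b")
abline(mod_b, col = "grey60")   # line for the second panel
par(old_par)
```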