Find the guidelines on the first exercise sheet.
For our first task, we will explore doing statistics on a single variable at a time, and making and customising histograms and boxplots.
Begin by reading the penguin dataset that you saved to your computer for exercise 1, task 3. This time we will use the full dataset, not the subset.
In the following subtasks, you will be asked for summary statistics. Unless otherwise specified, please provide:
mean
sd
length() or nrow() or table()
min() and max(), or range()
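As a rough illustration of these functions, here is a minimal sketch on a tiny made-up data frame. The name penguins, the column name body_mass_g, and all values are assumptions for illustration only; your own data frame and columns may be named differently.

```r
# Toy data with invented values; this is NOT the real penguin dataset
penguins <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  body_mass_g = c(3400, 3900, NA, 5400, 4700, 5600)
)

x <- penguins$body_mass_g

# na.rm = TRUE drops missing values; without it the result is NA
x_mean <- mean(x, na.rm = TRUE)
x_sd <- sd(x, na.rm = TRUE)

# length(x) counts every entry, including NA, so count non-missing values instead
x_n <- sum(!is.na(x))

# range() returns the minimum and maximum together
x_range <- range(x, na.rm = TRUE)

print(c(mean = x_mean, sd = x_sd, n = x_n))
print(x_range)
```

If your chosen variable contains missing values, check whether each function needs na.rm = TRUE.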
1.1. Choose one of the quantitative variables to explore. Produce summary statistics and make a histogram for that variable.
Be sure to review the help file for the histogram function to customize
your plot. You should label the axes and choose an appropriate number of
breaks/bins for the dataset.
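The sketch below shows one way to label the axes and suggest a number of bins. It uses simulated values rather than the penguin data, and the variable name mass, the labels, and the choice of 15 breaks are assumptions to adapt to your own variable.

```r
# Simulated values for illustration only (not real measurements)
set.seed(1)
mass <- rnorm(100, mean = 4200, sd = 800)

# A single number for breaks is only a suggestion; R picks "pretty" break points
h <- hist(mass,
          breaks = 15,
          xlab = "Body mass (g)",
          ylab = "Frequency",
          main = "Distribution of body mass")

# Inspect the break points that were actually used
print(h$breaks)
```

Try a few different values for breaks and compare how the shape of the histogram changes.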
1.2. The dataset covers three different species of penguin and two
different sexes, all of which likely have different body sizes. For now
we will work with a single subsetting variable, species.
See if you can figure out how to compute summary statistics for each species separately using the tapply() function.
How do the different species differ in whatever variable you chose?
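If you get stuck, the general pattern for tapply() looks roughly like the sketch below. The data frame, the column name body_mass_g, and the values are made up for illustration; extra arguments such as na.rm = TRUE are passed on to the function being applied.

```r
# Toy data with invented values; this is NOT the real penguin dataset
penguins <- data.frame(
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  body_mass_g = c(3400, 3900, NA, 5400, 4700, 5600)
)

# tapply(values, groups, function, ...) applies the function within each group
group_means <- tapply(penguins$body_mass_g, penguins$species, mean, na.rm = TRUE)
group_sds <- tapply(penguins$body_mass_g, penguins$species, sd, na.rm = TRUE)

# table() counts the rows per group (including rows with missing values)
group_counts <- table(penguins$species)

print(group_means)
print(group_sds)
print(group_counts)
```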
1.3. Produce a histogram for each species. This should be presented on a single figure with three panels in a column (one per species). You can do this by running par(mfrow=c(3,1)) before you make your three histograms. How would you change this code to make the figure wide
(so one row with the histograms side-by-side)? Some other things to
try:
1.4. Boxplots are useful for visualising datasets with multiple categorical variables. You can create one with the following syntax:
boxplot(variable ~ category1 + category2, data = dataframe)
Be sure to substitute the names: dataframe is the name of the data frame you will use, and variable, category1 and category2 are the names of the variables inside the data frame. Use the same numeric variable as your variable here, and species and sex as the categorical variables.
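To see how the formula syntax behaves, here is a small sketch on simulated data. The data frame toy, its column names, the group labels, and the colour choices are all hypothetical; substitute your own names as described above.

```r
# Simulated data for illustration only (three groups, two sexes)
set.seed(42)
toy <- data.frame(
  species = rep(c("A", "B", "C"), each = 20),
  sex = rep(c("female", "male"), times = 30),
  mass = rnorm(60, mean = 4000, sd = 500)
)

# The first variable after ~ varies fastest, so boxes alternate female/male
# within each species and the two colours are recycled across the six boxes
bp <- boxplot(mass ~ sex + species,
              data = toy,
              col = c("orange", "steelblue"),
              las = 2,          # rotate axis labels so they do not overlap
              xlab = "",
              ylab = "Body mass (g)")

# The group labels, in plotting order
print(bp$names)
```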
Unfortunately, the default boxplot isn’t very nice to look at.
Explore the help file or search online to see if you can make the plot
better. Do not use ggplot for this exercise; we will come to this later. Some suggestions to help you improve the plot:
category1
category1 and category2? Which do you prefer?
You don’t need to turn in every attempt you make here. Just tweak the figure until it looks good, and turn in the final version.
Choose a single species to work with, and subset the data to make a data frame with only that species. Then choose two variables: one will be your x variable and one the y (it doesn’t matter which is which).
2.1. Compute summary statistics for both variables, separately for male and female. In addition to the usual statistics, use the cor() function to compute the correlation between the two variables.
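One thing to watch for with cor() is missing values. The sketch below uses simulated data (the data frame toy and its columns x, y and sex are hypothetical) to show that cor() returns NA when either variable contains a missing value, unless you tell it how to handle them.

```r
# Simulated data for illustration only
set.seed(7)
toy <- data.frame(
  sex = rep(c("female", "male"), each = 10),
  x = rnorm(20)
)
toy$y <- 2 * toy$x + rnorm(20)
toy$y[3] <- NA  # introduce one missing value

females <- subset(toy, sex == "female")

# With a missing value present, the default result is NA
r_default <- cor(females$x, females$y)

# use = "complete.obs" drops the rows where either variable is missing
r_female <- cor(females$x, females$y, use = "complete.obs")

# The same calculation for every sex at once
r_by_sex <- sapply(split(toy, toy$sex),
                   function(d) cor(d$x, d$y, use = "complete.obs"))

print(r_default)
print(r_by_sex)
```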
2.2. Produce a bivariate scatterplot showing the correlation between the two variables. You can either show males and females on separate plots (but on a single figure, with identical x- and y-axis limits) or together on a single plot. If you choose a single plot, you should use colour to show the sexes.
2.3. You can use the lm() function to calculate a best-fit line for the data. The syntax looks like this:
mod = lm(y ~ x, data = dataframe)
You need to substitute the variable names for y and x, and the name of the data frame for dataframe. Make a linear model separately for males and females (use subset() to create separate data frames for the two sexes, and remember to save the two models in different variables, such as mod_male and mod_female). You can use the print() function to view the intercept and slope. Are the two slopes very different?
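A minimal sketch of this workflow on simulated data is shown below. The data frame toy and its columns are hypothetical stand-ins for your own single-species data frame, and coef() is used here as one convenient way to pull out just the intercept and slope.

```r
# Simulated data for illustration only
set.seed(3)
toy <- data.frame(
  sex = rep(c("female", "male"), each = 15),
  x = runif(30, min = 30, max = 50)
)
toy$y <- 100 + 5 * toy$x + rnorm(30, sd = 10)

# One data frame per sex
toy_female <- subset(toy, sex == "female")
toy_male <- subset(toy, sex == "male")

# One model per sex, each saved under its own name
mod_female <- lm(y ~ x, data = toy_female)
mod_male <- lm(y ~ x, data = toy_male)

# coef() returns the intercept and slope as a named vector
print(coef(mod_female))
print(coef(mod_male))
```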
2.4. Add the best-fit line to your plot(s) from 2.2. To do this, you
need to make a plot annotation. This is a function you
run after you make a plot to add something to the plot. Repeat the
plot() command to make the first plot from 2.2. Then, on the next line, run abline(model) (substitute the model name for model). You can also add colour to match the colours from the plot.
Hint: If you made a separate plot for each sex, you will need to run the first plot(), then abline(), then the second plot(), then the second abline(), in that order.
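For the single-plot version, the order of commands might look like the sketch below. It uses simulated data, and the names toy, x, y and the two colours are assumptions; the point is only that each abline() call draws onto whichever plot was made most recently.

```r
# Simulated data for illustration only
set.seed(3)
toy <- data.frame(
  sex = rep(c("female", "male"), each = 15),
  x = runif(30, min = 30, max = 50)
)
toy$y <- 100 + 5 * toy$x + rnorm(30, sd = 10)

mod_female <- lm(y ~ x, data = subset(toy, sex == "female"))
mod_male <- lm(y ~ x, data = subset(toy, sex == "male"))

# One colour per row, chosen by sex
point_cols <- ifelse(toy$sex == "female", "orange", "steelblue")

# First the plot, then the annotations that are drawn on top of it
plot(y ~ x, data = toy, col = point_cols, pch = 19)
abline(mod_female, col = "orange")
abline(mod_male, col = "steelblue")
legend("topleft", legend = c("female", "male"),
       col = c("orange", "steelblue"), pch = 19)
```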