Lecture 2

Summary Statistics & Base Graphics

Introduction to R for Biologists - Lauren Talluto

Summary Statistics in R

Helpful functions to try

penguins = read.csv("data/penguins.csv")
y = penguins$body_mass_g
summary(y)
# minimum, maximum
range(y)
min(y)
max(y)
# central tendency
mean(y)
median(y)
# variability
var(y)
sd(y)

Dealing with missing values

The penguin dataset has some NA values.

Check the help file with ?mean. Do you see any option for dealing with NAs?

The help file for the mean function
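
The help file lists an na.rm argument. A minimal sketch of what it does, using a small made-up vector (y_demo is hypothetical, not the penguin data) so that it runs on its own:

```r
# hypothetical body masses with one missing value
y_demo = c(3750, 3800, NA, 3450)

mean(y_demo)                 # NA: one missing value makes the whole result NA
mean(y_demo, na.rm = TRUE)   # drops the NA first, then averages the rest
sd(y_demo, na.rm = TRUE)     # many other summary functions accept na.rm too
sum(is.na(y_demo))           # count how many values are missing
```

Using na.rm = TRUE only ignores the missing values for that one calculation; the data themselves are not changed.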

Removing missing values

You can use subset() with complete.cases() to remove ALL rows that have at least one missing value.

penguins = read.csv("data/penguins.csv")
nrow(penguins)
## [1] 344

head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007

penguins_no_na = subset(penguins, complete.cases(penguins))
nrow(penguins_no_na)
## [1] 333

head(penguins_no_na)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
## 7  Adelie Torgersen           38.9          17.8               181        3625
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 5 female 2007
## 6   male 2007
## 7 female 2007
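
To see what complete.cases() returns, here is a small sketch on a made-up data frame (demo and its values are hypothetical). It also shows a gentler alternative, assuming only one column matters for your analysis:

```r
# hypothetical mini data frame with NAs in two different columns
demo = data.frame(
    species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
    body_mass_g = c(3750, NA, 5000, 4800),
    sex = c("male", "female", NA, "female")
)

# TRUE where a row has no NA in any column
complete.cases(demo)

# keep only the complete rows
demo_no_na = subset(demo, complete.cases(demo))
nrow(demo_no_na)

# if only body mass matters, filter on that column alone and keep more rows
demo_mass = subset(demo, !is.na(body_mass_g))
nrow(demo_mass)
```

Dropping every incomplete row is simple, but it can throw away rows that are only missing a variable you never use.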

Dealing with partitioned data

Our data have meaningful partitions, especially by species and sex!

Our summary statistics will make more sense if we compute them separately by species/sex.

with(penguins_no_na,
    plot(body_mass_g, bill_length_mm, col = factor(species), pch = 16)
)

Computing by partition, the slow way

mean(penguins_no_na$bill_length_mm[penguins_no_na$species == "Adelie"])
## [1] 38.82397

mean(penguins_no_na$bill_length_mm[penguins_no_na$species == "Chinstrap"])
## [1] 48.83382

mean(penguins_no_na$bill_length_mm[penguins_no_na$species == "Gentoo"])
## [1] 47.56807

Computing by partition, the smarter way

tapply(penguins_no_na$bill_length_mm, # first the variable you want to summarise
       penguins_no_na$species, # then the partitioning variable
       mean) # then the function you want to use for summarising
##    Adelie Chinstrap    Gentoo 
##  38.82397  48.83382  47.56807
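
tapply() passes any extra arguments on to the summarising function, so na.rm = TRUE also works here. A small sketch with made-up vectors (mass and species below are hypothetical, not the penguin data):

```r
# hypothetical measurements, with one NA in each group
mass = c(3750, NA, 3250, 5000, 4800, NA)
species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo")

# without na.rm, each group containing an NA returns NA
tapply(mass, species, mean)

# extra arguments after the function are handed to mean()
group_means = tapply(mass, species, mean, na.rm = TRUE)
group_means

# you can also write your own function, e.g. to count non-missing values
tapply(mass, species, function(x) sum(!is.na(x)))
```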

Multiple partitions

tapply(penguins_no_na$bill_length_mm, 
       penguins_no_na[, c('species', 'sex')], 
       mean)
##            sex
## species       female     male
##   Adelie    37.25753 40.39041
##   Chinstrap 46.57353 51.09412
##   Gentoo    45.56379 49.47377

R base graphics

A brief summary

Histograms

A histogram shows the distribution of a single variable.

The default from hist() is pretty ugly.

hist(penguins_no_na$body_mass_g)

Histograms

There are many options you can adjust.

hist(penguins_no_na$body_mass_g,
     main = "", # disables the title
     xlab = "Penguin Body Mass (g)",
     breaks = 15, # adjust the number of bins
     col = 'skyblue', border = 'darkblue')

Colours - named colours

R has 657 named colours, for example col = 'rosybrown'.
See the colors() function for the names.
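
A short sketch of how you might explore the named colours from the console (the search term "blue" is just an example):

```r
# how many named colours are there?
length(colors())

# the first few names
head(colors())

# search the names for a word
blues = grep("blue", colors(), value = TRUE)
head(blues)

# red, green, and blue amounts (0-255) behind a name
col2rgb("rosybrown")
```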

Colours - 24-bit colour

R supports colours using the common HTML colour coding: #RRGGBB
- RR, GG, BB are the amounts of red, green, and blue
- each is a two-digit hexadecimal number from 00 (none) to FF (maximum)
- Online colour pickers help you translate a real-life colour into a colour code

hist(penguins_no_na$body_mass_g, col = "#77aa11", border = "#005500")
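
If you prefer to think in decimal amounts, rgb() and col2rgb() convert between the two notations. A small sketch using the same green as above (hex_green is a hypothetical variable name):

```r
# build a hex code from red, green, blue amounts on a 0-255 scale
hex_green = rgb(119, 170, 17, maxColorValue = 255)
hex_green   # "#77AA11"

# and back again: hex code to red, green, blue amounts
col2rgb("#77aa11")

# hex codes are not case sensitive, so both spellings give the same colour
identical(col2rgb("#77aa11"), col2rgb("#77AA11"))
```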

Colour advice

Best practice: choose colours using reputable packages and tools grounded in colour theory
- scico (continuous data)
- viridis (continuous data)
- RColorBrewer (categorical data)
- iWantHue (generates custom categorical palettes)
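
If you do not want to install anything, base R (version 3.6 or newer) offers similar palettes through hcl.colors(). This function is not mentioned in the list above, so treat the following as one possible sketch rather than the recommended route:

```r
# 10 colours from a viridis-like palette, for continuous data
pal_cont = hcl.colors(10, palette = "Viridis")

# 3 colours from a qualitative palette, for categories
pal_cat = hcl.colors(3, palette = "Dark 2")

# quick preview of a palette: one bar per colour
barplot(rep(1, 10), col = pal_cont, border = NA, axes = FALSE)

# list all palette names available in hcl.colors()
head(hcl.pals())
```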

Scatterplots

For showing how two variables covary.

plot(penguins$body_mass_g, penguins$bill_length_mm, 
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)")

Scatterplots

I can create a variable to store the colour I want to use for each point.

penguins$color[penguins$species == "Adelie"] = "#ff561b"
penguins$color[penguins$species == "Chinstrap"] = "#b952c0"
penguins$color[penguins$species == "Gentoo"] = "#00676a"
plot(penguins$body_mass_g, penguins$bill_length_mm, 
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)",
     col = penguins$color, pch = 16) # pch=16 uses a solid circle

Scatterplots

I can do the same for sex. ifelse() works great if you have 2 categories!

penguins$symbol = ifelse(penguins$sex == "male", 15, 17)
plot(penguins$body_mass_g, penguins$bill_length_mm, 
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)",
     col = penguins$color, pch = penguins$symbol)
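
With more than two categories, one alternative to ifelse() is a named lookup vector. The sketch below uses made-up vectors (species, sex, and species_cols are hypothetical) and also shows what ifelse() does with a missing value:

```r
# hypothetical categories for four individuals
species = c("Adelie", "Gentoo", "Chinstrap", "Adelie")
sex = c("male", "female", NA, "female")

# one colour per species, stored as a named vector
species_cols = c(Adelie = "#ff561b", Chinstrap = "#b952c0", Gentoo = "#00676a")

# indexing by name picks the right colour for every individual
point_cols = unname(species_cols[species])
point_cols

# ifelse() returns NA where the test is NA; a point with pch = NA is not drawn
symbol = ifelse(sex == "male", 15, 17)
symbol
```

The lookup vector keeps all category-to-colour choices in one place, which is easier to edit than one assignment line per category.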

Annotations

You may annotate plots by running commands after you use the plot() command. Things to try:

legend() # create a legend
abline() # create a line if you know intercept/slope
lines() # more general function for adding lines
points() # add x-y points
text() # Add text annotations
mtext() # Add text to plot margins
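
A small sketch that layers several of these commands onto one plot. The data are made up (x_demo and y_demo are hypothetical), and the intercept and slope are simply chosen by eye:

```r
# hypothetical x-y data
x_demo = c(1, 2, 3, 4, 5)
y_demo = c(2.1, 3.9, 6.2, 7.8, 10.1)

# always start with plot(); annotations are added on top afterwards
plot(x_demo, y_demo, pch = 16, xlab = "x", ylab = "y")

abline(a = 0, b = 2, col = "red")          # line with intercept 0 and slope 2
abline(h = mean(y_demo), lty = 2)          # horizontal dashed line at the mean
points(3, 8, pch = 17, col = "blue")       # one extra point
text(3, 8, labels = "extra point", pos = 3) # label placed above that point
mtext("Annotated demo plot", side = 3, line = 1) # text in the top margin
legend("topleft", legend = c("data", "y = 2x"),
       pch = c(16, NA), lty = c(NA, 1), col = c("black", "red"))
```

Each command draws on the currently open plot, so the order matters: later elements are drawn over earlier ones.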

Add a legend

plot(penguins$body_mass_g, penguins$bill_length_mm, 
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)",
     col = penguins$color, pch = penguins$symbol)
legend("bottomright", 
       legend = c("Adelie", "Chinstrap", "Gentoo", "male", "female"),
       col = c("#ff561b", "#b952c0", "#00676a", "black", "black"), 
       pch = c(16, 16, 16, 15, 17))

Boxplots: For summarizing across partitions

Boxplots (sometimes called box-and-whisker diagrams) summarize key statistics.
- median
- first and third quartiles (hinges)
- approx. 95% confidence interval for the median (notch)
- min/max, or the most extreme point within 1.5*IQR of the box (whiskers)

They are very useful for comparing variables among groups.

y ~ group is a special data type called a formula
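
To see the numbers behind a boxplot, boxplot.stats() returns them without drawing anything. A sketch with made-up data (y_demo and group are hypothetical), where group "a" contains one extreme value:

```r
# hypothetical data: two groups of five values
group = c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
y_demo = c(1, 2, 3, 4, 100, 2, 4, 6, 8, 10)

# a formula is an object like any other
f = y_demo ~ group
class(f)

# lower whisker, lower hinge, median, upper hinge, upper whisker for group "a"
stats_a = boxplot.stats(y_demo[group == "a"])
stats_a$stats

# values beyond the whiskers are reported separately and drawn as points
stats_a$out

# the formula tells boxplot() to draw one box per group
boxplot(f)
```

Here the upper whisker of group "a" stops at 4, not at 100, because 100 lies more than 1.5*IQR above the box.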

Boxplots: For summarizing across partitions

boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)

Making a boxplot

cols = c("#ff561b", "#b952c0", "#00676a")
boxplot(body_mass_g ~ species+sex, data = penguins, 
        # set the colours, repeated twice (female and male)
        col = rep(cols, 2), 
        # axis label text size
        cex.axis = 0.8,
        # labels under the boxes
        names = c("", "Female", "", "", "Male", ""), 
        # set axis titles
        xlab = "", ylab = "Body mass (g)", 
        # disable the box around the plot
        bty = 'n', 
        notch = TRUE
)
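# add a legend to the plot; sort() puts the species in the same alphabetical
# order that boxplot() uses, so each name lines up with its colour in cols
legend("topleft", legend = sort(unique(penguins$species)), 
       title = "Species", fill = cols, bty = 'n')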

Boxplots: For summarizing across partitions

Add a line with abline() to distinguish male vs female.

Or experiment with the at = argument.
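
Both ideas can be sketched as follows. The data frame demo is made up so the example runs on its own, and the positions in pos are just one possible choice:

```r
# hypothetical data: 5 individuals for each species-sex combination
demo = expand.grid(rep = 1:5,
                   species = c("Adelie", "Chinstrap", "Gentoo"),
                   sex = c("female", "male"))
demo$mass = 3500 + 100 * demo$rep + ifelse(demo$sex == "male", 500, 0)
cols = c("#ff561b", "#b952c0", "#00676a")

# option 1: boxes sit at x = 1 to 6, so x = 3.5 falls between the sexes
boxplot(mass ~ species + sex, data = demo, col = rep(cols, 2),
        names = c("", "Female", "", "", "Male", ""),
        xlab = "", ylab = "Body mass (g)")
abline(v = 3.5, lty = 2)

# option 2: use at = to leave an empty position between the two groups
pos = c(1, 2, 3, 5, 6, 7)
boxplot(mass ~ species + sex, data = demo, at = pos, col = rep(cols, 2),
        names = c("", "Female", "", "", "Male", ""),
        xlab = "", ylab = "Body mass (g)")
```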
