Introduction to R for Biologists - Lauren Talluto
penguins = read.csv("data/penguins.csv")
y = penguins$body_mass_g
summary(y)
# minimum, maximum
range(y)
min(y)
max(y)
# central tendency
mean(y)
median(y)
# variability
var(y)
sd(y)
The penguin dataset has some NA values.
Check the help functions: ?mean. Do you see any option for dealing with NAs?
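The help page for mean() documents an na.rm argument. A minimal sketch of its effect, using a made-up vector (the values are illustrative and not taken from the penguin data):

```r
# Toy vector with one missing value (illustrative values only)
y_toy = c(3750, 3800, NA, 3450)

mean(y_toy)                # NA: a missing value propagates by default
mean(y_toy, na.rm = TRUE)  # drops the NA before averaging
sd(y_toy, na.rm = TRUE)    # sd(), var(), median(), min(), max() accept it too
```

Note that na.rm only affects the one call; the missing values stay in the data.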
penguins = read.csv("data/penguins.csv")
nrow(penguins)
## [1] 344
head(penguins)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 4  Adelie Torgersen             NA            NA                NA          NA
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 4   <NA> 2007
## 5 female 2007
## 6   male 2007
You can use subset() with complete.cases() to remove ALL rows that have at least one missing value.
penguins_no_na = subset(penguins, complete.cases(penguins))
nrow(penguins_no_na)
## [1] 333
head(penguins_no_na)
##   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1  Adelie Torgersen           39.1          18.7               181        3750
## 2  Adelie Torgersen           39.5          17.4               186        3800
## 3  Adelie Torgersen           40.3          18.0               195        3250
## 5  Adelie Torgersen           36.7          19.3               193        3450
## 6  Adelie Torgersen           39.3          20.6               190        3650
## 7  Adelie Torgersen           38.9          17.8               181        3625
##      sex year
## 1   male 2007
## 2 female 2007
## 3 female 2007
## 5 female 2007
## 6   male 2007
## 7 female 2007
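To see what complete.cases() actually returns, here is a small sketch on a hypothetical data frame (the name toy and its values are made up for illustration):

```r
# Toy data frame (hypothetical values) with missing entries in two rows
toy = data.frame(
    mass = c(3750, NA, 3250, 3450),
    sex  = c("male", "female", NA, "female")
)

# One logical value per row: TRUE if the row has no NA in any column
keep = complete.cases(toy)
keep

toy_no_na = subset(toy, keep)
nrow(toy_no_na)

# If only one column matters for an analysis, filter on that column alone
# so that rows with an NA elsewhere are kept
mass_only = subset(toy, !is.na(mass))
nrow(mass_only)
```

Dropping every incomplete row is the simplest choice, but it can discard more data than a given analysis requires.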
Our data have meaningful partitions, especially by species and sex!
Our summary statistics will make more sense if we compute them separately by species/sex.
with(penguins_no_na, plot(body_mass_g, bill_length_mm, col = factor(species), pch = 16))
mean(penguins_no_na$bill_length_mm[penguins_no_na$species == "Adelie"])
## [1] 38.82397
mean(penguins_no_na$bill_length_mm[penguins_no_na$species == "Chinstrap"])
## [1] 48.83382
mean(penguins_no_na$bill_length_mm[penguins_no_na$species == "Gentoo"])
## [1] 47.56807
tapply(penguins_no_na$bill_length_mm, # first the variable you want to summarise
       penguins_no_na$species,        # then the partitioning variable
       mean)                          # then the function you want to use for summarising
##    Adelie Chinstrap    Gentoo 
##  38.82397  48.83382  47.56807
tapply(penguins_no_na$bill_length_mm, penguins_no_na[, c('species', 'sex')], mean)
##            sex
## species       female     male
##   Adelie    37.25753 40.39041
##   Chinstrap 46.57353 51.09412
##   Gentoo    45.56379 49.47377
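The same pattern works with any summary function, and the result can be indexed by group name. A short sketch on a hypothetical data frame (toy and its values are invented for illustration); aggregate() is shown as one alternative that returns a data frame instead of a named vector:

```r
# Toy data (hypothetical values): two groups, one measurement
toy = data.frame(
    species = c("A", "A", "B", "B", "B"),
    bill    = c(38, 40, 48, 50, 52)
)

# tapply() returns a named vector, one element per group
means = tapply(toy$bill, toy$species, mean)
means
means["B"]  # pick out a single group by name

# Any function that takes a vector can be used for summarising
sds = tapply(toy$bill, toy$species, sd)

# aggregate() does the same job with a formula and returns a data frame
agg = aggregate(bill ~ species, data = toy, FUN = mean)
agg
```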
A brief summary
A histogram shows the distribution of a single variable.
The default output from hist() is pretty ugly.
hist(penguins_no_na$body_mass_g)
There are many options you can adjust.
hist(penguins_no_na$body_mass_g,
     main = "",                      # disables the title
     xlab = "Penguin Body Mass (g)",
     breaks = 15,                    # adjust the number of bins
     col = 'skyblue',
     border = 'darkblue')
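When breaks is a single number, hist() treats it as a suggestion and picks "pretty" break points, so the number of bins may not match exactly. A sketch with simulated values (not the penguin data) showing how to check this, and how to pass explicit break points instead:

```r
set.seed(42)                               # simulated data, for illustration only
x = runif(200, min = 2700, max = 6300)

# plot = FALSE returns the histogram object without drawing it
h = hist(x, breaks = 15, plot = FALSE)
length(h$counts)                           # bins actually used; may differ from 15

# A vector of break points is used exactly as given
exact = seq(2700, 6300, by = 240)          # 16 break points, so 15 bins
h2 = hist(x, breaks = exact, plot = FALSE)
length(h2$counts)
```

The explicit break points must cover the full range of the data, otherwise hist() stops with an error.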
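Colours can be given by name, e.g. col = 'rosybrown'; see the colors() function for the names.
Colours can also be given as hex codes of the form #RRGGBB, where each of the red, green, and blue components ranges from 00 (none) to FF (most).
hist(penguins_no_na$body_mass_g, col = "#77aa11", border = "#005500")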
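If hex codes are hard to read, base R can convert between the notations. A brief sketch using rgb() and col2rgb() from the grDevices package, which is loaded by default:

```r
# Build a hex code from red, green, blue components on a 0-255 scale
green_hex = rgb(119, 170, 17, maxColorValue = 255)
green_hex                     # "#77AA11"

# Go the other way: components of a hex code or of a named colour
col2rgb("#77aa11")
col2rgb("rosybrown")

# Search the list of colour names for something suitable
grep("brown", colors(), value = TRUE)
```

Hex codes are not case sensitive, so "#77aa11" and "#77AA11" give the same colour.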
Scatterplots show how two variables covary.
plot(penguins$body_mass_g, penguins$bill_length_mm, xlab = "Body Mass (g)", ylab = "Bill Length (mm)")
I can create a variable to store the colour I want to use for each point.
penguins$color[penguins$species == "Adelie"] = "#ff561b"
penguins$color[penguins$species == "Chinstrap"] = "#b952c0"
penguins$color[penguins$species == "Gentoo"] = "#00676a"
plot(penguins$body_mass_g, penguins$bill_length_mm,
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)",
     col = penguins$color, pch = 16) # pch=16 uses a solid circle
I can do the same for sex. ifelse() works great if you have 2 categories!
penguins$symbol = ifelse(penguins$sex == "male", 15, 17)
plot(penguins$body_mass_g, penguins$bill_length_mm,
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)",
     col = penguins$color, pch = penguins$symbol)
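With more than two categories, a named vector used as a lookup table is one alternative to repeated assignments. The sketch below uses short made-up vectors rather than the penguin data; it also shows that ifelse() returns NA where the test is NA, which matters here because some penguins have no recorded sex. The fallback symbol is an arbitrary choice for illustration.

```r
# Made-up example vectors (not read from the data file)
species = c("Adelie", "Gentoo", "Chinstrap", "Adelie")
sex     = c("male", "female", NA, "female")

# A named vector works as a lookup table: index it by the category values
species_cols = c(Adelie = "#ff561b", Chinstrap = "#b952c0", Gentoo = "#00676a")
point_col = unname(species_cols[species])
point_col

# ifelse() gives NA wherever the test itself is NA
point_pch = ifelse(sex == "male", 15, 17)
point_pch

# Points with pch = NA are not drawn; one option is a fallback symbol
# (1 = open circle, an arbitrary choice)
point_pch[is.na(point_pch)] = 1
point_pch
```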
You may annotate plots by running commands after you use the plot() command. Things to try:
legend() # create a legend
abline() # create a line if you know intercept/slope
lines()  # more general function for adding lines
points() # add x-y points
text()   # add text annotations
mtext()  # add text to plot margins
plot(penguins$body_mass_g, penguins$bill_length_mm,
     xlab = "Body Mass (g)", ylab = "Bill Length (mm)",
     col = penguins$color, pch = penguins$symbol)
legend("bottomright",
       legend = c("Adelie", "Chinstrap", "Gentoo", "male", "female"),
       col = c("#ff561b", "#b952c0", "#00676a", "black", "black"),
       pch = c(16, 16, 16, 15, 17))
y ~ group is a special data type called a formula.
boxplot(body_mass_g ~ species, data = penguins, boxwex=0.4, notch = TRUE)
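A formula is an ordinary R object that can be stored in a variable and inspected before any data are attached. A minimal sketch; the data frame toy and its values are hypothetical:

```r
# A formula only records variable names; nothing is evaluated yet
f = body_mass_g ~ species
class(f)     # "formula"
all.vars(f)  # the variable names it mentions

# The names are looked up in `data` when the formula is used
toy = data.frame(
    species     = rep(c("A", "B"), each = 3),
    body_mass_g = c(3500, 3700, 3900, 5000, 5200, 5400)
)

# plot = FALSE returns the box statistics without drawing
b = boxplot(f, data = toy, plot = FALSE)
b$names      # one box per level of the grouping variable
b$n          # number of observations in each box
```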
cols = c("#ff561b", "#b952c0", "#00676a")
boxplot(body_mass_g ~ species+sex, data = penguins,
        # set the colours, repeated twice (female and male)
        col = rep(cols, 2),
        # axis label text size
        cex.axis = 0.8,
        # labels under the boxes
        names = c("", "Female", "", "", "Male", ""),
        # set axis titles
        xlab = "", ylab = "Body mass (g)",
        # disable the box around the plot
        bty = 'n', notch = TRUE)
# add a legend to the plot
# sort() puts the species in alphabetical order, matching the order of the boxes and of cols
legend("topleft", legend = sort(unique(penguins$species)), title = "Species",
       fill = cols, bty = 'n')
Add a line with abline() to distinguish male vs female.
Or experiment with the at = argument.
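Both suggestions can be sketched as follows, using simulated data in place of the penguin file (the data frame toy, its values, and the chosen positions are assumptions for illustration):

```r
set.seed(1)  # simulated data, for illustration only
toy = data.frame(
    species = rep(c("A", "B", "C"), times = 20),
    sex     = rep(c("female", "male"), each = 30),
    mass    = rnorm(60, mean = 4000, sd = 300)
)
cols = c("#ff561b", "#b952c0", "#00676a")

# Option 1: boxes sit at x = 1..6 by default, so a vertical line at 3.5
# falls between the last female box and the first male box
boxplot(mass ~ species + sex, data = toy, col = rep(cols, 2))
abline(v = 3.5, lty = 2)

# Option 2: use at = to place the boxes yourself, leaving a gap at x = 4
pos = c(1, 2, 3, 5, 6, 7)
boxplot(mass ~ species + sex, data = toy, col = rep(cols, 2), at = pos)
```

The at = vector needs one position per box, in the same order as the boxes are drawn.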