Welcome to the fourth day of the FAS Informatics Intro to R workshop!
This is the workshop file that should be opened in RStudio. This is an RMarkdown file, meaning that both formatted text and code blocks can be added to it, and the code blocks can be executed from the RStudio interface. Today we will be going through this document and executing the code blocks within it to demonstrate some basic R programming concepts.
For this part of the workshop, you can do one of three things using your new R skills:
For any option, we will be available to help! Below we outline the framework for the bird collision analysis. For the other options, feel free to use this document and overwrite the code blocks, or create your own R Markdown document and start from scratch.
Our example dataset is going to be analyzing some data on collisons between migrating birds in skyscrapers collected in Chicago and Cleveland, published in 2019. The data is available on Dryad.
#Set up and download data
library(tidyverse)
library(janitor)
raw_birds <- read_csv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/Chicago_collision_data.csv") %>%
janitor::clean_names()
raw_light <- read_csv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/Light_levels_dryad.csv") %>%
janitor::clean_names()
raw_call <- read_csv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/bird_call.csv") %>%
janitor::clean_names()
The first thing we want to do is to try to understand the data.
The raw_birds dataset, downloaded from dryad, has a single row for each observed collison, with a Genus/Species, a date, and the location (here, MP stands for McCormick Place and CHI stands for elsewhere in Chicago).
The raw_light dataset, also downloaded from dryad, has light readings from McCormick Place, where higher means more light – this is just the number of lit windows at a certain time of night.
The raw_call dataset is from the supplemental materials of the paper, and it lists 91 species with at least one collison ib the data, and includes some additional data for each species, including the total number of collisons, the genus/species/family, whether the bird uses nocturnal flight calls, some habitat information, and whether the bird is typically a ground bird (stratum = Lower) or a canopy bird (stratum = Upper).
You can try glimpse
and View
to get some basic summaries. Use the code block below to get some basic summary information about the data.
glimpse(raw_birds)
## Rows: 69,784
## Columns: 4
## $ genus <chr> "Ammodramus", "Ammodramus", "Ammodramus", "Ammodramus", "Ammo~
## $ species <chr> "nelsoni", "nelsoni", "nelsoni", "nelsoni", "nelsoni", "nelso~
## $ date <date> 1982-10-03, 1984-05-21, 1984-05-25, 1985-10-08, 1986-09-10, ~
## $ locality <chr> "MP", "CHI", "MP", "MP", "MP", "MP", "MP", "MP", "MP", "MP", ~
glimpse(raw_light)
## Rows: 3,067
## Columns: 2
## $ date <date> 2000-03-06, 2000-03-08, 2000-03-10, 2000-03-31, 2000-04-0~
## $ light_score <dbl> 3, 15, 3, 3, 17, 4, 4, 14, 14, 3, 14, 16, 4, 4, 4, 14, 3, ~
glimpse(raw_call)
## Rows: 91
## Columns: 7
## $ genus <chr> "Zonotrichia", "Junco", "Melospiza", "Melospiza", "Seiurus~
## $ species <chr> "albicollis", "hyemalis", "melodia", "georgiana", "aurocap~
## $ family <chr> "Passerellidae", "Passerellidae", "Passerellidae", "Passer~
## $ collisions <dbl> 10133, 6303, 5124, 4910, 4580, 3729, 2676, 2515, 2443, 233~
## $ flight_call <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Y~
## $ habitat <chr> "Forest", "Edge", "Edge", "Open", "Forest", "Forest", "For~
## $ stratum <chr> "Lower", "Lower", "Lower", "Lower", "Lower", "Lower", "Upp~
Now there are a couple of cleaning steps we might want to take.
First, we probably want to merge these three datasets into a single dataset for analysis, where each row is an observed collision and we include both collision variables (date, locality, light level) and species variables in our dataset. This is a little complicated, because we want to merge the light levels only for the rows where locality is “MP” - the light levels are only recorded at the MP locality.
One way we can do this is create a column in the light levels dataframe that is called “locality” and assign it to “MP”, and then merge by both date and locality.
This step is a bit complicated, so we’ll give you the code.
raw_light <- raw_light %>% mutate(locality = "MP")
# add locality information to the light level data
bird_crash <- raw_birds %>%
left_join(raw_call, by=c("genus", "species")) %>%
# merge the raw collisions dataset (raw_birds) with the information about each bird species in the raw_call dataset
unite("sci_name", c("genus", "species"), remove = FALSE) %>%
# create a new variable, sci_name, with the Latin binomial (scientific name) of each species
# remove = FALSE means we want to keep the original genus and species columns as well as the new colum
left_join(raw_light, by=c("date", "locality"))
## Warning in left_join(., raw_light, by = c("date", "locality")): Each row in `x` is expected to match at most 1 row in `y`.
## i Row 16306 of `x` matches multiple rows.
## i If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
We use the join
family of functions to merge data by matching columns, and make a new variable called g_s
, for the scientific name of the species from the genus and species columns.
But we get a warning…Warning: Each row in
xis expected to match at most 1 row in
y.
It seems to be telling us that somewhere in our data, we are duplicating rows in x. That is, we have some entries in either the raw_light tibble or the raw_call tibble that match mulitiple rows in the raw_birds tibble. This shouldn’t happen and means we’ll get some duplicated rows.
Use the code block below to explore the raw_light and raw_call tibble to see if you can see if either one has multiple observations of a single genus/species (raw_call) or a single date (raw_light).
raw_call %>% group_by(genus, species) %>% summarize(n=n()) %>% filter(n!=1)
## `summarise()` has grouped output by 'genus'. You can override using the
## `.groups` argument.
## # A tibble: 0 x 3
## # Groups: genus [0]
## # ... with 3 variables: genus <chr>, species <chr>, n <int>
raw_call %>% unite("sci_name", c("genus", "species")) %>% group_by(sci_name) %>% summarize(n=n()) %>% filter(n!=1)
## # A tibble: 0 x 2
## # ... with 2 variables: sci_name <chr>, n <int>
raw_light %>% group_by(date) %>% summarize(n=n()) %>% filter(n!=1)
## # A tibble: 5 x 2
## date n
## <date> <int>
## 1 2000-05-09 2
## 2 2003-09-02 2
## 3 2007-08-29 2
## 4 2009-09-18 2
## 5 2017-03-17 2
#raw_birds %>% group_by(genus, species) %>% summarize(n=n()) %>% filter(n!=1)
#raw_birds %>% group_by(date) %>% summarize(n=n()) %>% filter(n!=1)
Once you figure out where the problem is, use the code block below to fix it, and rerun the join command.
You can use the distinct() function with the .keep_all = TRUE
argument (note the . at the beginning of keep_all) to filter a dataset to keep all columns but only rows with unique values for a particular column or columns. So distinct(raw_call, family, .keep_all = TRUE) would return a new data frame with only one row per unique value of the family column. The first occurrence is kept, so you might need to sort your data with arrange()
first.
## Remove duplicate values from dataset
raw_light = raw_light %>% arrange(date) %>% distinct(date, .keep_all=TRUE)
## Remove duplicate values from dataset
bird_crash <- raw_birds %>%
left_join(raw_call, by=c("genus", "species")) %>%
# merge the raw collisions dataset (raw_birds) with the information about each bird species in the raw_call dataset
unite("sci_name", c("genus", "species"), remove = FALSE) %>%
# create a new variable, sci_name, with the Latin binomial (scientific name) of each species
# remove = FALSE means we want to keep the original genus and species columns as well as the new colum
left_join(raw_light, by=c("date", "locality"))
One thing you’ll see is that we have a record of the number of collisions for each species in the collisions column. We should have exactly that many rows for each species. We can use the add_count
function to add another column that gives us the group-wise counts, and then compare that new column to the collisions column.
We also at this point probably want to look at a few more data summaries - check for NAs, for example. There is a useful bit of code we’ve included below to get the number of NAs in each column.
bird_crash %>%
add_count(sci_name, sort = TRUE) %>%
select(collisions, n) %>% distinct() %>% filter(collisions != n)
## # A tibble: 0 x 2
## # ... with 2 variables: collisions <dbl>, n <int>
# we expect this to be empty, since we filter for rows where the number of collisions we count from the raw data does not equal the number the paper reported for that species
colSums(is.na(bird_crash))
## sci_name genus species date locality family
## 0 0 0 0 0 89
## collisions flight_call habitat stratum light_score
## 89 89 89 89 60147
# remember that is.na() returns a logical vector (or, in this case, a data frame); when we run a sum on a logical vector, TRUE = 1 and FALSE = 0, so the sum of a logical vector or matrix is the same as counting the number of TRUEs
Great, the numbers match! This is what we want - no warnings and our processed data matches what we expect.
However, we also see there are a few NAs in the the family, collisions, flight_call, habitat, and stratum columns, as well as lots in the light_score. We expected the latter, but not the former. These must be species that are included in the dataset, but for which no biological information on habitat, nocturnal flight calls, or other details were reported in the paper supplemental material. Let’s look at these.
bird_crash %>% filter(is.na(family)) %>% select(sci_name) %>% group_by(sci_name) %>% summarize(obs=n())
## # A tibble: 3 x 2
## sci_name obs
## <chr> <int>
## 1 Ammodramus_henslowii 9
## 2 Ammodramus_leconteii 25
## 3 Ammodramus_nelsoni 55
What could be going on here? This is the sort of thing that needs to be investigated - NAs where you don’t expect them are often a sign that there is a typo somewhere, for example, and the join is not happening correctly.
If you are a very serious birder or an ornithologist, you might have an idea of what happened in this data frame. Since most of you are probably not, let’s try joining our summary data to our raw_call dataset by the number of collisions.
bird_crash %>%
filter(is.na(family)) %>%
select(sci_name, genus, species) %>%
group_by(sci_name,genus,species) %>%
summarize(collisions=n()) %>%
left_join(raw_call, by=c("collisions"), suffix = c(".raw_birds", ".raw_call"))
## `summarise()` has grouped output by 'sci_name', 'genus'. You can override using
## the `.groups` argument.
## Warning in left_join(., raw_call, by = c("collisions"), suffix = c(".raw_birds", : Each row in `x` is expected to match at most 1 row in `y`.
## i Row 1 of `x` matches multiple rows.
## i If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
## # A tibble: 4 x 10
## # Groups: sci_name [3]
## sci_n~1 genus~2 speci~3 colli~4 genus~5 speci~6 family fligh~7 habitat stratum
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Ammodr~ Ammodr~ henslo~ 9 Centro~ henslo~ Passe~ Yes Open Lower
## 2 Ammodr~ Ammodr~ henslo~ 9 Piranga rubra Cardi~ Yes Forest Upper
## 3 Ammodr~ Ammodr~ lecont~ 25 Ammosp~ lecont~ Passe~ Yes Open Lower
## 4 Ammodr~ Ammodr~ nelsoni 55 Ammosp~ nelsoni Passe~ Yes Open Lower
## # ... with abbreviated variable names 1: sci_name, 2: genus.raw_birds,
## # 3: species.raw_birds, 4: collisions, 5: genus.raw_call,
## # 6: species.raw_call, 7: flight_call
We can see that in the raw_call dataset, there is a species called Ammospiza leconteii that has the same number of collisions as our Ammodramus_leconteii, and similar for the other two problem species. This is because these species were reclassified out of the Ammodramus into a different genera, probably sometime in the middle of this long term experiment, and the only rows were not updated.
Let’s update our data block once more, to rename these species. The genera names in the raw_calls dataset are correct, so we need to update the raw_birds dataset.
Remember about the replace function from day 3, and about logical vectors from day 1. There is a hint below, if you are stuck
## Remove duplicate values from dataset (copy from above)
raw_light = raw_light %>% arrange(date) %>% distinct(date, .keep_all=TRUE)
## Remove duplicate values from dataset
## Update raw birds dataset to rename Ammodramus henslowii, Ammodramus leconteii, and Ammodramus nelsoni
raw_birds = raw_birds %>%
mutate(genus = replace(genus, genus == "Ammodramus" & species == "henslowii", "Centronyx")) %>%
mutate(genus = replace(genus, genus == "Ammodramus" & species == "leconteii", "Ammospiza")) %>%
mutate(genus = replace(genus, genus == "Ammodramus" & species == "nelsoni", "Ammospiza"))
## Update raw birds dataset to rename Ammodramus henslowii, Ammodramus leconteii, and Ammodramus nelsoni
raw_light <- raw_light %>% mutate(locality = "MP")
# add locality information to the light level data
bird_crash <- raw_birds %>%
left_join(raw_call, by=c("genus", "species")) %>%
# merge the raw collisions dataset (raw_birds) with the information about each bird species in the raw_call dataset
unite("sci_name", c("genus", "species"), remove = FALSE) %>%
# create a new variable, sci_name, with the Latin binomial (scientific name) of each species
# remove = FALSE means we want to keep the original genus and species columns as well as the new colum
left_join(raw_light, by=c("date", "locality"))
Hint: The code mutate(genus = replace(genus, genus == "Ammodramus" & (species == "leconteii" | species == "nelsoni"), "Ammospiza"))
updates every row of the input data where genus is Ammodramus and species is either leconteii or nelsoni with the new genus, Ammospiza. You can right something similar, with a different logical vector and replacement value, for
Now you probably want to scroll back and rerun our check code above, and verify there are no NAs in the family, flight_call, and other variables, and the only NAs are in the light_score column.
Now we should have a clean dataset to work with. For a lab notebook, we would probably want to consolidate all this cleaning code into a single code block at the beginning of our R markdown that we can run each time we start our analysis, so that we clean our data in the same precise way each time. Let’s do that here.
In the code block below, copy the commands to read in the raw data, and our final cleaning steps to remove duplicates from the raw_light dataset and update the genus names in the raw_birds dataset, as well as the merge command to join everything together. Now, when you want to work on this data, all you need to do is run the data cleaning code block below to get started. We might also want to create a setup
code block to load packages, and a functions
code block to define any custom functions we create.
Copy library() statements here to load all the packages you’ll need:
library(tidyverse)
library(janitor)
# Load required packages
Copy code for custom functions here. We’ve given you one that you can play with, which takes a date and returns the decade it is in:
as.decade <- function(year){
# Takes a date as input, converts it to an integer year, and then decade
# Returns a numeric value for the decade (*NOT* a date)
year_int <- as.integer(format(year, format="%Y"))
decade_part <- year_int %% 10
year_int - decade_part
}
Copy final data cleaning code here. It is a good idea to include any data checks in this code block as well.
raw_birds <- read_csv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/Chicago_collision_data.csv") %>%
janitor::clean_names()
## Rows: 69784 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): Genus, Species, Locality
## date (1): Date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw_light <- read_csv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/Light_levels_dryad.csv") %>%
janitor::clean_names()
## Rows: 3067 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1): Light_Score
## date (1): Date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw_call <- read_csv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/bird_call.csv") %>% janitor::clean_names()
## Rows: 91 Columns: 7
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): Genus, Species, Family, FlightCall, Habitat, Stratum
## dbl (1): Collisions
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Read the data
####################
raw_light = raw_light %>% arrange(light_score) %>% distinct(date, .keep_all=TRUE)
# Remove duplicate dates from the raw_light dataset
#########
raw_birds = raw_birds %>%
mutate(genus = replace(genus, genus == "Ammodramus" & species == "henslowii", "Centronyx")) %>%
mutate(genus = replace(genus, genus == "Ammodramus" & species == "leconteii", "Ammospiza")) %>%
mutate(genus = replace(genus, genus == "Ammodramus" & species == "nelsoni", "Ammospiza"))
# Update raw_birds dataset to rename Ammodramus henslowii, Ammodramus leconteii, and Ammodramus nelsoni
#########
raw_light <- raw_light %>% mutate(locality = "MP")
# Add locality information to the light level data
#########
bird_crash <- raw_birds %>%
left_join(raw_call, by=c("genus", "species")) %>%
# merge the raw collisions dataset (raw_birds) with the information about each bird species in the raw_call dataset
unite("sci_name", c("genus", "species"), remove = FALSE) %>%
# create a new variable, sci_name, with the Latin binomial (scientific name) of each species
# remove = FALSE means we want to keep the original genus and species columns as well as the new column
left_join(raw_light, by=c("date", "locality")) %>%
# Merge in the raw_light dataset by date and locality
#filter(flight_call != "Rare") %>%
# Remove rows with "Rare" flight calls
filter(!is.na(light_score)) %>%
# Remove rows where light_score is NA
mutate(flight_call = replace(flight_call, flight_call == "Yes", "yes")) %>%
mutate(flight_call = replace(flight_call, flight_call == "No", "no")) %>%
mutate(flight_call = replace(flight_call, flight_call == "Rare", "no"))
# Convert flight_call column to lower case (to match paper legend)
##################
bird_crash %>%
add_count(sci_name, sort = TRUE) %>%
select(collisions, n) %>% distinct() %>% filter(collisions != n)
## # A tibble: 72 x 2
## collisions n
## <dbl> <int>
## 1 10133 1488
## 2 4910 1155
## 3 6303 986
## 4 5124 901
## 5 3729 529
## 6 2443 495
## 7 2029 425
## 8 4580 412
## 9 2515 306
## 10 1090 279
## # ... with 62 more rows
# We expect this to be empty, since we filter for rows where the number of collisions we count from the raw data does not equal the number the paper reported for that species
bird_crash %>% filter(is.na(family)) %>% select(sci_name) %>% group_by(sci_name) %>% summarize(obs=n())
## # A tibble: 0 x 2
## # ... with 2 variables: sci_name <chr>, obs <int>
# Check for NAs -- this should also be empty
You might combine these into a single code block, or keep them as three separate code blocks depending on your preference. Run these three code blocks to get ready for analysis.
Now, let’s attempt to replicate the analysis presenting in Figure 4A of the manuscript.
This is a challenge! Ask for help if you are stuck.
The figure legend is a bit confusing. The x axis should be the light_score, and the y axis should be the mean collisions per day, summed across species with that flight call type and averaged over all days. In the second panel, the x axis is the same, but the y axis is now mean crashes per day and per species.
To begin, you’ll need to make a summary dataset, since that is what we’ll plot. We want to group by flight_call and light_score (remove anything where light_score is NA first), and we need to get the mean collisions per day, which is the number of rows in our dataset (number of collisons) divided by the number of unique dates in our dataset (number of days). Remember the helper function n() returns the number of rows in the group, and n_distinct(column1) returns the number of unique values in column1, per group.
library(cowplot)
## Warning: package 'cowplot' was built under R version 4.1.3
####################
bartheme <- function () {
theme_classic() %+replace%
theme(text=element_text(family="serif", size=24),
axis.text=element_text(size=24),
axis.title=element_text(size=24),
axis.title.y=element_text(margin=margin(t=0,r=10,b=0,l=0),color="black",angle=90),
axis.title.x=element_text(margin=margin(t=10,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size=0.75),
axis.ticks.length=unit(0.2,"cm"),
legend.position=c(0.2,0.9),
legend.key.width = unit(0.75, unit = "cm"),
legend.spacing.x = unit(0.25, 'cm'),
legend.text=element_text(size=24),
plot.margin = unit(c(1,1,1,1), "cm")
)
}
# A custom theme for the plots
####################
mean_col_by_light <- bird_crash %>%
#filter(locality == "MP", is.na(light_score) == FALSE) %>%
group_by(light_score, flight_call) %>%
# Group by vars of interest
summarize(num_col=n(), num_day=n_distinct(date), avg_col_day=num_col / num_day)
## `summarise()` has grouped output by 'light_score'. You can override using the
## `.groups` argument.
# Summarize stats for figure
# num_col: total number of rows (collisions) per grouping
# num_day: number of distinct days per grouping
# avg_col_day: the average number of collisions per day
#mean_col_by_light
a = ggplot(mean_col_by_light, aes(x=light_score, y=log(avg_col_day), fill=flight_call)) +
# Create the grob and add the data and groupings (fill)
# Log transform the y axis
geom_smooth(aes(color=flight_call), method="lm", fill="#d3d3d3") +
# Add the lines and style them:
# color colors by group since it is in aes()
# fill changes the color of the confidence interval area
# Do the lines before the points because in the original figure they are on top of the points
geom_point(size=4, shape=21, stroke=0.8, color="#000000") +
# Add the points
# points are colored based on the fill=flight_call in the aes() above
# shape, stroke, and color here just put the black border around the grey points
scale_color_manual(name="nocturnal\nflight call", values=c("#999999", "#000000")) +
scale_fill_manual(name="nocturnal\nflight call", values=c("#999999", "#000000")) +
# The colors for the lines (color) and points (fill)
xlab("") +
ylab("log mean collision counts") +
# Axis labels - the x-axis of this panel is simply read from the other panel since they are stacked
bartheme()
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## i Please use the `linewidth` argument instead.
# Add the custom theme as defined at the beginning of the block
#print(a)
# Plot panel A
##########
mean_col_by_spec <- bird_crash %>%
group_by(flight_call, date) %>%
mutate(col_by_sp_per_night = n() / n_distinct(sci_name)) %>%
# First calculate collisions per night for each species
group_by(flight_call, light_score) %>%
summarize(avg_col_spec=mean(col_by_sp_per_night))
## `summarise()` has grouped output by 'flight_call'. You can override using the
## `.groups` argument.
# Next calculate average collisions per night for each species by light score
#mean_col_by_spec
b = ggplot(mean_col_by_spec, aes(x=light_score, y=avg_col_spec, fill=flight_call)) +
# Create the grob and add the data and groupings (fill)
# Log transform the y axis
geom_smooth(aes(color=flight_call), method="lm", fill="#d3d3d3") +
# Add the lines and style them:
# color colors by group since it is in aes()
# fill changes the color of the confidence interval area
# Do the lines before the points because in the original figure they are on top of the points
geom_point(size=4, shape=21, stroke=0.8, color="#000000") +
# Add the points
# points are colored based on the fill=flight_call in the aes() above
# shape, stroke, and color here just put the black border around the grey points
scale_color_manual(name="nocturnal\nflight call", values=c("#999999", "#000000")) +
scale_fill_manual(name="nocturnal\nflight call", values=c("#999999", "#000000")) +
# The colors for the lines (color) and points (fill)
xlab("") +
ylab("log mean collisions per species") +
# Axis labels
bartheme() +
# Add the custom theme as defined at the beginning of the block
theme(legend.position="none")
# For panel B we don't want a duplicate legend to show up, so we turn it off here with another call to theme()
#print(b)
# Plot panel B
##########
fig = plot_grid(a, b, nrow=2, labels=c("(a)", "(b)"), label_fontfamily="serif", label_size=24, label_fontface="italic")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Combine the figure panels and style the labels
print(fig)
# Display the final figure
If there is time, what else could you explore in this data? For example, maybe you want to look at trends in crashes over years, or even decades. The function format(dates, format="%Y")
will return years from a date variable, and the function we gave you above will convert a date to a decade.
Use this code block, or several, to explore whatever seems interesting