Welcome to the fourth day of the FAS Informatics Intro to R workshop!

If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here.

Functions, Again

We’ve talked a lot about functions in this workshop so far. They take objects as arguments, do something, and return the output. So far, we have used functions that come with base R, or from packages we’ve installed. Today, we are going to learn how to write your own (simple) functions.

Why might you want to do this? There are a few major advantages.

The biggest is advantage comes if you use the same code more than once. Rather than having to copy and paste your code from one place to another, you can wrap it in a function and just call the function. This initially might take a little bit more work, but ultimately it will save you a ton of time and mistakes. For example, if you change something about the code, you only have to change it once, in the function, instead of multiple times. You can also easily apply your code to different inputs if you have it as a function. And it can be easier for collaborators and labmates to understand if your code is organized into functions – remember, your most frequent collaborator is you, six months in the future.

Even if you only use a piece of code once, it can be useful to encapsulate it into a function. It can help with readability, it makes maintaining your code easier, and it will make reusing your old code much, much easier if you can just grab a few functions from a different script. Eventually you might find yourself with a R script with a set of handy functions you routinely source at the beginning of your analysis markdown notebooks. No more trying to remember how you did that thing last month and spending another hour googling on Stackoverflow.

Writing your own functions may seem intimidating, but we’ll start slow. Remember in day 2, working with the vertebrates dataset, we used mutate to create a column that had the length Z score, that is the length_1_mm column mean centered and scaled by the standard deviation.

A simple function

Let’s create a function that takes a vector as input and returns a vector of the same length of the Z-normalized input vector.

Run this code block to create a simple function to create a normalized version of an input vector

z.normalize <- function(input) {
  input_mean <- mean(input)
  input_sd <- sd(input)
  input_Z <- (input - input_mean) / input_sd
  return(input_Z)
}

Let’s go over this slowly.

Functions are defined by assigning the output of the function operator to an object. You should see, for example, that z.normalize shows up in your Rstudio R environment as a function. The arguments that we will pass to our function are listed in the () after the function operator. We then use curly braces ({}) to define the code that will be run when we call the function. Lastly, we tell the function what to return, in this case the Z-transformed vector.

We can run functions we write just like any built-in or package function. Let’s try it.

Run this code block to create a test vector and run our new function on it

test_vec <- c(1,2,3,4,5,6,7,8,9,10)
z.normalize(test_vec)
##  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446  0.1651446
##  [7]  0.4954337  0.8257228  1.1560120  1.4863011

Remember that a Z normalized list of values should have a mean of 0 and and standard deviation of 1. Let’s test it.

Run this code block to test our new function returns correctly Z normalized data

test_vec_Z <- z.normalize(test_vec)
mean(test_vec_Z)
## [1] 0
sd(test_vec_Z)
## [1] 1

A note about naming

It can be helpful to name your functions in such a way they are distinct from normal objects. For this workshop, we’ll use . in our function names and single words or _ in our objects. Another common approach is to use camelCase or TitleCase for functions and lowercase for objects. Note that Rstudio will helpfully group functions in their own section of your environment for you, but it is still handy to have a naming convention.

Default values

So far, we’ve just created a function that takes a numeric vector and normalizes it. What happens if there are missing values in our input?

Run the code block below to see what happens if the input has missing values

test_vec_incomplete <- c(1,2,3,4,5,6,NA,7,8,9,10)
z.normalize(test_vec_incomplete)
##  [1] NA NA NA NA NA NA NA NA NA NA NA

Uh oh, we have all NAs! We might immediately think, okay, let’s add na.rm=TRUE to our function, since that works for mean and sd.

Run the code block below and see what happens when we add na.rm=TRUE to our function.

z.normalize(test_vec_incomplete, na.rm=TRUE)

We need to modify our function so that it understands this argument.

Run the code block below to compute a normalized vector when there are missing values in the input

### Redefine the function
z.normalize <- function(input, removeNA = TRUE) {
  input_mean <- mean(input, na.rm = removeNA)
  input_sd <- sd(input, na.rm = removeNA)
  (input - input_mean) / input_sd
}

z.normalize(test_vec_incomplete)
##  [1] -1.4863011 -1.1560120 -0.8257228 -0.4954337 -0.1651446  0.1651446
##  [7]         NA  0.4954337  0.8257228  1.1560120  1.4863011
z.normalize(test_vec_incomplete, removeNA = FALSE)
##  [1] NA NA NA NA NA NA NA NA NA NA NA

In our new function definition, we’ve changed two things. First, we add another argument, removeNA. We assign this to TRUE in the function definition itself, which creates a default value. If we don’t include the removeNA argument in our function call, it will default to TRUE.

Second, we now are relying on the fact that R functions by default return the output of the last operation in the {} code block. So we don’t need to assign the output to a new object and return that object. Another of those programming shortcuts we’ve talked about so much already.

A side note about variable scope

You might worry that our function will break if we have another object called input in our R environment. Since the code block relies on the input object, if we have an object called input outside the code block, what will happen? The answer is that functions always look inside their own code block before they look at regular objects outside their code block. Let’s look at a simple function as an example.

Run the code block below to create a simple function that adds two numbers, adding 10 to the first argument by default

my.sum <- function(input_1, input_2 = 10) {
  input_1 + input_2
}

Exercise: What number do you think the following code block will return? Try to come up with answer before running it.

input_1 <- 20
my.sum(input_1 = 1, 3)
## [1] 4

What about this code block?

my.sum(input_2 = 3)

Documentation

You’ll notice if you try to run ?z.normalize, nothing comes up. It is definitely possible to create actual documentation for your own R functions, but it is a little complicated and beyond the scope of this workshop. Nonetheless, it is a great idea to document what your functions do. The easiest way to do this is to use comments (#) in the function definition.

Run the R code block below to rewrite the z.normalize function to include documentation.

z.normalize <- function(input, removeNA = TRUE) {
  ## takes a numeric input vector and returns a Z transformed version
  ## optional argument removeNA determines whether NAs are removed before calculating the mean and sd
  input_mean <- mean(input, na.rm = removeNA)
  input_sd <- sd(input, na.rm = removeNA)
  (input - input_mean) / input_sd
}

Finally, let’s put this function to use! Let’s rewrite our code from day 2 to use the z.normalize function. We’ll also use pipes here, so we’ll need to load the tidyverse package first.

# Set up
library(tidyverse)
vertebrates <- read_csv(file="https://harvardinformatics.github.io/workshops/2023-spring/r/data/LTER_andvertebrates.csv")

# Filter and compute Z normalized length
coastalZ <- vertebrates %>%
  filter(species=="Coastal giant salamander") %>%
  mutate(length_1_mm_Z = z.normalize(length_1_mm))

coastalZ
## # A tibble: 11,758 x 17
##     year sitecode section reach  pass unitnum unittype vert_in~1 pitnu~2 species
##    <dbl> <chr>    <chr>   <chr> <dbl>   <dbl> <chr>        <dbl>   <dbl> <chr>  
##  1  1993 MACKCC-L CC      L         1       1 P               16      NA Coasta~
##  2  1993 MACKCC-L CC      L         1       2 C               36      NA Coasta~
##  3  1993 MACKCC-L CC      L         1       2 C               37      NA Coasta~
##  4  1993 MACKCC-L CC      L         1       2 C               38      NA Coasta~
##  5  1993 MACKCC-L CC      L         1       2 C               39      NA Coasta~
##  6  1993 MACKCC-L CC      L         1       2 C               40      NA Coasta~
##  7  1993 MACKCC-L CC      L         1       2 C               41      NA Coasta~
##  8  1993 MACKCC-L CC      L         1       2 C               42      NA Coasta~
##  9  1993 MACKCC-L CC      L         1       2 C               43      NA Coasta~
## 10  1993 MACKCC-L CC      L         1       2 C               44      NA Coasta~
## # ... with 11,748 more rows, 7 more variables: length_1_mm <dbl>,
## #   length_2_mm <dbl>, weight_g <dbl>, clip <chr>, sampledate <date>,
## #   notes <chr>, length_1_mm_Z <dbl>, and abbreviated variable names
## #   1: vert_index, 2: pitnumber
# Summarise to check mean = 0 and sd = 1
summarise(coastalZ, mean_Z = mean(length_1_mm_Z,na.rm=TRUE), sd_Z = sd(length_1_mm_Z,na.rm=TRUE))
## # A tibble: 1 x 2
##      mean_Z  sd_Z
##       <dbl> <dbl>
## 1 -1.46e-16     1

One of the main benefits of functions is to make our code reusable. Now that we have the z.normalize function, we can apply it to any numeric vector.

Exercise: Modify the code block below to use the z.normalize function to filter out observations where the length is more than 1 standard deviation from the mean. Since the Z score represents the number of standard deviations from the mean, this means you want to keep observations with Z scores between -1 and 1

vertebrates %>%
  filter(species=="Coastal giant salamander") %>%
  filter(z.normalize(length_1_mm) < 1, z.normalize(length_1_mm) > -1)
## # A tibble: 8,647 x 16
##     year sitecode section reach  pass unitnum unittype vert_in~1 pitnu~2 species
##    <dbl> <chr>    <chr>   <chr> <dbl>   <dbl> <chr>        <dbl>   <dbl> <chr>  
##  1  1993 MACKCC-L CC      L         1       1 P               16      NA Coasta~
##  2  1993 MACKCC-L CC      L         1       2 C               37      NA Coasta~
##  3  1993 MACKCC-L CC      L         1       2 C               38      NA Coasta~
##  4  1993 MACKCC-L CC      L         1       2 C               39      NA Coasta~
##  5  1993 MACKCC-L CC      L         1       2 C               41      NA Coasta~
##  6  1993 MACKCC-L CC      L         1       2 C               42      NA Coasta~
##  7  1993 MACKCC-L CC      L         1       2 C               43      NA Coasta~
##  8  1993 MACKCC-L CC      L         1       2 C               45      NA Coasta~
##  9  1993 MACKCC-L CC      L         1       3 SC               7      NA Coasta~
## 10  1993 MACKCC-L CC      L         1       3 SC               8      NA Coasta~
## # ... with 8,637 more rows, 6 more variables: length_1_mm <dbl>,
## #   length_2_mm <dbl>, weight_g <dbl>, clip <chr>, sampledate <date>,
## #   notes <chr>, and abbreviated variable names 1: vert_index, 2: pitnumber

A more complicated function

Functions can also call other functions. So, for example, we could write a filter function that takes a numeric vector as its first argument, and a threshold in units of standard deviations as the second optional argument (with a default value of 2), and returns a logical vector that is true if the corresponding element of the numeric vector is less than the threshold standard deviations from the mean, and false otherwise.

Excercise: As an exercise, write a function called remove_outliers in the code block below that can do this. You should be able to call filter(vertebrates, species=="Coastal giant salamander", remove_outliers(length_1_mm, threshold=1)) and get the same result as the filter command above.

remove.outliers <- function(input_vec, threshold=2) {
  input_z <- z.normalize(input_vec)
  #input_z < threshold & input_z > -threshold
  
  abs(input_z) < threshold
}

filter(vertebrates, species=="Coastal giant salamander") %>%
  filter(remove.outliers(length_1_mm, threshold=1))
## # A tibble: 8,647 x 16
##     year sitecode section reach  pass unitnum unittype vert_in~1 pitnu~2 species
##    <dbl> <chr>    <chr>   <chr> <dbl>   <dbl> <chr>        <dbl>   <dbl> <chr>  
##  1  1993 MACKCC-L CC      L         1       1 P               16      NA Coasta~
##  2  1993 MACKCC-L CC      L         1       2 C               37      NA Coasta~
##  3  1993 MACKCC-L CC      L         1       2 C               38      NA Coasta~
##  4  1993 MACKCC-L CC      L         1       2 C               39      NA Coasta~
##  5  1993 MACKCC-L CC      L         1       2 C               41      NA Coasta~
##  6  1993 MACKCC-L CC      L         1       2 C               42      NA Coasta~
##  7  1993 MACKCC-L CC      L         1       2 C               43      NA Coasta~
##  8  1993 MACKCC-L CC      L         1       2 C               45      NA Coasta~
##  9  1993 MACKCC-L CC      L         1       3 SC               7      NA Coasta~
## 10  1993 MACKCC-L CC      L         1       3 SC               8      NA Coasta~
## # ... with 8,637 more rows, 6 more variables: length_1_mm <dbl>,
## #   length_2_mm <dbl>, weight_g <dbl>, clip <chr>, sampledate <date>,
## #   notes <chr>, and abbreviated variable names 1: vert_index, 2: pitnumber

Putting it all together

For the last part of the class, we’ll look at a real data analysis example in a fresh ‘computational lab notebook’ that will put together everything you’ve done so far.