Welcome to the third day of the FAS Informatics Intro to R workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit and execute the code in this file. Please download the raw file here.
Conveying research findings in a meaningful way is an integral part to being a scientist, almost as important as performing research itself. The scale of modern biological data compounds its importance – we now need to summarize and condense millions of datapoints into meaningful, easy to understand figures while still showing the scale and variance in our data. All of which leads to the need for scientists to be able to programmatically generate visualizations for exploratory analysis, to convey conclustions, and for reproducibility of final figures.
In addition to statistical analysis, the R programming language and the RStudio IDE are built for this task, and while many people are able to generate figures with these programs, there can be a steep learning curve. The goal of today’s workshop is to introduce you how to use the skills you’ve learned in the previous workshops to manipulate data and visualize it in RStudio. We’ll touch a little bit on plots that can be generated by R itself (referred to as “base R”), but spend much of the time showing you the tidyverse plotting package, ggplot.
Let’s start with a brief example of R’s native plotting capabilities
(“base R”), which are powerful by themselves. In this code chunk, we
will load some pre-compiled datasets from the datasets
package and use minimal code to make some summary plots with them.
Run this code block to see some examples of base R plots
library(datasets)
# Load the datasets package, which contains several pre-compiled and clean datasets
# Type library(help="datasets") to see a full list of datasets provided
data(iris)
# This loads a dataset of flower data
par(mfrow=c(2,2))
# This tells R to setup a panel for 4 plots in 2 rows and 2 columns
plot(iris$Petal.Length, iris$Petal.Width)
# Plots two variables in the iris dataset as points on a scatter plot
petal_reg <- lm(iris$Petal.Width ~ iris$Petal.Length)
abline(petal_reg)
# This fits a basic linear regression model between two variables in the iris dataset
# and adds the regression line to the scatter plot
hist(iris$Sepal.Width)
# Plots a distribution of a variable in the iris dataset as a histogram
data(precip)
barplot(precip[1:20])
# Loads another dataset of precipitation in US cities and makes a barplot
# of the first 20 cities in the list. The *precip* data set represents a named vector: each numeric value for precipitation has, as its name, the city for which the precipitation data was recorded
data(nhtemp)
plot(nhtemp)
# Loads another dataset of temperatures over time and plots the time-series as
# a line graph
The block of code above demonstrates 4 basic plotting functions in base R:
plot()
, which takes two vectors of data and plots them
as x and y pointshist()
, which takes a single vector of data and bins
the data to plot the frequency of data in each bin as a histogram. This
is also known as plotting the distribution of the data.
Distributions can also be summarized visually as boxplots
(boxplot()
in base R).barplot()
, which takes categorical data, that is a
named vector or other named data structure, and plots raw
counts for each category (as opposed to counts of data
in bins in a histogram).plot()
again, but this time with time-series data (a
vector of points over time), which plots lines connecting the points
rather than the points themselves.While these plots in base R generally look good and convey a lot of information, they aren’t perfect. For example, the font size and placement on the x-axis of the barplot in the lower left hand panel mean only a few cities are labeled.
Furthermore, while these basic plots are easy to generate, more complex plots, say plotting two variables on the same panel, become more complicated with the syntax of base R. For these reasons, we will spend the rest of the time learning how to plot using the ggplot package.
ggplot is a package (library of code with various functions) that is part of the tidyverse. It uses a somewhat standardized ‘grammar of graphics’ (book; paper) in its syntax to make almost every aspect of a plot customizable. Using ggplot it is easy to make reproducible scientific figures that look nice and are easily understandable. With ggplot, I rarely need to tweak my figures outside of R.
This workshop is also heavily influenced by the book Fundamentals of Data Visualization by Claus Wilke.
Exercise: 1. Use the Packages tab of the panel in the lower left to install the
cowplot
package. We will use this later on to plot multi-panel figures.
- Run the code block below to look at the same plots as above, but using ggplot.
library(tidyverse)
# ggplot is part of the tidyverse package
library(cowplot)
# This package works with ggplot objects (graphical objects, or "grobs") to combine plots into multi-panel figures. Make sure you install it before running this code block.
# No need to load the datasets again since we've loaded them above.
scatter <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
geom_point() +
geom_smooth(method="lm")
# use ggplot to create a scatter plot with regression line of two variables in the iris
# dataset
# Save the grob as "scatter" for display later
dist <- ggplot(iris, aes(x=Sepal.Width)) +
geom_histogram(bins=10)
# Use ggplot to create a histogram of a variable in the iris dataset
# Save the grob as "dist" for display later
precip_df <- data.frame(as.list(precip[1:20]))
precip_df <- gather(precip_df)
names(precip_df) <- c("City", "Precipitation")
bar <- ggplot(precip_df, aes(x=City, y=Precipitation)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1))
# Use ggplot to create a barplot of precipitation in US cities
# Note that first we must convert the named vector to a data frame
# Save the grob as "bar" for display later
nhtemp_df <- data.frame(temp=as.matrix(nhtemp), date=time(nhtemp))
line <- ggplot(nhtemp_df, aes(x=date, y=temp)) +
geom_line()
# Use ggplot to crate a line graph of temperature over time
# Note that first we must convert the time series to a data frame
# Save the grob as "line" for display later
p <- plot_grid(scatter, dist, bar, line, ncol=2, nrow=2)
# Use plot_grid() to combine the 4 plots simply by listing their variable names
# Set the number of rows and columns in the panel at 2 each
print(p)
# Display the multi-panel figure
So these plots from ggplot look very similar to the base ones with some minor changes to the style.
Most importantly, ggplot takes as input data data frames or tibbles, which you learned about earlier.
Next we’ll outline the basic syntax of a ggplot plot and introduce two important aspects of them: aesthetics and layers.
A ggplot starts by defining a graphical object (or
grob) with the ggplot()
function and
telling it the source of the data for the plot, like this:
ggplot(data_frame)
So what happens if we do this with the iris
dataset?
Try loading the iris data into a ggplot object called
iris_plot
in the code block below and print it out. What happens?
ggplot(iris)
# Load the iris data into a ggplot grob and display (print) the grob
A blank grey rectangle should appear below the code block. Why?
Well, we’ve told ggplot the source of the data, but not which parts
of the data to use. In ggplot, which data is to be displayed and how it
is grouped and colored is referred to as the aesthetics
of the data, and is defined with the aes()
function. Each
ggplot plot needs at least one column of the input data frame to be
designated as the x-axis or y-axis in an aes()
call in
order to know how to plot the data. In many cases, both x and y will be
defined, but in some (like for histograms and boxplots) only one is
needed.
For example, if I had a data frame called data_frame
that had two columns called v1
and v2
that I
wanted to use in ggplot, I would do the following:
ggplot(data_frame, aes(x=v1, y=v2))
In the code block below, we add the variables
Petal.Length
and Petal.Width
as the x and y
aesthetics of the ggplot with the iris data and display it with
print()
.
Run the code block with the added aesthetics. What happens?
ggplot(iris, aes(x=Petal.Length, y=Petal.Width))
# Load the iris data into a ggplot grob AND specify the aesthetics and display the grob
Now you should see a grid on a blank grey rectangle. This time, ggplot knows the data to add to the plot so it appropriately labels the axes, but we still actually haven’t told it to plot the data. In other words, we haven’t added the layers that tell ggplot how to display the data to the plot yet. Should these be displayed as points? connected lines? something else? We haven’t specified yet, so it leaves all the data off.
In ggplot, we add layers using the plus sign
(+
). Each +
indicates a new layer in the
specified ggplot object and many layers can be added to a single plot in
order to display the data in different ways. Every aspect of a plot can
be controlled with different layers.
For reference, to see a table that summarizes different layers to display data, run the code block below. These layers cover the most basic ways to display, compare, and summarize data.
layer_table <- read.csv("https://harvardinformatics.github.io/workshops/2023-spring/r/tables/ggplot-layers.csv")
main_layers <- layer_table %>% filter(Category == "Main")
main_layers <- main_layers %>% select(!Category)
main_layers
# Read the table of layers, select the main layers and view
For reference, to see a table of some other useful layers, run the code block below.
layer_table <- read.csv("https://harvardinformatics.github.io/workshops/2023-spring/r/tables/ggplot-layers.csv")
second_layers <- layer_table %>% filter(Category == "Secondary")
second_layers <- second_layers %>% select(!Category)
second_layers
# Read the table of layers, select the secondary layers and view
Each layer itself is a function that can take
arguments (or
parameters/options) as input within
the ()
. For instance, if I wanted to change the size of the
points in geom_point()
, I could type:
plot_object <- ggplot(data_frame, aes(x=v1, y=v2)) +
geom_point(size=2)
Let’s use this information to now plot the points
for Petal.Length
and Petal.Width
that we have
specified as aesthetics for the iris data.
Run the following code block to add a geom_point() layer to the plot of Petal.Length and .Width:
iris_plot <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
geom_point(size=2, alpha=0.5, color="darkgreen") +
## When prompted, add a layer of geom_smooth() here
geom_smooth(method="lm", se=FALSE, linetype="dashed", color="#333333")
## When prompted, add a layer of geom_smooth() here
# Load the iris data into a ggplot grob, specify the aesthetics, and
# add a layer that adds the points with geom_point()
print(iris_plot)
# Display the plot
Great! Now we can see our data points on the plot! Now, let’s take a
couple of minutes to adjust the geom_point()
layer to do
the following:
Exercise: 1. In the code block above, change the size of the points to 2 2. In the code block above, change the transparency (or alpha) of the points to 0.5 3. In the code block above, change the color of the points to “darkgreen”
What if we wanted to fit a model that describes the relationship
between the x and y variables? For that, ggplot has a
layer called geom_smooth()
.
Exercise: In the code block above, add anothar layer to the plot to fit a model to it with
geom_smooth()
. If you just add+ geom_smooth()
what happens? Is this a linear regression?
Now make the following changes to the geom_smooth()
layer:
Exercise: 1. In the code block above, change the method of the model fitting to “lm” (linear model) 2. In the code block above, remove the grey confidence interval by setting
se=FALSE
3. In the code block above, change the linetype to “dashed” 4. In the code block above, change the color of the line to the following hex code:"#333333"
So far, we’ve shown how modular ggplot is with its
layers, but we’ve only used these layers with a single
aesthetic (set of data). Each layer can actually act
independently on different aesthetics, simply by specifying the new data
set within the layer and adding an aes()
definition to it, like:
plot <- ggplot(data_frame, aes(x=v1, y=v2)) +
geom_point(size=2) +
geom_point(data=new_data_frame, aes(x=v3, y=v4))
As you can see, in the second geom_point()
layer we have
defined a new source of data and subsequently another set of
aesthetics with aes()
. In this example,
the first geom_point()
layer will plot variables
v1
and v2
from the data frame
data
as points on the grid, while the second
geom_point()
layer will plot variables v3
and
v4
from the new_data
data frame as points on
the grid. Note that it is not always necessary to declare a new data
frame if the one originally declared has the data you wish to plot in
the new layer as other columns in the data frame.
Let’s try this ourselves.
Exercise: In the code block below, add a new
geom_point()
layer to add points that plotSepal.Length
andSepal.Width
on the same grid asPetal.Length
andPetal.Width
:
iris_plot <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
geom_point(color="#e69f00") +
## When prompted, add a new geom_point() layer here with Sepal data
geom_point(aes(x=Sepal.Length, y=Sepal.Width), color="#56b4e9") +
## When prompted, add a new geom_point() layer here with Sepal data
## When prompted, add layers that change the labels of the x and y axes here
xlab("Length") +
ylab("Width")
## When prompted, add layers that change the labels of the x and y axes here
print(iris_plot)
# Display the plot
Do you notice any problems with the above plot? Let’s try to address them. Edit the code block to do the following:
Exercise: 1. In the code block above, change the color of the
geom_point()
layer that plots Petal.Length and Petal.Width to"#e69f00"
2. In the code block above, change the color of thegeom_point()
layer that plots Sepal.Length and Sepal.Width to"#56b4e9"
3. In the code block above, change the x-axis label (xlab()) to “Length” 4. In the code block above, change the y-axis label (ylab()) to “Width”
This at least allows us to discern the two different sets of points.
In general, for this type of plot, we would want our color to be an
aesthetic rather than defined within the layer so that
the colors are reflected in the legend. To do that,
move the color attribute out of the layer’s arguments
and into the aesthetic definition (aes()
)
instead.
Run the following code block to see how including the color as an aesthetic results in the addition of a legend to the plot:
iris_plot <- ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color="Petal")) +
geom_point() +
geom_point(aes(x=Sepal.Length, y=Sepal.Width, color="Sepal")) +
xlab("Length") +
ylab("Width") +
## When prompted, add a manual color scale layer here
scale_color_manual(labels=c("Petal", "Sepal"), values=c("#e69f00", "#56b4e9"), name="Flower part")
## When prompted, add a manual color scale layer here
# Demonstrating how to specify "color" as an aesthetic
print(iris_plot)
# Display the plot
This is a little better. We now have a legend, however that legend does not describe our variables very well (Petal and Sepal), and the colors do not match the hex codes we have defined. To address this, edit the code block above in the following way:
Exercise: 1. In the code block above, add a manual color scale layer as follows:
scale_color_manual(labels=c("Petal", "Sepal"), values=c("#e69f00", "#56b4e9"))
.
- In the code block above, change the color definition in the aesthetics to match the label of the manual color scale, e.g.
color="Petal"
instead ofcolor="#e69f00"
.
Great! Now the colors match the hex codes we wanted them to be and the labels in the legend are coherent.
- One final thing in the code block above, is to change the name of the legend by adding it to the manual color scale, e.g.
name="Flower part"
.
Later on, we will learn how to group and color data points by a third variable in a single data frame, rather than having separate layers with different aesthetics as we’ve done here.
For the next portion of the workshop, we’ll be constructing some more basic types of plots with ggplot, but this time with a genomic dataset that will need some filtering and manipulation with some of the tidyverse skills you learned in the last workshop.
Specifically, the data we’ll be looking at are structural variants in a small population of rhesus macaques. Rheseus macaques are small, Old-World monkeys that are widespread across southern and eastern Asia. We sequenced and are a common model organism for the study of human disease and primate evolution. We sequenced these genomes to study the evolution structural variation over different timescales.
We’ve provided the location in the reference macaque genome of the structural variants in our population of macaques in a tab-delimited file that contains genomic intervals (commonly known as a bed file).
Run the following code block to read the file containing the macaque structural variant calls (SVs) into our R environment as a data frame and take a look at a few lines of the data:
macaque_svs <- read_tsv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/macaque-svs-filtered.bed", col_names=c("chr", "start", "end", "sv"))
# Read the data containing macaque SV calls as a data frame
macaque_svs
## # A tibble: 3,646 x 4
## chr start end sv
## <chr> <dbl> <dbl> <chr>
## 1 chr1 89943 90471 chr1:89943:<DUP>:528:1907.19
## 2 chr1 130740 131675 chr1:130740:<DEL>:935:285.63
## 3 chr1 218574 219534 chr1:218574:<DUP>:960:5699.01
## 4 chr1 219608 220078 chr1:219608:<DUP>:470:2074.69
## 5 chr1 519434 541582 chr1:519434:<DUP>:22148:1673.64
## 6 chr1 519473 542033 chr1:519473:<DUP>:22560:2560.16
## 7 chr1 520173 541800 chr1:520173:<DEL>:21627:2955.11
## 8 chr1 525401 525806 chr1:525401:<DEL>:405:2986.21
## 9 chr1 541132 590572 chr1:541132:<DEL>:49440:316.41
## 10 chr1 552968 582234 chr1:552968:<DUP>:29266:189.32
## # ... with 3,636 more rows
# Display the preview of the data frame
This shows a relatively simply structured dataset, with the first
column indicating the scaffold of the structural variant, the second
showing the starting coordinate on that chromosome and the third showing
the end coordinate. These three columns are typical of a bed file. The
fourth column is a bit of a catchall (like many last columns in
bioinformatics file formats are) that provides an attempt to give a
unique ID for each structural variant. We’ll talk about this column more
in a bit, but do not that within this column we see the TYPE of
structural variant coded as for deletion.
Exercise: How many chromosomes do macaques have?
Let’s first look at the shape of the distribution of SV lengths in our dataset using a histogram. Histograms take a set of data and partition them into bins. The number of data points in each bin is then counted and a bar with a height representing that count is placed at that bin’s label.
To make a histogram of lengths we’ll first need to add a column with the length of each SV to the dataset.
Exercise: In the code block below, use
mutate()
to add the length of each SV as another column in the macaque_svs data frame. Call this column ‘length’. What is the average length of all structural variants in the dataset?
## When prompted, add a column to the macaque_svs data frame that contains the length of each SV
## Call this column 'length'
macaque_svs = macaque_svs %>% mutate(length = end - start)
## When prompted, add a column to the macaque_svs data frame that contains the length of each SV
## Call this column 'length'
mean(macaque_svs$length)
## [1] 3615.021
Now, let’s make a histogram to visualize the shape of the distribution of lengths.
Run the following block of code to visualize the distribution of SV lengths as a histogram:
## When prompted, calculated the mean SV length as mean_sv_len
mean_sv_len <- macaque_svs %>% filter(length <= 5000) %>% pull(length) %>% mean()
## When prompted, calculated the mean SV length as mean_sv_len
sv_len_hist <- macaque_svs %>%
mutate(length = replace(length, length > 5000, 5000)) %>%
ggplot(aes(x=length)) +
geom_histogram(color="#999999") +
## When prompted, add a layer with geom_vline() at the mean SV length
geom_vline(xintercept=mean_sv_len, linewidth=1, linetype="dotted", color="#920000") +
## When prompted, add a layer with geom_vline() at the mean SV length
## When prompted, add a layer with annotate() to display the mean SV length
annotate("text", x=mean_sv_len+300, y=600, label=signif(mean_sv_len, 6)) +
## When prompted, add a layer with annotate() to display the mean SV length
## When prompted, add a layer with scale_y_continuous()
scale_y_continuous(expand=c(0,0)) +
## When prompted, add a layer with scale_y_continuous()
## When prompted, add a layer with ylab() to change the title of the y-axis
ylab("# of SVs")
## When prompted, add a layer with ylab() to change the title of the y-axis
print(sv_len_hist)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Display the plot
What are some things we see about this distribution?
By default, geom_histogram()
partitions the data into 30
bins, however we can customize this with the
bins=
option within that layer.
Exercise: In the block of code above, change the number of bins in the histogram above. How does this change how we view the distribution?
In order to more accurately see the shape of the distribution, use
mutate to replace any SV with a length > 5000 with
5000. Be careful not to change the original data! Hint - the
replace()
function takes a vector, a logical index, and a
value to replace. So
replace(length, length > 5000, 5000)
replaces every
value of the first vector where the second argument is TRUE
with the third argument.
Exercise: In the block of code above, add a line using the
mutate()
function to replace SVs shorter than 5000bp from themacaque_svs
data frame with a value of 5000 and re-generate the plot.
Now we can see that most of the SVs are shorter than 2000bp and many are even shorter than 1000bp.
In addition to layers like geom_line()
,
which requires x and y coordinates, there are also
layers like geom_hline()
,
geom_vline()
, and geom_abline()
that let us
add horizontal, vertical, and custom lines to the plot. Let’s add a
layer that shows a vertical line at the average SV size
using geom_vline()
as follows. You’ll need to define the
mean first, since we want to exclude outliers (SVs > 5000 bp) from
the mean calculation.
Exercise: 1. Calculate the mean SV length in the macaque dataset and stored it as an object called
mean_sv_len
. 2. Add the following layer to thesv_len_hist
object in the code block above and re-generate the plot. Customize the look of the line any way you want (e.g. withcolor=
,linetype=
,linewidth=
, etc.)
geom_vline(xintercept=mean_sv_len)
We can also add text as a layer to the plot. If we had text to add
for every point, we may use something like geom_text()
or
geom_label()
to layer it on. However, for now let’s add
only one bit of text to this plot: the average SV length, next to the
line we added. To do this, we can use an annotate()
layer
and directly provide the text and the x and y coordinates at which to
display it. This can take a little trial and error to get right, so I’ve
already put it in the right spot (for my display).
Add this layer to the
sv_len_hist
object in the code above and re-generate the plot:
annotate("text", x=mean_sv_len+300, y=600, label=signif(mean_sv_len, 6))
There are some other things we can do to improve this plot:
Exercise: 1. I personally dislike the space between the bars of the histogram and the x-axis. We can remove this by changing the y-axis. Add the following layer to the histogram above to do so:
scale_y_continuous(expand=c(0,0))
- I also like to see the boundaries between the histogram bins (bars). Add a
color=
to the histogram layer to see them.
- Provide a more descriptive y-axis label with a
ylab()
layer.
The histogram allows us to clearly see the shape of the distribution of interest, and we’ve added some nice annotations to the plot. But the best way to summarize a distribution is with a boxplot. Boxplots are also a good way to visualize shifts in multiple distributions, since they can easily be placed next to each other, unlike a histogram which uses both axes to plot data. In this case, let’s filter out the SVs longer than 5000 bp first, using the filter funciton,.
Generate the boxplot of SV lengths in the code block below:
sv_len_boxplot <- macaque_svs %>%
filter(length <= 5000) %>%
ggplot(aes(x=0, y=length)) +
geom_boxplot(width=0.5) +
scale_x_continuous(limits=c(-1,1)) +
# For the box plot, to prevent the box from spanning the width of the grid we specify x=0 in the aesthetics and set the limits to the x-axis here
## When prompted, add a layer with geom_point() that shows a single point at the mean SV length
geom_point(aes(x=0, y=mean(length)), size=3, color="blue")
## When prompted, add a layer with geom_point() that shows a single point at the mean SV length
print(sv_len_boxplot)
# Display the plot
The boxplot shows the median length as the line in the box, the first
and third quartiles as the top and bottom of the box, respectively, and
the min and max values (excluding outliers) at the end of the line. With
so many outliers, my guess is that the mean SV lengths is larger than
the median. Let’s add a geom_point()
layer
with a single point for the mean SV length. This is similar to adding
the line to the histogram, but slightly more complicated for a couple of
reasons:
Add the following layer to the boxplot in the code block above to add a point for the mean SV length:
geom_point(data=data.frame(), aes(x=0, y=mean_sv_len), size=3, color="blue")
Is the mean larger than the median like we expected?
In general, medians are more robust statistics to summarize a distribution than means since they are less skewed by outliers.
In addition to plotting x and/or y data points, ggplot allows us to group data points by a third variable and plot them in distinct ways (e.g. with different colors) in order to visualize them separately. This is similar to what we did manually with the Sepal and Petal points in our scatter plot above, but ggplot does this natively as long as another variable exists in the data frame that indicates the desired groupings.
Let’s say we want to know if the length of deletions differs from that of duplications. Recall that the fourth column of our data encodes a lot of information about the SV in a string separated by colons. Some of this is redundant with information we already have, but some of it is unique and useful. Specifically, in the 4th column we have:
chromosome;start coordinate;SV type;length;score
So this has the information we want (SV type), but not in a format
that is easy to group by. To get this information out we can
separate()
it. Below shows how we can use
separate()
to add columns to our data based on strings in
existing columns and an example of how to plot the distributions of the
length of these two types of SVs. We’ll also filter out SVs longer than
5000 bp.
Run the code block below to separate the information in the 4th column of our data frame into individual columns of interest:
macaque_svs_sep <- macaque_svs %>%
filter(length <= 5000) %>%
separate(sv, c(NA, NA, "type", NA, "score"), sep=":", remove=FALSE)
# Separate the information in the sv column of macaque_svs intro separate columns
# - Split whenever a colon (:) is seen
# - Keep only the third and fifth splits, and name the new columns "type" and "score"
macaque_svs_sep
## # A tibble: 3,227 x 7
## chr start end sv type score length
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 chr1 89943 90471 chr1:89943:<DUP>:528:1907.19 <DUP> 1907.19 528
## 2 chr1 130740 131675 chr1:130740:<DEL>:935:285.63 <DEL> 285.63 935
## 3 chr1 218574 219534 chr1:218574:<DUP>:960:5699.01 <DUP> 5699.01 960
## 4 chr1 219608 220078 chr1:219608:<DUP>:470:2074.69 <DUP> 2074.69 470
## 5 chr1 525401 525806 chr1:525401:<DEL>:405:2986.21 <DEL> 2986.21 405
## 6 chr1 766381 766933 chr1:766381:<DEL>:552:5099.0 <DEL> 5099.0 552
## 7 chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55 <DEL> 201.55 4326
## 8 chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32 <DEL> 11284.32 2676
## 9 chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03 <DEL> 15253.03 1196
## 10 chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53 <DEL> 1112.53 1145
## # ... with 3,217 more rows
# Display the preview of the new data frame
So we can see that there are now two more columns in our data frame: type, which tells us whether the SV is a deletion (<DEL>) or duplication (<DUP>), and score which is just how the program that inferred the SVs scored the call.
Now that we have this information, we want to construct a boxplot with ggplot that uses this new data to summarize the distribution of length of the SVs and group the SVs on the x-axis by type. In the case of a boxplot, we simply add the group variable as the x aesthetic.
Run the code block below to generate boxplots summarizing the distribution of SV lengths for both types of SVs:
macaque_svs_sep %>%
ggplot(aes(x=type, y=length)) +
geom_boxplot(width=0.5) +
theme(axis.text.x=element_text(angle=25, hjust=1))
# Piping the macaque SV data with SV type as a column to ggplot to
# generate boxplots of SV length by type
We can group by any variable we like in our input data, by using it as the x aesthetic. For example:
Run the code block below to generate a plot with multiple boxplots, 1 per chromosome:
## When prompted, group the macaque SVs by chromosome and summarize the mean length for each
macaque_svs_chr <- macaque_svs_sep %>%
group_by(chr) %>%
summarize(mean.len=mean(length))
## When prompted, group the macaque SVs by chromosome and summarize the mean length for each
sv_len_boxplot <- macaque_svs_sep %>% ggplot(aes(x=chr, y=length)) +
geom_boxplot(width=0.5) +
## When prompted, add a layer from the grouped chromosome data that displays a point for the mean in the box of each chromosome
geom_point(data=macaque_svs_chr, aes(x=chr, y=mean.len), color="blue") +
## When prompted, add a layer from the grouped chromosome data that displays a point for the mean in the box of each chromosome
theme(axis.text.x=element_text(angle=25, hjust=1))
# This theme layer adjusts the angle of the labels on the x-axis - we'll learn more about theme() later
print(sv_len_boxplot)
# Display the plot
Grouping like this changes the x-axis from a continuous to a discrete scale. Here we can see that, for the most part (with the exception of the Y chromosome), the length of the SVs share a similar distribution between chromosomes.
summarize()
Previously, we showed you how the summary()
function
will print out basic summary statistics for all columns in a data frame.
Today, we introduce the function summarise()
for generating
specific statistics on one or more variables.
Run the code block below to count the number of chromosomes in the macaque dataset:
macaque_svs_sep %>% summarize(n=n_distinct(chr))
## # A tibble: 1 x 1
## n
## <int>
## 1 22
# Summarize the macaque dataset by counting the number of distinct entries in the chr column
This function call returns a data frame with 1 column (named “n”) and 1 row.
We could also use this to generate more than one summary statistic, resulting in a new data frame with 1 row and multiple columns, and with other functions as well.
Run the code block below to generate a summary of the counts of chromosomes and the average SV length:
macaque_svs_sep %>% summarize(n=n_distinct(chr), mean_len=mean(length, na.rm=TRUE))
## # A tibble: 1 x 2
## n mean_len
## <int> <dbl>
## 1 22 1041.
# A summarize command that performs multiple summaries
In this way we can summarize our dataset as a whole in a single command.
group_by()
In most data sets, summarizing across all observations (rows) may not
be particularly informative, because what one usually wants to do is to
get information specific to different groups of observations, e.g. those
in different habitats, exposed to different experimental treatments, on
different chromosomes, etc. Thus, in most circumstances one will want to
define those groups for a tibble. One can do this using the
group_by()
function, which takes as it’s first argument the
name of the data frame, and the following arguments are variables to
group by; these can then be followed by other keyword arguments). An
obvious grouping for the sv_macaques
tibble is by
chromosome.
Run the code block below to group the macaque dataset by chromosome:
by_chr <- group_by(macaque_svs_sep, chr)
# A group_by command to group the macaque SVs by chromosome
by_chr
## # A tibble: 3,227 x 7
## # Groups: chr [22]
## chr start end sv type score length
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
## 1 chr1 89943 90471 chr1:89943:<DUP>:528:1907.19 <DUP> 1907.19 528
## 2 chr1 130740 131675 chr1:130740:<DEL>:935:285.63 <DEL> 285.63 935
## 3 chr1 218574 219534 chr1:218574:<DUP>:960:5699.01 <DUP> 5699.01 960
## 4 chr1 219608 220078 chr1:219608:<DUP>:470:2074.69 <DUP> 2074.69 470
## 5 chr1 525401 525806 chr1:525401:<DEL>:405:2986.21 <DEL> 2986.21 405
## 6 chr1 766381 766933 chr1:766381:<DEL>:552:5099.0 <DEL> 5099.0 552
## 7 chr1 1117696 1122022 chr1:1117696:<DEL>:4326:201.55 <DEL> 201.55 4326
## 8 chr1 1151866 1154542 chr1:1151866:<DEL>:2676:11284.32 <DEL> 11284.32 2676
## 9 chr1 1166390 1167586 chr1:1166390:<DEL>:1196:15253.03 <DEL> 15253.03 1196
## 10 chr1 1408621 1409766 chr1:1408621:<DEL>:1145:1112.53 <DEL> 1112.53 1145
## # ... with 3,217 more rows
# Display the grouped data frame
Notice that in the header of the by_chr
tibble there is
now an indication that there are 22 groups. It is important to note that
group_by()
does not change the data in the tibble: there is
no column that specifies the groups. Most commonly, the
group_by()
function is combined with other functions (via
piping) to build datasets by group. For instance, combining
group_by()
and summarize()
gives us a lot of
flexibility.
Run the code block below to calculate the number of SVs per chromosome:
macaque_svs_sep %>% group_by(chr) %>% summarize(num_svs=n_distinct(sv))
## # A tibble: 22 x 2
## chr num_svs
## <chr> <int>
## 1 chr1 216
## 2 chr10 135
## 3 chr11 140
## 4 chr12 151
## 5 chr13 141
## 6 chr14 128
## 7 chr15 132
## 8 chr16 110
## 9 chr17 153
## 10 chr18 84
## # ... with 12 more rows
# A piped example using both group_by() and summarize()
# Alternatively, just use n()
What if we want to add a layer of points to the boxplot above that indicates the mean SV lengths for each chromosome? In this case, we’ll need to summarize the data per chromosome as a new data frame. To do this, do the following:
Exercise: 1. In the code block above where we generated the boxplot of SV length by chromosome (
sv-len-boxplot-chr
), create a new data frame calledmacaque_svs_chr
that has two columns: the chromosome label and the average SV length on that chromosome.
- Add a layer with
geom_point()
that uses this data frame on a discrete x-axis scale by chromosome that shows the mean length as a point. Style these points any way you like.
So far we’ve used histograms to look at the shape of
distributions and boxplots to summarize them. But we
can do both in a single visualization. There are many ways to do this,
but one way is to use a violin plot with
geom_violin()
in combination with
geom_boxplot()
. Violin plots are similar to
density plots, but flipped on their side and
mirrored.
Run the code block below to take the distributions of SV length for DELs and DUPs that we created above and convert it into a violin plot:
sv_len_violin_type <- ggplot(macaque_svs_sep, aes(x=type, y=length)) +
geom_violin() +
## Uncomment when prompted
geom_boxplot(width=0.1, outlier.shape=NA) +
## Uncomment when prompted
theme(axis.text.x=element_text(angle=25, hjust=1))
# Construct a violin plot with boxplot to summarize and show shape of SV length distributions by type
print(sv_len_violin_type)
# Display the plot
This is an interesting way to show the shape of each distribution next to each other. But we can also summarize the distribution as well with a boxplot to give us even more information that’s easy to understand within this single plot.
In the code block above, uncomment the line that adds the boxplot to see both the shape and the summary of the distribution.
One benefit of ggplot’s layering syntax is that layers can be added
to a plot later on, not just when defining the graphical object (or
grob). For instance, if we wanted to add points for the
mean length of DELs and DUPs to the above violin plot we could just take
the grob as we’ve named it (sv_len_violin_type
) and add
onto it with +
.
Run the code block below to add points representing the mean SV length for each SV type to the violin plot we already generated:
macaque_svs_type <- macaque_svs_sep %>%
group_by(type) %>%
summarize(mean.len=mean(length))
# Group the SVs and calculate the mean length for each type
sv_len_violin_type_w_means <- sv_len_violin_type + geom_point(data=macaque_svs_type, aes(x=type, y=mean.len), size=3, color="blue")
# Add a layer to the previous violin plot that adds points that indicate the mean length of each SV type
print(sv_len_violin_type_w_means)
# Display the plot
Another basic type of plot that is useful for comparing cumulative counts, amounts, or sums is the barplot.
Run the code block below to make a barplot of the total count of SVs per chromosome:
sv_count_bar_chr <- ggplot(macaque_svs_sep, aes(x=chr)) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=25, hjust=1))
# A barplot of the counts of SVs per chromosome
print(sv_count_bar_chr)
# Display the plot
From this we can easily see differences in the number of SVs per
chromosome. Notice that even though we provided the whole data frame,
which doesn’t contain counts per chromosome, we’ve been able to
summarize by count within the ggplot call itself by using the
stat="count"
parameter in the geom_bar()
layer.
A way to make this look nicer would be to order the bars by counts so that the bars with the highest counts appear first when reading the plot from left to right. In order to do this we’ll have to do two things:
Run the code block below to re-roder the bars in the barplot from most SVs to least SVs (highest bars to lowest bars):
macaque_svs_chr_count <- macaque_svs_sep %>%
group_by(chr) %>%
mutate(count=n())
# Count the number of SVs per chromosome
sv_count_bar_chr_ordered <- ggplot(macaque_svs_chr_count, aes(x=reorder(chr,-count))) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=25, hjust=1))
# A barplot of the counts of SVs per chromosome
print(sv_count_bar_chr_ordered)
# Display the plot
One may also want to instead order by chromosome number. This is
slightly more difficult because we can’t use a built-in R function like
reorder()
– the chromosomes, being a mix of characters and
integers, are already sorted in terms of alphabetical order (i.e. chr1
followed by chr10 instead of chr2). Instead, we will have to manually
convert the chr
column to a factor and specify the orders
of the levels.
Run the code block below to re-order the bars by chromosome number
macaque_svs_sep <- macaque_svs_sep %>%
mutate(chr = factor(chr, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chrX", "chrY")))
# Convert the chr column from to a factor and order the levels of the factor manually
# This ordering will be reflected in the orders of the bars when plotted
sv_count_bar_chr_ordered <- ggplot(macaque_svs_sep, aes(x=chr, fill=type)) +
geom_bar(stat="count", color="#333333") +
theme(axis.text.x=element_text(angle=25, hjust=1))
# A barplot of the counts of SVs per chromosome
print(sv_count_bar_chr_ordered)
# Display the plot
We could also do this by removing the “chr” string from each SV label and converting to numeric.
Again, it may be useful to visualize multiple aspects of our data by
grouping different variables. With barplots there are two ways to do
this, either by stacking bars for different groups or
by dodging bars (placing bars for different groups next
to each other). geom_bar()
’s default behavior when given a
grouping variable is to stack the bars.
Exercise: In the code block above, add a
group=type
parameter to the aesthetic in order to divide each bar by SV type.
What happens? Nothing, seemingly. This is because we’ve essentially stacked two bars that look exactly the same on top of each other. For stacked barplots we need another way to distinguish the groupings: color.
ggplot aesthetics usually have two ways to be colored, either with
color=
or fill=
. color refers
to the color of the line that defines the border of a shape while
fill refers to the color of the inside of the shape.
Let’s see the difference and see how stacked bar plots look:
Exercise: 1. In the code block above, replace
group=type
withcolor=type
in the aesthetics of the barplot above.
What happens? Now we can see a difference between the two types of SV calls for each chromosome, though not very distinctly.
- In the code block above, now instead of
color=type
, usefill=type
. These differences are more distinct now.- In the code block above, both
fill
andcolor
can be defined for a given aesthetic. Addcolor="#333333"
to thegeom_bar()
layer of the barplot to see. By adding this to the layer rather than the aesthetics we have specified to color all bars the same way, regardless of grouping.
We can also use stacked barplots to compare proportions of amounts in
groups. One simply has to make a single adjustment, in which one
specifies position="fill"
. It does seem a bit odd to
specify one of the dimensions of the mapping definition fill in
quotes … but it results in some ggplot magic!
Run the code block below to show the same bar plot now expressed as proportions
sv_count_bar_chr_prop <- ggplot(data=macaque_svs_sep, aes(x=chr, fill=type)) +
geom_bar(position="fill", color="#333333") +
theme(axis.text.x=element_text(angle=25, hjust=1)) +
## When prompted, add a vertical line at x=0.10 with geom_vline()
geom_hline(yintercept=0.10, linetype="dotted", color="#333333") +
## When prompted, add a vertical line at x=0.10 with geom_vline()
theme(axis.text.x=element_text(angle=25, hjust=1))
print(sv_count_bar_chr_prop)
Here we can see that the SVs from this macaque sample are predominantly deletions. But how extreme is this?
Exercise: In the code block above add a horizontal line at x=0.10 with geom_vline() to see how many chromosomes are over 90% deletions.
Exercise: In the code block below, re-write all of the commands leading up to generating the bar plot in the code block named
sv-count-barplot-chr
as a single piped command. This command should do the following: 1. Read the data from a file 2. Add a column representing the length of each SV 3. Filter out SVs longer than 5000bp or on the X or Y chromosomes 4. Separates the SV column to add a column of SV type 5. Plots a barplot of number of SVs grouped by chr Don’t save any data to named objects.
## Write a single command to do all the steps to generate the barplot above without saving any objects
read_tsv("https://harvardinformatics.github.io/workshops/2023-spring/r/data/macaque-svs-filtered.bed", col_names=c("chr", "start", "end", "sv")) %>%
mutate(length=end-start) %>%
filter(length < 5000 & !chr == "chrX" & !chr == "chrY") %>%
separate(sv, c(NA, NA, "type", NA, "score"), sep=":", remove=FALSE) %>%
mutate(chr = factor(chr, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chrX", "chrY"))) %>%
ggplot(aes(x=chr, fill=type)) +
geom_bar(stat="count") +
theme(axis.text.x=element_text(angle=25, hjust=1))
## Rows: 3646 Columns: 4
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (2): chr, sv
## dbl (2): start, end
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#group_by(chr) %>%
#summarize(count=n()) %>%
#ggplot(aes(x=chr, y=count, fill=type)) +
# geom_bar(stat="identity") +
# theme(axis.text.x=element_text(angle=25, hjust=1))
## Write a single command to do all the steps to generate the barplot above without saving any objects
Thanks for joining! Let us know if you have any questions!
There is a lot of material to discuss about plotting. We’ve included some extra material here you can look at after the class today.
ggplot also provides an easy way to compare groups in different
panels called facetting. facet_wrap()
and
facet_grid()
allow you to split your plot into multiple
panels by another group variable. Note that this way of creating a
multi-panel figure is quite different from the split_grid()
function we demonstrated at the beginning of this workshop.
split_grid()
aggregates plots generated by separate ggplots
into multiple panels, which typically are comprised of
different (if related) data. In contrast
facet_wrap()
, splits the same overall data set
into facets based upon grouping variables that subdivide the data so you
can compare and contrast patterns in different groups (or pairs of
groupings).
Run the code block below to split the violin plots showing the distribution of lengths in SV types by chromosome with
facet_wrap()
:
## When prompted, group the svs by both "chr" and "type" and calculate the mean length
macaque_svs_chr_type <- macaque_svs_sep %>%
group_by(chr, type) %>%
summarize(mean.len=mean(length))
## `summarise()` has grouped output by 'chr'. You can override using the `.groups`
## argument.
## When prompted, group the svs by both "chr" and "type" and calculate the mean length
sv_len_violin_type_facet <- sv_len_violin_type +
## When prompted, add a layer with geom_point() that shows the mean length of each type
geom_point(data=macaque_svs_chr_type, aes(x=type, y=mean.len), size=1, color="blue") +
## When prompted, add a layer with geom_point() that shows the mean length of each type
facet_wrap(~chr)
# From the original violin plot that groups by type, add the points for mean length
# and facet_wrap() by chr
print(sv_len_violin_type_facet)
# Display the plot
So now we’ve essentially grouped our data by two variables: SV type and chromosome. SV type shows up as different violin plots within each panel, and the chromosomes are split between panels. This is an easy way to generate multi-panel figures, but only works for figures of the same type with groupings in your data. We’ll cover more complex multi-panel figures (of different types with different data) later on.
How would one add the points for means to each violin plot on this
facetted figure? Similar to above, we can use group_by()
and summarize()
, but this time we will group by both of our
variables, chr
and type
.
Exercise: 1. In the code block above, before generating the ggplot, write a line of code to summarize the data by
chr
ANDtype
. Save this information in a new data frame calledmacaque_svs_chr_type
2. Add a layer to the plot that adds points representing the mean to the figure.
Colors on graphs can not only make them look nice, but also convey
important information, such as differences between groups or the extent
to which values vary (e.g. heatmaps). ggplot allows us to color both
continuous and discrete variables, the
main difference being that continuous variables are
colored by gradients and discrete ones are colored with
individual colors. We also know that aesthetics can have both their
border colored with color
and their shape colored with
fill
.
Let’s review how we colored our scatter plot of iris data above.
iris-plot-4
, we have added two
layers of geom_point()
to plot two
different groups from the data: Sepal size and Petal size. Here we have
colored each layer separately with a
color=
parameteriris-plot-5
, we have colored our two
layers by defining their color group in the aesthetics.
We then added a manual color layer with
scale_color_manual()
to define the colors and how they
should appear in the legend.In both cases, we have a discrete color scale.
Let’s take another look at the boxplot we generated showing the distribution of SV lengths of both types (DEL and DUP).
Run the code block below to re-generate the violin plot from above:
sv_len_violin_type_clr <- ggplot(macaque_svs_sep, aes(x=type, y=length, fill=type)) +
geom_violin() +
geom_boxplot(width=0.1, outlier.shape=NA, color="#ececec", show.legend=FALSE) +
## When prompted, add a scale_fill_manual() layer
scale_fill_manual(values=c("#920000", "#004949")) +
## When prompted, add a scale_fill_manual() layer
theme(axis.text.x=element_text(angle=25, hjust=1))
# Construct a violin plot with boxplot to summarize and show shape of SV length distributions by type
print(sv_len_violin_type_clr)
# Display the plot
Now let’s add some color:
Exercise: In the code block above, add a
color=type
definition to the aesthetics of the plot and regenerate the figure. In the code block above, now change it tofill=type
. In the code block above, now add ascale_fill_manual(values=c("#920000", "#004949"))
layer. This defines two fill colors, one for each SV type. Use the scatter plot iniris-plot-5
for reference.
What is one problem with this plot the way it is?
Colors of different layers can be defined independently. If added to the layer directly, no distinction is made between groups.
In the code block above, add a
color="#ececec"
parameter to thegeom_boxplot()
layer.
So far, we’ve looked at coloring discrete variables: groups of SV types or chromosomes. However, we can also color continuous variables, which are numbers that could take any value.
To get a sense for this, let’s look at a scatterplot of average SV length vs. chromosome length for each chromosome in the macaque genome. First, we’ll have to read in a table of chromosome lengths for the macaque reference genome.
Run the code block below to read in a data table of chromosome lengths for the macaque reference genome:
macaque_chr_lens <- read.table("https://harvardinformatics.github.io/workshops/2023-spring/r/data/rheMac8.chrom.sizes")
# Read the table with the lengths for each chromosome
names(macaque_chr_lens) <- c("chr", "chr.len")
# Add column names since they aren't contained in the file
macaque_chr_lens
## chr chr.len
## 1 chr1 225584828
## 2 chr2 204787373
## 3 chr5 190429646
## 4 chr3 185818997
## 5 chr6 180051392
## 6 chr4 172585720
## 7 chr7 169600520
## 8 chrX 149150640
## 9 chr8 144306982
## 10 chr11 133663169
## 11 chr9 129882849
## 12 chr14 127894412
## 13 chr12 125506784
## 14 chr15 111343173
## 15 chr13 108979918
## 16 chr17 95684472
## 17 chr10 92844088
## 18 chr16 77216781
## 19 chr20 74971481
## 20 chr18 70235451
## 21 chr19 53671032
## 22 chrY 11753682
# View the table
Ok, so this is a pretty simple dataset: just the chromosome label its the length in base pairs. Let’s generate a scatter plot of chromosome length vs. average SV length.
Run the code block below to make a scatter plot of chromosome length and average SV length:
macaque_svs_chr_merge <- merge(macaque_svs_chr, macaque_chr_lens, by="chr")
# First, merge the chromosome length data with our SV length data
macaque_svs_chr_merge <- macaque_svs_chr_merge %>% filter(chr != "chrY")
# Remove the Y chromosome now since it is too much of an outlier
sv_chr_len_scatter <- ggplot(macaque_svs_chr_merge, aes(x=chr.len, y=mean.len, color=chr)) +
geom_point(size=4) +
geom_smooth(method="lm", se=F, fullrange=T, color="#333333")
# Plot chromosome length and average sv length as points on a scatter plot and add a regression line with geom_smooth()
print(sv_chr_len_scatter)
## `geom_smooth()` using formula = 'y ~ x'
# Display the plot
Ok, so it looks like there is some correlation between chromosome length and average SV length. But how can we tell which chromosomes are which? One way is with colors (and a legend):
Exercise: Add a
color=chr
parameter to the aesthetics of the plot.
Now we can at least link each point to a label in the legend (though the default color scheme here makes some of them difficult to discern).
However this is still a discrete coloring.
We can color these points by a third variable. Remember the “score” variable we have for each SV? This is a continuous variable so let’s color each point by average score instead to see which chromosomes have the highest scoring SVs and see if there is any pattern between the three variables (chromosome length, mean SV length, and mean score).
Run the following code block to re-generate the scatter plot and color the points by score:
## Load the viridis package here after installation
library(viridis)
## Load the viridis package here after installation
macaque_svs_sep$score <- as.numeric(macaque_svs_sep$score)
# Convert the score to a numeric data type so we can average them per chr
macaque_svs_chr_score <- macaque_svs_sep %>% group_by(chr) %>% summarize(mean.len=mean(length), mean.score=mean(score, na.rm=T))
# Group the data by chr and average both the length and the score per chr
macaque_svs_chr_score <- merge(macaque_svs_chr_score, macaque_chr_lens, by="chr")
# Merge the chromosome length data with our SV length data
macaque_svs_chr_score <- macaque_svs_chr_score %>% filter(chr != "chrY")
# Remove the Y chromosome now since it is too much of an outlier
sv_chr_len_scatter <- ggplot(macaque_svs_chr_score, aes(x=chr.len, y=mean.len, color=mean.score, label=chr)) +
geom_point(size=4) +
geom_smooth(method="lm", se=F, fullrange=T, color="#333333") +
## Add a geom_text() layer here when prompted
geom_text(hjust=-0.2, vjust=0.2) +
## Add a geom_text() layer here when prompted
## Add a scale_color_continous() layer here when prompted
#scale_color_continuous(low="#920000", high="#f0e442") +
## Add a scale_color_continous() layer here when prompted
## Add a scale_color_viridis() layer here when prompted
scale_color_viridis(option="A") +
## Add a scale_color_viridis() layer here when prompted
## Add a theme() layer here when prompted
theme_classic() %+replace%
theme(axis.text=element_text(size=12),
axis.title=element_text(size=16),
axis.title.y=element_text(margin=margin(t=0,r=10,b=0,l=0),color="black",angle=90),
axis.title.x=element_text(margin=margin(t=10,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size=1),
axis.ticks.length=unit(0.2,"cm"),
legend.position="right",
legend.key.width=unit(0.75, unit="cm"),
legend.spacing.x=unit(0.25, 'cm'),
legend.title=element_blank(),
legend.text=element_text(size=12),
plot.title=element_text(hjust=0.5, size=20),
plot.margin=unit(c(1,1,1,1), "cm")
)
## Add a theme() layer here when prompted
# Plot chromosome length and average sv length as points on a scatter plot and add a regression line with geom_smooth()
print(sv_chr_len_scatter)
# Display the plot
Here, we’ve essentially visualized 3 variables at once by adding the color aesthetic. No patterns between score and the other two variables pop out to me.
So far, we’ve only used the default ggplot color scales. But these
are completely customizable. We’ve already seen how to define
manual color and fill scales for
discrete variables (see code blocks
iris-plot-5
and sv-len-violin-type-2
). For
continuous variables, the process is similar but instead of defining
exact color values, we define the extremes of the color gradient.
For instance, in the code block above, add the following layer to the scatter plot:
scale_color_continuous(low="#920000", high="#f0e442")
There are other layer types for continuous
variables, like scale_color_gradient()
that offer even more
control over the colors.
Of course, we don’t always want to be picking colors ourselves. Fortunately there are several pre-compiled color palettes out there, most popular among them is the viridis color scale. These can be easily added as layers to a ggplot with the viridis package. To use viridis color scales, do the following:
Exercise: 1. Use the Packages tab in the lower right hand panel of the RStudio interface to install the viridis package. 2. Once installed, load the package in the previous code block and replace the
scale_color_continuous()
layer with ascale_color_viridis()
layer.
Viridis layers have several options for different color scales
available with the option=
parameter.
What happens when you add
option="A"
to thescale_color_viridis()
layer? Try out different letter options to see which one you like.
Viridis scales can also be used on discrete variables with the
discrete=TRUE
parameter.
One major reason for summarizing data in visual represenations is to convey its meaning and interpretation to the viewer. And that means its important that every viewer is able to interpret the data readily. Using colorblind-friendly color palettes ensures this is the case and is something I always strive to do.
According to this page, the default colors used by ggplot can be difficult to discern for viewers with colorblindness. Fortunately, there are plenty of resources to help us pick out colorblind-friendly palettes, as a quick internet search will show. The same page also shows that viridis and its variants are more colorblind friendly.
Here are my preferred discrete colorblind-friendly palettes:
c("#db6d00", "#004949", "#006ddb", "#920000", "#490092", "#6cb6ff", "#24ff24", "#fdb4da", "#ffff6d", "#009292", "#924900", "#000000")
# I obtained this palette from a web resource that is no longer active :(
c("#e69f00", "#56b4e9", "#009e73", "#f0e442", "#0072b2", "#d55e00", "#cc79a7", "#000000")
# I got this palette from Claus Wilke's book
For continuous scales, I would use Viridis.
Another way to differentiate discrete points is by adding text to the plot. For instance, in our scatter plot above, we know that each point corresponds to a different chromosome, but it is unclear which point is which chromosome. Even with the discrete coloring by chromosome there are simply too many similar points to tell clearly.
Fortunately, another aesthetic can be added:
label
. With label=
, we can use additional
layers to tell how to annotate the plot with text.
In the code block above, add the
label=chr
aesthetic, and then add the followinggeom_text()
layer:
geom_text()
This is ok, but the points and labels overlap so we can’t read them!
We’ll need to adjust the position of the labels with the
hjust=
and vjust=
options in the
geom_text()
layer:
Add the following options to the
geom_text()
layer you just added in the code block above to slightly shift their positions off of the points:
geom_text(hjust=-0.2, vjust=0.2) +
Better! However, this can be tedious. Fortunately, there is a package called ggrepel that does this automatically. We won’t worry about installing this now, but I just wanted you to be aware of it.
The “theme” of a plot refers to the style elements not related to the
aesthetics, such as axis font sizes, grid-lines,
margins, etc. Until now, we have been using the default ggplot theme
elements, but every aspect of the theme of a plot is customizable by
adding the catch-all theme()
layer. ggplot also provides
some pre-set
alternate themes as layers.
Let’s try some of these pre-set themes out first.
Exercise: In the code block above, add the following layers onto one of the plots we’ve generated in today’s class, one at a time:
theme_grey()
: This is the default ggplot theme, so
shouldn’t change anythingtheme_bw()
theme_linedraw()
theme_light()
theme_dark()
theme_minimal()
theme_classic()
theme_void()
theme_test()
The theme()
layer is rich with options, which you will
have to explore when generating your own figues. Here are my preferred
theme settings, which are a variant on theme_classic. >
Exercise: > Try adding this theme()
layer to the scatterplot in the code block above:
theme_classic() %+replace%
theme(axis.text=element_text(size=12),
axis.title=element_text(size=16),
axis.title.y=element_text(margin=margin(t=0,r=10,b=0,l=0),color="black",angle=90),
axis.title.x=element_text(margin=margin(t=10,r=0,b=0,l=0),color="black"),
axis.line=element_line(colour='#595959',size=0.75),
axis.ticks=element_line(colour="#595959",size=1),
axis.ticks.length=unit(0.2,"cm"),
legend.position="right",
legend.key.width=unit(0.75, unit="cm"),
legend.spacing.x=unit(0.25, 'cm'),
legend.title=element_blank(),
legend.text=element_text(size=12),
plot.title=element_text(hjust=0.5, size=20),
plot.margin=unit(c(1,1,1,1), "cm")
)
We saw above that facet_wrap()
can provide us with some
rudimentary multi-panel figures, as long as the panels are meant to be
different groupings from the same variable and the same type of plot.
facet_grid()
gives a bit more control over the placement of
the panels, but for true multi-panel figures, that mix data sources and
aesthetics (plot types), we’ll need to look elsewhere. There are many
packages that work with ggplot that give us the ability to group
different graphical objects (or grobs) into grids and
label each panel. The one we’re going to show you today is called cowplot
which implements the function plot_grid()
to combine
grobs.
So, first, we’ll need to install the cowplot package. Let’s do that in the Packages tab on the right.
With cowplot installed, we can combine two of the figures we’ve already generated in the workshop.
Run the following code block to combine our scatter-plot of chromosome length vs. average SV length and the violin plot showing the differences in length between deletions and duplications:
scatter_and_violin <- plot_grid(sv_chr_len_scatter + theme_grey(), sv_len_violin_type_clr, ncol=2, labels=c("A", "B"))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: The following aesthetics were dropped during statistical transformation: label
## i This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## i Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
# Combine the scatter and violin plots into one grob
print(scatter_and_violin)
# Display the figure
Now, let’s switch this from two rows to two columns by changing the
nrow=
option toncol=
.
Looks good! Except maybe that the themes of the two figures don’t match anymore since we changed the scatterplot. Recall that grobs are stored as objects and can therefore be manipulated later.
Let’s change the theme back by adding
+ theme_grey()
as we’re adding the scatterplot to the multi-panel figure inplot_grid()
in the code block above.
We can also combine this multi-panel figure with other figures with
another call to plot_grid()
.
Run the following code to add a row to our figure showing the barplot of proportions of SV types per chromosome:
## When prompted, add plot_grid() for the barplot here
barplot_row <- plot_grid(NULL, sv_count_bar_chr_prop, NULL, ncol=3, labels=c("", "C", ""), rel_widths=c(0.1,1,0.1))
## When prompted, add plot_grid() for the barplot here when prompted
figure_1 <- plot_grid(scatter_and_violin, barplot_row, nrow=2)
# Combine the two panels generated above with the barplot showing proportions
print(figure_1)
# Display the figure
But what if we don’t want the bar plot to take up the whole bottom
row? Well, we can combine two other functions of
plot_grid()
: NULL
panels and relative panel
widths (rel_width()
).
Add the following code to the code block above, and then replace the name of the barplot in the second
plot_grid()
call with the new grob (barplot_row
).
barplot_row <- plot_grid(NULL, sv_count_bar_chr_prop, NULL, ncol=3, labels=c("", "C", ""), rel_widths=c(0.1,1,0.1))
Now that’s a figure (almost) ready for publication!
cowplot has other capabilities that we won’t cover today, such as
plot_grid()
receiving a list of grobs as input (as
opposed to typing the name of each one out) and the
get_legend()
function that allows us to save only the
legend from a particular grob as its own grob, which can then be added
onto a multi-panel figure as a shared legend.