Learning Objectives
- Produce scatter plots, boxplots, density plots, and time series plots using ggplot2.
- Set universal and local plot settings.
- Describe what aesthetics are and how they are used by ggplot().
- Describe what faceting is and apply faceting to a ggplot().
- Modify the aesthetics of an existing ggplot() plot (e.g., axis labels, color).
- Build multivariate and customized plots from data in a data frame.
- Arrange multiple plots in a grid format.
- Export publication ready graphics using ggsave().
Orientation of/for the workshop
This workshop assumes some basic familiarity with working in
R
such as what you might obtain in the “Introduction to R” workshop or in course that usesR
regularly. If you have not interacted withR
previously, some of the assumptions of your background for this workshop might be a barrier. We would recommend getting what you can from this workshop and you can always revisit the materials at a later date after filling in some of those basicR
skills. We all often revisit materials and discover new and deeper aspects of the content that we were not in a position to appreciate in a first exposure.In order to focus this workshop on coding, we developed this interactive website for you to play in a set of “sandboxes” and try your hand at implementing the methods we are discussing. When each code chunk is ready to run (all can be edited, many have the code prepared for you), you can click on “Run Code”. This will run
R
in the background on a server. For the “Challenges”, you can get your answer graded although many have multiple “correct” answers, so don’t be surprised if our “correct” answer differs from yours. The “Solution” is also provided in some cases so you can see a solution - but you will learn more by trying the challenge first before seeing the answer. Each sandbox functions independently, which means that you can pick up working at any place in the documents and re-set your work without impacting other work (this is VERY different from howR
usually works!). Hopefully this allows you to focus on the code and what it does… The “Start over” button can be used on any individual sandbox or you can use the one on the left tile will re-set all the code chunks to the original status.These workshops are taught by Greta Linse and Sara Mannheimer and co-organized by the MSU Library, Statistical Consulting and Research Services (SCRS), and the Department of Mathematical Sciences. More details on us and other workshops are available at the end of the session or via https://www.montana.edu/datascience/training/#workshop-recordings.
A PDF version of this Workshop from last year (only minor differences) is available at: Data Viz R Fall 2022
Data Viz Introduction
ggplot2
is a plotting package that makes it simple to create complex plots from data in a data frame. It provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatter plot. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.
Packages in R
are basically sets of additional functions that let you do more stuff. The functions we’ve used in the previous workshop, like str()
or mean()
, come built into R
; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should import it in every subsequent R
session when you need it. If you were to do this work in RStudio, you would need to install the tidyverse
package. This is an “umbrella-package” that installs several packages useful for data analysis which work together well such as tidyr
, dplyr
, ggplot2
, readr
, forcats
, etc.
The tidyverse
package tries to address common issues that arise when doing data analysis with some of the functions that come with R
.
- The
tidyverse
solves complex problems by combining many simple pieces.
“No matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system.”
— Hal Abelson
- The
tidyverse
is written for people to read!
“Computer efficiency is a secondary concern because the bottleneck in most data analysis is thinking time, not computing time.”
— Hadley Wickham
In this workshop, we have already installed the tidyverse
using install.packages("tidyverse")
. It is important to note that there’s no need to re-install packages every time we run the script.
Then, to load the package include code in your work with:
## load the tidyverse packages
library(tidyverse)
Working with packages was discussed in more detail in the “Introduction to R” workshop. We will proceed through the remaining work with the tidyverse
package installed and loaded.
To learn more about ggplot2
after the workshop, you may want to check out this ggplot2
reference website (link) and this handy cheatsheet on ggplot2
(link).
Presentation of the Survey Data
The data used in this workshop are a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years, but we will focus on the years 1996 to 2002 (n=11332 observations). The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots. This is simplified version of the full data set that has been used in over 100 publications and was provided by the Data Carpentries (https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459/10). We are going to focus on animal species diversity and weights in this workshop. The dataset is stored as a comma separated value (CSV) file.
Each row holds information for a single animal, and the columns represent (along with some others we will not use):
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
We’ll read in our data using the read_csv()
function, from the tidyverse package readr
, instead of read.csv()
.
surveys <- read_csv("https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Data%20Visualization/AY%202020-2021/data/surveys2.csv")
You will see the message Parsed with column specification
, followed by each column name and its data type. When you execute read_csv
on a data file, it looks through the first 1000 rows of each column and guesses the data type for each column as it reads it into R
. For example, in this dataset, read_csv
reads weight
as col_double
(a numeric data type), and species
as col_character
.
## inspect the data
str(surveys)
## Preview the data
View(surveys)
At the top of the str()
output, notice that the class of the data is a tibble. Tibbles tweak some of the behaviors of the data frame objects we introduced in the previous workshop. The data structure is very similar to a data frame, so for our purposes the only differences are that:
- In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
- Columns of class
character
are never converted into factors.
Plotting with ggplot2
ggplot2
functions like data in the ‘long’ format, i.e., a column for every dimension, and a row for every observation. There are other data formats, which we will discuss in the Data Wrangling in R
workshop, as well as how to convert from one data format to another. Well-structured data will save you lots of time when making figures with ggplot2
and when working in R
!
ggplot()
graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot()
, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<VARIABLE MAPPINGS>)) + <GEOM_FUNCTION>()
Let’s go through this step by step!
- Use the
ggplot()
function and bind the plot to a specific data frame using thedata
argument
ggplot(data = surveys)
## Creates a blank ggplot(), referencing the surveys dataset
- Define a mapping (using the aesthetic (
aes
) function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length))
#
# Creates a blank ggplot(), with the variables mapped to the x- and y-axis
# ggplot() knows where the variables live, since you have defined the data to use
Add “geoms” – graphical representations of the data in the plot (points, lines, bars).
ggplot2
offers many different geoms; we will use some common ones today, including:geom_point()
for scatter plots, dot plots, etc.geom_boxplot()
for boxplots
geom_bar()
for bar chartsgeom_line()
for trend lines, time series, etc.
To add a geom to the plot use the
+
operator. Because we have two continuous variables in the data, let’s usegeom_point()
first:
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
# Adds a point for each row (observation) in the data
You can think of the +
sign as adding layers to the plot. Each +
sign must be placed at the end of the line containing the previous layer. If, instead, the +
sign is added at the beginning of the line containing the new layer, ggplot2
will not add the new layer and will return an error message.
# This will not add the new layer and will return an error message
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length))
+ geom_point()
Building Plots Iteratively
Building plots with ggplot2
is typically an iterative process. We start by defining the dataset we’ll use, lay out the axes, and choose a geom:
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()
Then, we start modifying this plot to extract more information from it. For instance, we can add transparency (alpha
) to the points, to avoid overplotting:
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.2)
## alpha reduces the opacity of the points
## 0 is fully transparent
## 1 is the original opacity
We can also add colors for all the points:
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.2, color = "blue")
geom_point
also accepts aesthetics of size and shape. The size of a point is its width in mm. The shape of a point has five different options for plotting:
- an integer [0, 25] of defined plotting characters – same as base
R
- the name of the shape in quotations (e.g. “circle open” or “diamond filled”)
- a single character, to use that character as a plotting symbol
- a “.” to draw the smallest point that is visible – typically 1 pixel
- an NA, to draw nothing
Note: A “filled” shape means that it is filled with another color or is “open” if no fill color is specified. A solid shape has no further specification.
Reference for shapes in integers and characters:
https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
Challenge 1
Copy and paste the code from the previous code chunk and modify it to assign one of these aesthetics to the
geom_point
aspect of your plot.What happened?
## Your ggplot code to answer the challenge goes here!
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.2, color = "blue", shape = "diamond")
Piping Data In
Because ggplot2
lives in the tidyverse
, it is expected to work well with other packages in the tidyverse
. Because of this, the first argument to creating a ggplot()
is the dataset you wish to be working with. The pipe operator sends the output of one function directly into the next function, which is useful when you need to do many things to the same dataset. Since the dataset we wish to use is the first argument to ggplot()
, we can use the pipe operator to pipe the data into the ggplot()
function!
Pipes in R
look like %>%
and are made available via the magrittr
package, installed automatically with the tidyverse
. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
Note: There is now (as of R 4.3.0) a Base R pipe
|>
that works similar to the%>%
pipe operator with minor differences that you may encounter, but since we are working in thetidyverse
we will stick with their pipe operator (%>%
).
This would instead look like this:
surveys %>%
## data to be used in the ggplot
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
## uses the data piped in as the first argument to ggplot()
geom_point(alpha = 0.2, color = "blue")
Once we pipe the data in, the first argument becomes the mapping
of the aesthetics. Technically, we are using the name of this argument, which is why it looks like:
mapping = aes(<VARIABLES>)
When we pipe our data in, the first argument then becomes this mapping
argument.
Assigning More Variables to Aesthetics
To color each species in the plot differently, you could use a vector as an input to the argument color. ggplot2
will provide a different color corresponding to different values in the vector. Here is an example where we color with species_id
:
surveys %>%
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.2, aes(color = species_id))
Note: When specifying an
alpha
for a scatterplot, it automatically uses that samealpha
in the legend. To remedy this you can add:
guides(color = guide_legend(override.aes = list(alpha = 1)))
to your plot. This customizes the legend appearance, similar to what we will see in the customization section.
We can also specify the colors directly inside the mapping provided in the ggplot()
function. This will be seen by any geom layers and the mapping will be determined by the x- and y-axis set up in aes()
.
surveys %>%
ggplot(mapping = aes(x = weight, y = hindfoot_length, color = species_id)) +
geom_point(alpha = 0.2)
Notice that we can change the geom layer and colors will be still determined by species_id
Local Aesthetics versus Global Aesthetics
When you define aesthetics in the ggplot()
function, those mappings hold for every aspect of your plot.
For example, if you chose to add a smoothing line to your plot of weight versus hindfoot length, you would get different lines depending on where you define your color aesthetics.
Globally
surveys %>%
ggplot(mapping = aes(x = weight, y = hindfoot_length, color = species_id)) +
geom_jitter(alpha = 0.2) +
geom_smooth()
## smoothing line for each species_id -- because color is defined globally
Locally
surveys %>%
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
geom_jitter(aes(color = species_id), alpha = 0.2) +
geom_smooth()
## one smoothing line -- no color defined globally
Challenge 2 (Part 1)
Inspect the
geom_point
help file (either go to https://ggplot2.tidyverse.org/reference/geom_point.html or run?geom_point
) to see what other aesthetics are available. Map a new variable from the dataset to another aesthetic in your plot. What happened? Does the aesthetic change if you use a continuous variable versus a categorical/discrete variable?
## Your ggplot() code for the challenge goes here!
surveys %>%
ggplot(mapping = aes(x = weight, y = hindfoot_length)) +
geom_jitter(aes(color = plot_type), alpha = 0.2) +
geom_smooth()
Challenge 2 (Part 2)
Use what you just learned to create a scatter plot of
weight
overplot_id
with data from different plot types being showed in different colors. Is this a good way to show this type of data?
## Your ggplot() code for the challenge goes here!
surveys %>%
ggplot(mapping = aes(x = plot_id, y = weight)) +
geom_jitter(aes(color = plot_type), alpha = 0.2) +
geom_smooth()
Boxplots & Violin Plots
Boxplots provide a visualization of a quantitative variable across different levels of a categorical (grouping) variable. For example, we can use boxplots to visualize the distribution of weight within each species:
surveys %>%
ggplot(mapping = aes(x = species_id, y = weight)) +
geom_boxplot()
By adding points to boxplot, we can have a better idea of the number of measurements and their distribution:
surveys %>%
ggplot(mapping = aes(x = species_id, y = weight)) +
geom_boxplot(alpha = 0) +
## alpha = 0 eliminates the black (possible outlier) points, so they're not plotted twice
geom_jitter(alpha = 0.2, color = "tomato")
## alpha = 0.2 decreases the opacity of the points, to not be too busy
Did you notice how the boxplot layer is behind the jitter layer? What would you change in the code to put the boxplot in front of the points? (We’ll come back to this…)
Challenge 3 (Part 1)
Boxplots are useful summaries, but hide details of the shape of the distribution. For example, if the distribution is bimodal, we would not see it in a boxplot. A superior density plot is the violin plot, where the shape (of the density of points) is drawn.
Replace the box plot with a violin plot. For help see
geom_violin()
. Start with the boxplot we created:
surveys %>%
ggplot(mapping = aes(x = species_id, y = weight)) +
geom_boxplot(alpha = 0) +
geom_jitter(alpha = 0.2, color = "tomato")
## Start with the boxplot we created
## 1. Replace the boxplot with a violin plot. For help, see geom_violin().
## You might need to decrease opacity even more to see the violins (try 0.03)
surveys %>%
ggplot(mapping = aes(x = species_id, y = weight)) +
geom_violin() +
geom_jitter(alpha = 0.03, color = "tomato")
Challenge 3 (Part 2)
So far, we’ve looked at the distribution of weight within species. Let’s try making a new plot to explore the distribution of another variable within each species.
Create a boxplot for
hindfoot_length
. This time overlay the boxplot layer over a jitter layer that shows the actual measurements.
## First: create boxplot for hindfoot_length` overlaid on a jitter layer.
surveys %>%
ggplot(mapping = aes(x = species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(alpha = 0)
Challenge 3 (Part 3)
Now, add color to the data points on your boxplot according to the plot from which the sample was taken (
plot_id
).Hint: Check the class for
plot_id
. Ifplot_id
was a character instead, how would the graph be different?
## Next: add color to the data points on your boxplot according to the
## plot from which the sample was taken (plot_id).
## Hint: Check the class for plot_id`. If plot_id was a character instead,
## how would the graph be different?
surveys %>%
ggplot(mapping = aes(x = species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, mapping = aes(color = plot_id)) +
geom_boxplot(alpha = 0)
## Checking the data type for plot_id
class(surveys$plot_id)
## Creating a new variable named plot_id_chr
## which is the character version of plot_id
surveys <- surveys %>%
mutate(plot_id_chr = as.character(plot_id))
## Using new character plot_id to make a boxplot
surveys %>%
ggplot(mapping = aes(x = species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, mapping = aes(color = plot_id_chr)) +
geom_boxplot(alpha = 0)
Bonus violin plot example (DatasauRus!)
The previous example doesn’t fully illustrate the power of violin plots. This example from the
datasauRus
package (https://www.autodeskresearch.com/publications/samestats) shows five different distributions that have exactly the same summary statistics and boxplots but very different shapes:
data(box_plots)
box_plots_long <- box_plots %>%
pivot_longer(cols=1:5)
box_plots_long <- box_plots_long %>%
rename(Example=name, Response=value)
box_plots_long <- box_plots_long %>%
mutate(Example = factor(Example))
ggplot(box_plots_long, aes(x=Example, y=Response)) +
geom_violin(fill="bisque") +
geom_boxplot(alpha=.6) +
geom_jitter(alpha=0.15, col="tomato") +
theme_minimal() +
scale_x_discrete(labels=c('Trimodal, Right Skew', 'Clumpy', 'Normal', 'Trimodal, Left Skew', 'Bimodal'))
Plotting Single Variables
Distribution Plots (Quantitative Variables)
If we wish to visualize the distribution of a single quantitative variable, our plot changes a bit. Unfortunately, the geom_violin()
function only accepts groups, so we cannot make a violin plot with no groups. Darn it!
But, a violin is simply a density plot that’s been reflected across the y-axis. So, we could likely suffice with a density plot.
To visualize the distribution of rodent weights we could aggregate over all species, years, plots, etc. and produce a single density plot:
surveys %>%
ggplot(mapping = aes(x = weight)) +
geom_density()
The default is an empty density plot, which is largely unsatisfying. By adding a fill = <COLOR>
argument to geom_density()
we can produce a nicer looking plot:
surveys %>%
ggplot(mapping = aes(x = weight)) +
geom_density(fill = "sky blue")
Another frequently used plot for a single quantitative variable is the histogram. The same plot as above can be recreated using geom_histogram()
instead of geom_density()
. However, when you use geom_histogram()
it gives you a warning.
surveys %>%
ggplot(mapping = aes(x = weight)) +
geom_histogram()
What warning do you get and why? Do you get an error like this when you use
hist()
in baseR
?
There is no single right answer for the number of bins. There are some “plug-in” choices for number of bins that can be used, but you are always welcome to explore different numbers of bins to see if features you are seeing persist when you choose more or less bins.
Challenge 4
Use the
bins
argument ingeom_histogram()
to play around with the number of bins in your histogram. Try different numbers of bins to explore how that changes the results!
## Your code to answer the challenge goes here!
surveys %>%
ggplot(aes(x = weight)) +
geom_histogram(fill = "sky blue" , bins = 50)
Bar Charts (Categorical Variables)
At first glimpse, you would think that a bar plot would be simple to create, but bar plots reveal a subtle nuance of the plots we have created thus far. The following bar chart displays the total number of rodents in the surveys
dataset, grouped by their species ID.
surveys %>%
ggplot(mapping = aes(x = species_id)) +
geom_bar()
The x-axis displays the levels of species_id
, a variable in the surveys
dataset. On the y-axis count
is displayed, but count
is not a variable in our dataset! Where did count
come from? Graphs, such as the scatterplots, display the raw values of your data. Other graphs, like bar charts and boxplots, calculate new values (from your data) to plot.
Bar charts and histograms bin your data and then plot the number of observations that fall in each bin.
Boxplots find summaries of your data (min, max, quartiles, median) and plot those summaries in a tidy box, with “potential outliers” (data over 1.5*IQR from Q1 or Q3) plotted as points.
Smoothers (as used in
geom_smooth
) fit a model to your data (you can specify, but we used thegam
(generalized additive model from themgcv
package) default) and then plot the predicted means from that model (with associated 95% confidence intervals).
To calculate each of these summaries of the data, R
uses a different statistical transformation, or stat for short. With a bar chart this looks like the following process:
geom_bar
first looks at the entire data frame
geom_bar
then transforms the data using thecount
statistic
- the
count
statistic returns a data frame with the number of observations (rows) associated with each level ofspecies_id
geom_bar
uses this summary data frame, to build the plot – levels ofspecies_id
are plotted on the x-axis andcount
is plotted on the y-axis
Generally, you can use geoms
and stats
interchangeably. This is because every geom
has a default stat
and vice versa. For example, the following code produces the same output as above:
surveys %>%
ggplot(mapping = aes(x = species_id)) +
stat_count()
If you so wish, you could override the default stat
for that geom
. For example, if you wanted to plot a bar chart of proportions you would use the following code to override the count
stat:
surveys %>%
ggplot(mapping = aes(x = species_id)) +
geom_bar(aes(y = after_stat(prop), group = 1))
Challenge 5
Why do we need to set
group = 1
in the above proportion bar chart? In other words, what is wrong with the plot below?
## What is wrong with this plot?
surveys %>%
ggplot(mapping = aes(x = species_id)) +
geom_bar(aes(y = after_stat(prop)))
Colored and/or Stacked Bar Charts
Another piece of visual appeal to creating a bar chart is the ability to use colors to differentiate the different groups, or to plot two different variables in one bar chart (stacked bar chart). Let’s start with adding color to our bar chart.
Coloring Bars
As we saw before, to add a color aesthetic to the plot we need to map it to a variable. However, if we use the color
option that we used before we get a slightly unsatisfying result.
surveys %>%
ggplot(mapping = aes(x = species_id, color = species_id)) +
geom_bar()
We notice that the color only appears in the outline of the bars. For a bar chart, the aesthetic that we are interested in is the fill
of the bars.
Challenge 6
Change the code below so that each bar is filled with a different color.
surveys %>%
ggplot(mapping = aes(x = species_id, color = species_id)) +
geom_bar()
surveys %>%
ggplot(mapping = aes(x = species_id, fill = species_id)) +
geom_bar()
Stacking Bars
Now suppose you are interested in whether the number of male and female rodents captured differs by species. This would require for you to create a bar plot with two categorical variables. You have two options:
- each of the bars for sex could be stacked within a species OR
- the bars for sex could be side-by-side within a species
Let’s see how the two approaches differ. To stack bars of a second categorical variable we would instead use this second categorical variable as the fill
of the bars. Run these two lines of code and see how they differ.
surveys %>%
ggplot(mapping = aes(x = species_id, fill = sex)) +
geom_bar()
surveys %>%
ggplot(mapping = aes(x = species_id, fill = sex)) +
geom_bar(position = "dodge")
In the first plot, the position was chosen automatically, but in the second plot the position
argument was made explicit. What changes did this make in the plots?
Finally, we can also choose the position
to be fill
for the bars and to fill
the bars based on sex
:
surveys %>%
ggplot(mapping = aes(x = species_id, fill = sex)) +
geom_bar(position = "fill")
Notice that the y-axis label still says “count” instead of “proportion”. We will learn how to change labels later when we discuss Customization.
Time-series Data
Let’s calculate number of counts per year for each genus.
What you will see in Data Wrangling: First we need to group the data and count records within each group!
yearly_counts <- surveys %>%
count(year, genus)
## counts the number of observations (rows) for each year, genus combination and creates a new variable "n"
yearly_counts
Time series data can be visualized as a line plot with years on the x-axis and counts on the y-axis:
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line()
Unfortunately, this does not work because we plotted data for all the genera together. We need to tell ggplot()
to draw a line for each genus by modifying the aesthetic function to include group = genus
:
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n, group = genus)) +
geom_line()
Unfortunately, we can’t tell what line corresponds to which genus. We will be able to distinguish genera in the plot if we add colors (using color
also automatically groups the data):
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line()
Note: When specifying the color for a line graph, you don’t need to use both the
color = <VARIABLE>
argument and thegroup = <VARIABLE>
argument. Both do the same grouping of observations! So you just need to specify thecolor
argument.
Faceting
ggplot2
has a special technique called faceting that allows the user to split one plot into multiple plots based on a categorical variable included in the dataset.
There are two types of facet
functions:
facet_wrap()
arranges a one-dimensional sequence of panels to allow them to cleanly fit on one page – used for one variablefacet_grid()
allows you to form a matrix of rows and columns of panels – used for two variables
Both geometries allow you to specify faceting variables using formula notation or the vars()
function. We will use the formula notation, as you will use this notation when creating models (not covered in this workshop).
This looks like: facet_wrap(facets = ~ facet_variable)
or facet_grid(row_variable ~ col_variable)
.
Let’s start by using facet_wrap()
to make a time series plot for each species:
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = ~ genus)
Now we would like to split the line in each plot by the sex of the rodent captured. To do that we need to make counts in the data frame grouped by year
, species_id
, and sex
:
yearly_sex_counts <- surveys %>%
count(year, species_id, sex)
## counts the number of observations (rows) for each year, species, sex combination
yearly_sex_counts
We can now make the faceted plot by splitting further by sex using color
(within each panel):
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(facets = ~ species_id)
You can also organize the panels only by rows (or only by columns), using the optional nrow
and ncol
arguments:
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id, ncol = 1)
# One column, facet by rows
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id, nrow = 1)
# One row, facet by columns
Now let’s use facet_grid()
to control how panels are organized by both rows and columns:
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(sex ~ species_id)
Challenge 7
Use what you just learned to create a plot that depicts how the average weight of each species changes through the years. Play around with which variable you facet by versus plot by!
## To get you started:
yearly_species_weight <- surveys %>%
group_by(year, species_id) %>%
## Variables to group by
summarize(avg_weight = mean(weight))
## Edit the following ggplot() code for the plot here:
yearly_species_weight %>%
ggplot(mapping = aes(x = year, y = n, color = avg_weight)) +
geom_line() +
facet_wrap(facets = ~ species_id)
## Your ggplot() code for the plot goes here!
## To get you started:
yearly_species_weight <- surveys %>%
## Variables to group by:
group_by(year, species_id) %>%
summarize(avg_weight = mean(weight))
## Coloring by species id
yearly_species_weight %>%
ggplot(mapping = aes(x = year, y = avg_weight, color = species_id)) +
geom_line()
## Faceting by species id
yearly_species_weight %>%
ggplot(mapping = aes(x = year, y = avg_weight, color = species_id)) +
geom_line() +
facet_wrap(~ species_id)
ggplot2
Themes
Usually plots with white background look more readable when printed. Every single component of a ggplot()
graph can be customized using the generic theme()
function, as we will see below. However, there are pre-loaded themes available that change the overall appearance of the graph without much effort.
For example, we can change our previous graph to have a simpler white background using the theme_bw()
function:
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw()
In addition to theme_bw()
, which changes the plot background to white, ggplot2
comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal()
and theme_light()
are popular, and theme_void()
can be useful as a starting point to create a new hand-crafted theme.
The ggthemes package provides a wide variety of options. The ggplot2
extensions website provides a list of packages that extend the capabilities of ggplot2
, including additional themes.
Challenge 8
Use what you just learned to add the plotting background theme of your choosing to the plot you made in Challenge 7!
## Your ggplot() code for the plot goes here!
yearly_species_weight %>%
ggplot(mapping = aes(x = year, y = avg_weight, color = species_id)) +
geom_line() +
theme_classic()
Customization
Take a look at the ggplot2
cheat sheet, and think of ways you could improve the previous plot.
Plot Labels
Now, let’s change names of axes to something more informative than ‘year’ and ‘n’ and add a title to the figure. Label customizations are done using the labs()
function like so:
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw() +
labs(title = "Observed Species Through Time",
x = "Year of Observation",
y = "Number of Rodents",
color = "Sex")
Tip: Wrapping Titles
Sometimes the titles we wish to have for our plots are longer than the space originally allotted. If you create a title and the text is running off the plot you can add a
\n
inside your title to force a line break (\n
stands for new line).
Label & Plot Fonts
Note that it is also possible to change the fonts of your plots. If you are on Windows, you may have to install the extrafont
package, and follow the instructions included in the README for this package.
In the last plot, the axes have more informative names, but their readability can be improved by increasing the font size. This can be done with the generic theme()
function.
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw() +
labs(title = "Observed Species Through Time",
x = "Year of Observation",
y = "Number of Rodents",
color = "Sex") +
theme(text = element_text(size = 16))
## sets ALL the text on the plot to be size 16
Note:
theme_bw()
is a function for a specific theme andtheme()
is a generic function for a variety of different themes!
After our manipulations, you may notice that the values on the x-axis are still not properly readable. Let’s swap the orientation of the labels, so the reader doesn’t have to tilt their head when reading our plot! The coord_flip()
function easily changes the x- and y-axis.
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw() +
labs(title = "Observed Species by \n Year of Observation",
x = "Year of Observation",
y = "Number of Rodents",
color = "Sex") +
theme(text = element_text(size = 16)) +
coord_flip()
This definitely makes the reader tilt their head less! But, the text on the x-axis is a bit too large to separate the numbers. We can specify the text size for each element of the plot independently, if we so wish. This would look something like this:
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
theme_bw() +
labs(title = "Observed Species by Year of Observation",
x = "",
y = "Number of Rodents",
color = "Sex") +
theme(axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 14),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
plot.title = element_text(size = 16)) +
coord_flip()
Legend Position
By default in ggplot2
the legend is positioned on the right hand side. However, you are able to change the position of the legend to the left hand side, the top of the plot, or the bottom of the plot.
This is done by adding a legend.position
theme to the plot’s theme()
’s.
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = "Observed Species by Year of Observation by Sex",
x = "Year of Observation",
y = "Number of Rodents",
color = "Sex") +
theme_bw() +
theme(axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 14),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
plot.title = element_text(size = 14),
legend.position = "top") +
coord_flip()
Removing Grid Lines
By default, the background of a ggplot()
contains both minor and major gridlines. These can make the plot look a bit busy and sometimes difficult for the reader to follow. As you may have guessed, to remove these gridlines, we add another theme to our plot.
This looks like this:
yearly_sex_counts %>%
ggplot(mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = "Observed Species by Year of Observation by Sex",
x = "Year of Observation",
y = "Number of Rodents",
color = "Sex") +
theme(axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 14),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
plot.title = element_text(size = 14),
legend.position = "top",
## New themes for the grid lines
axis.line = element_line(color = "black"),
##
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
coord_flip()
Let’s break these options down!
- The
axis.line
option declares what color the x- and y-axis lines should be. (Change it to a different color, if you don’t believe me!) - The
panel.grid.major
removes the major grid (the one associated with the ticks from the x- and y-axis).
- The
panel.grid.minor
removes the minor grid (the one between the x- and y-axis ticks).
- The
panel.border
removes the border around the plot.
- The
panel.background
performs a similar action totheme_bw()
, but it keeps the border around the facet labels.
Changing Colors
The built in ggplot()
color scheme may not be what you were looking for, but don’t worry! There are many other color palettes available to use!
You can change the colors used by ggplot()
a few different ways.
Manual Specification
Add the scale_color_manual()
or scale_fill_manual()
functions to your plot and directly specify the colors you want to use. You can either:
defining a vector of colors right there (e.g.
values = c("blue", "black", "red", "green")
)creating a vector of colors and storing it in an object and calling it (see below)
# A color deficient friendly palette with grey:
cbPalette_grey <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
# A color deficient friendly palette with black:
cbPalette_blk <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442",
"#0072B2", "#D55E00", "#CC79A7")
surveys %>%
ggplot(aes(x = species_id, y = hindfoot_length, color = genus)) +
geom_boxplot() +
scale_color_manual(values = cbPalette_grey)
surveys %>%
ggplot(aes(x = species_id, y = hindfoot_length, color = genus)) +
geom_boxplot() +
scale_color_manual(values = cbPalette_blk)
Package Specification
Install a package and use it’s available color scales. Popular options include:
RColorBrewer
: usingscale_fill_brewer()
orscale_color_brewer()
viridis
: usingscale_color_viridis_d()
for discrete data,scale_color_viridis_c()
for continuous data, with an inside argument ofoption = <COLOR>
for your chosen color schemeggsci
: usingscale_color_<PALNAME>()
orscale_fill_<PALNAME>()
, where you specify the name of the palette you wish to use (e.g.,scale_color_aaas()
)
Challenge 9
With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio
ggplot2
cheat sheet for inspiration. Here are some ideas:
- See if you can change the thickness of the lines.
- Try using a different color palette
- Can you find a way to change the name of the legend? What about its labels? (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).
## your code for the challenge goes here!
color_palette <- c("#ee4444", #DM
"#670303", #DO
"#6b6b6b", #NL
"#07511b", #OL
"#00c736", #OT
"#9747ec", #PB
"#44cafd", #PE
"#dfe113", #PF
"#091298", #PM
"#562887", #PP
"#d27502", #RM
"#d2028a" #SH
)
ggplot(data = surveys, mapping = aes(x = genus, y = hindfoot_length)) +
geom_jitter(alpha = 0.2, aes(color = species_id)) +
geom_violin(alpha = 0) +
labs(x = "Genus",
y = "Hindfoot Length",
color = "Species ID") +
guides(color = guide_legend(override.aes = list(alpha = 1))) +
theme_bw() +
scale_color_manual(values = color_palette) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Arranging Plots
Faceting is a great tool for splitting one plot into multiple plots, but sometimes you may want to produce a single figure that contains multiple plots using different variables or even different data frames.
The gridExtra
package allows us to combine separate ggplots into a single figure using grid.arrange()
(make sure to scroll down in the window to see all the code):
library(gridExtra)
spp_weight_boxplot <- surveys %>%
ggplot(aes(x = genus, y = weight)) +
geom_violin() +
geom_jitter(color = "tomato", width = 0.2, alpha = 0.2) +
scale_y_log10() +
## log (base 10) transforms the y-axis variable
## (helps to make the plot less skewed)
labs(x = "",
## removes the y-axis label
y = expression(Log[10](Weight))) +
## Expression creates a mathematical expression in the axis label
## the [10] refers to the subscript next to Log
coord_flip() +
theme(axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12),
text = element_text(size = 16))
spp_count_plot <- yearly_counts %>%
ggplot(aes(x = year, y = n, color = genus)) +
geom_line() +
labs(x = "Year",
y = "Abundance",
color = "Genus") +
theme(axis.title.x = element_text(face = "bold", size = 12))
## To make your axis title boldface, this is what you need!
grid.arrange(spp_weight_boxplot, spp_count_plot, ncol = 2, widths = c(4, 6))
## nrow and ncol specify how many rows/columns you want the arranged plots to be in
## widths specify what proportion of the overall plotting area each plot takes up
In addition to the ncol
and nrow
arguments, used to make simple arrangements, there are tools for constructing more complex layouts.
For more assistance arranging plots with grid.arrage()
. We find the following vignette very helpful! https://cran.r-project.org/web/packages/egg/vignettes/Ecosystem.html
Another option for combining plots is the patchwork package. It uses a sort of formula interface for defining the layout of multiple plots. For example, you can get two plots side-by-side in a one row, two column array with p1 + p2
and two plots stacked into two rows and one column with p1 / p2
. This provides both a quick and powerful way to arrange ggplots you have created.
spp_weight_boxplot / spp_count_plot
Exporting Plots
After creating your plot, you can save it to a file in your favorite format. The Export tab in the Plot pane in RStudio will save your plots at low resolution, which will not be accepted by many journals and will not scale well for posters.
Instead, use the ggsave()
function, which allows you easily change the dimension and resolution of your plot by adjusting the appropriate arguments:
width
andheight
: adjust the total plot size in units (“in”, “cm”, or “mm”)- If units are not specified, default is inches.
dpi
: adjusts the plot resolution. This accepts a string or numeric input:- “retina” (320)
- “print” (300)
- “screen” (72)
Make sure you have the fig/
folder in your working directory.
my_plot <- ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(~ species_id) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
coord_flip() +
theme(axis.text.y = element_text(color = "grey20", size = 12),
text = element_text(size = 16))
ggsave("fig/yearly_sex_counts.png", my_plot, width = 15, height = 10)
# This also works for grid.arrange() plots
combo_plot <- grid.arrange(spp_weight_boxplot, spp_count_plot,
ncol = 2, widths = c(4, 6))
ggsave("fig/combo_plot_abun_weight.png", combo_plot, width = 10, dpi = 300)
Note: The parameters
width
andheight
also determine the font size in the saved plot.
Suggestions for your own work
The goal of this workshop was to teach you to write code in R
to learn data visualization using ggplot2
. The first workshop in our series contains more information on how to get started working in R
using RStudio (see http://www.montana.edu/datascience/training/). The code chunks in this interactive document mimic the code chunks you can use on your own projects in RMarkdown but you will need to download and install both R
and RStudio on your own computer.
Interactive Graphics (Bonus Material)
In certain situations, static displays can limit the sorts of information available and do not allow easy interrogation for information on individual aspects of plots. Obviously, most print journals do not have a way to have readers interact with the printed page, but in digital venues there are some possibilities. Of particular interest here are interactive graphics that can function on websites and in blog posts or even in certain presentation formats. One way to do this that leverages the previous work in making ggplot
-style graphics is using the ggplotly
function from the plotly
R package (Sievert, 2020). You can access the 2020 book that goes into more detail on plotly
-style graphics at https://plotly-r.com/.
To use ggplotly
, we wrap that function around a ggplot
object and it will render it in an interactive fashion when the viewer hovers over individual plot components. There are also ways of making plotly
graphs directly using plot_ly
and that may prove easier for some things, for example for making interactive three-dimensional graphs.
Here are two examples that we worked with earlier converted into ggplotly
objects that allow further interrogation of the information displayed:
spp_weight_boxplot <- surveys %>%
ggplot(aes(x = genus, y = weight)) +
geom_violin() +
geom_jitter(color = "tomato", width = 0.2, alpha = 0.2) +
scale_y_log10() +
## log (base 10) transforms the y-axis variable
## (helps to make the plot less skewed)
labs(x = "",
## removes the y-axis label
y = "log10-Weight") +
coord_flip() + #Switches x and y axes
theme(axis.text.y = element_text(size = 12),
axis.text.x = element_text(size = 12),
text = element_text(size = 16)) +
theme_bw()
spp_count_plot <- yearly_counts %>%
ggplot(aes(x = year, y = n, color = genus)) +
geom_line() +
labs(x = "Year",
y = "Abundance",
color = "Genus") +
theme(axis.title.x = element_text(face = "bold", size = 12)) +
theme_bw()
ggplotly(spp_weight_boxplot);ggplotly(spp_count_plot)