Orientation of/for the Workshop
This workshop assumes some basic familiarity with working in R, such as what you might obtain in the "Introduction to R" workshop or in a statistics course that uses R heavily, such as STAT 217 or STAT 411/511. If you have not interacted with R previously, some of the assumptions about your background for this workshop might be a barrier. We recommend getting what you can from this workshop; you can always revisit the materials at a later date after filling in some of those basic R skills. We all often revisit materials and discover new and deeper aspects of the content that we were not in a position to appreciate in a first exposure.
In order to focus this workshop on coding, we developed this interactive website for you to play in a set of "sandboxes" and try your hand at implementing the methods we are discussing. When each code chunk is ready to run (all can be edited, and many have the code prepared for you), you can click on "Run Code". This runs R in the background on a server. For the "Challenges", you can get your answer graded, although many have multiple "correct" answers, so don't be surprised if our "correct" answer differs from yours. The "Solution" is also provided in some cases so you can see one possible answer, but you will learn more by trying the challenge before looking at it. Each sandbox functions independently, which means that you can pick up working at any place in the documents and reset your work without impacting other work (this is VERY different from how R usually works!). Hopefully this allows you to focus on the code and what it does. The "Start over" button can be used on any individual sandbox, or you can use the one on the left tile to reset all the code chunks to their original status.
These workshops are taught by Ava Yazdian, Sally Slipher, Greta Linse, and Sara Mannheimer and co-organized by the MSU Library and Social Data Collection and Analysis Services (Social Data). More details on us and other workshops are available at the end of the session or via https://www.montana.edu/datascience/training/#workshop-recordings.
Let’s get started!
Learning Objectives
- Describe the purpose of the dplyr and tidyr packages.
- Select certain columns in a data frame with the dplyr function select().
- Select certain rows in a data frame according to filtering conditions with the dplyr function filter().
- Link the output of one dplyr function to the input of another function with the 'pipe' operator %>%.
- Add new columns to a data frame that are functions of existing columns with mutate().
- Use the split-apply-combine concept for producing data summaries.
- Use summarize(), group_by(), and count() to split a data frame into groups of observations, apply summary statistics for each group, and then combine the results.
- Describe the concept of a wide and a long table format and for which purpose those formats are useful.
- Describe what key-value pairs are.
- Reshape a data frame from long to wide format and back with the pivot_wider() and pivot_longer() commands from the tidyr package.
- Export a data frame to a .csv file.
Data Wrangling using dplyr & tidyr
Intro
Note that we're not using "data manipulation" for this workshop, but are calling it "data wrangling." To us, "data manipulation" is a term that captures the event where a researcher manipulates their data (e.g., moving columns, deleting rows, merging data files) in a non-reproducible manner. With data wrangling, the same processes are carried out, but in a reproducible manner, such as in an R script!
Packages in R are sets of additional functions that let you do more stuff. The functions we've been using so far, like str() or data.frame(), come built into R; packages give you access to more of them. Before you use a package for the first time you need to install it on your machine, and then you should load it in every subsequent R session when you need it. You should already have installed the tidyverse package. This is an "umbrella-package" that installs several packages useful for data analysis that work well together, such as tidyr, dplyr, ggplot2, tibble, etc.
The tidyverse package tries to address 3 common issues that arise when doing data analysis with some of the functions that come with R:
- The results from a base R function sometimes depend on the type of data.
- Using R expressions in a non-standard way, which can be confusing for new learners.
- Hidden arguments, having default operations that new learners are not aware of.
We have seen in our previous lesson that when building or importing a data frame, the columns that contain characters (i.e., text) are coerced (= converted) into the factor data type. We had to set stringsAsFactors to FALSE to keep this hidden argument from converting our data type.
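As a reminder of what that looked like in base R, here is a minimal sketch (using a hypothetical local file name, not one of the workshop files):
## In older versions of R (before 4.0), read.csv() converted character columns
## to factors unless we overrode the default:
animals <- read.csv("animals.csv", stringsAsFactors = FALSE)
str(animals)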
This time we will use the tidyverse package to read the data and avoid having to set stringsAsFactors to FALSE.
In order to install the tidyverse package, you can type install.packages("tidyverse") straight into the RStudio console. In fact, it's better to write this in the console than in a script for any package, as there's no need to re-install packages every time we run the script. If you work in a .Rmd (R Markdown) format, any missing packages will be identified and a prompt added to the top of the document about installing the packages.
Then, to load the package we would need to type:
## load the tidyverse packages -- including dplyr, tidyr, readr, stringr
library(tidyverse)
What are dplyr and tidyr?
The package dplyr is built to work directly with data frames, with many common tasks optimized by being written in a compiled language (C++). An additional feature is the ability to work directly with data stored in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory, so the amount of data you can work with is limited by the available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
The package tidyr addresses the common problem of wanting to reshape your data for plotting and use by different R functions. Sometimes we want data sets where we have one row per measurement. Sometimes we want a data frame where each measurement type has its own column, and rows are instead more aggregated groups - like plots or aquaria. Moving back and forth between these formats is nontrivial, and tidyr gives you tools for this and more sophisticated data wrangling.
To learn more about dplyr and tidyr after the workshop, you may want to check out this handy data transformation with dplyr cheatsheet and this cheatsheet about tidyr.
Presentation of the Survey Data
The data used in this workshop are a time-series for a small mammal community in southern Arizona. This is part of a project studying the effects of rodents and ants on the plant community that has been running for almost 40 years, but we will focus on the years 1996 to 2002 (n=11332 observations). The rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots. This is a simplified version of the full data set that has been used in over 100 publications and was provided by the Data Carpentries (https://datacarpentry.org/ecology-workshop/index.html) (Link to data: https://figshare.com/articles/dataset/Portal_Project_Teaching_Database/1314459). We are investigating the animal species diversity and weights found within plots in this workshop. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular plot |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
We'll read in our data using the read_csv() function, from the tidyverse package readr, instead of read.csv().
surveys <- read_csv("https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Data%20Wrangling/AY%202022-2023/data_wrangling_learnr/data/surveys2_subset.csv")
You will see the message Column specification, followed by each column name and its data type. When you execute read_csv() on a data file, it looks through the first 1000 rows of each column and guesses the data type for each column as it reads it into R. For example, in this dataset, read_csv() reads weight as col_double (a numeric data type), and species_id as col_character. You have the option to specify the data type for a column manually by using the col_types argument in read_csv().
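For instance, here is a minimal sketch of overriding the guesses for a few columns (these particular type choices are just an illustration, not something the workshop requires; columns not listed are still guessed):
## Manually specify selected column types with col_types = cols(...)
surveys_typed <- read_csv(
  "https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Data%20Wrangling/AY%202022-2023/data_wrangling_learnr/data/surveys2_subset.csv",
  col_types = cols(species_id = col_character(),
                   weight = col_double(),
                   year = col_integer()))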
## inspect the data
glimpse(surveys)
## Preview the data (opens a spreadsheet-like interface in RStudio)
View(surveys)
The data set is stored as a “tibble”. Tibbles tweak some of the behaviors of the data frame objects we introduced previously. The data structure is very similar to a data frame. For our purposes the only differences are that:
- In addition to displaying the data type of each column under its name, it only prints the first few rows of data and only as many columns as fit on one screen.
- Columns of class character are never converted into factors.
We're going to learn some of the most common dplyr functions:
- select(): subset columns
- filter(): subset rows on conditions
- mutate(): create new columns by using information from other columns
- group_by() and summarize(): create summary statistics on grouped data
- arrange(): sort results
- count(): count discrete values
Select, Filter, and Mutate
Selecting Columns and Filtering Rows
To select columns of a data frame, use select(). The first argument to this function is the data frame (surveys), and the subsequent arguments are the columns to keep.
Modify the following code to select the plot_id, species_id, and weight columns from the surveys dataset:
select(surveys)
select(surveys, plot_id, species_id, weight)
To select all columns except certain ones, put a “-” in front of the variable to exclude it.
Modify the following code to select all columns except record_id and species_id:
select(surveys)
select(surveys, -record_id, -species_id)
This will select all the variables in surveys except record_id and species_id.
To choose rows based on specific criteria, use filter():
filter(surveys, year == 1999)
In the code above, == keeps all rows where the year is 1999.
Other filtering options include !=, which keeps all rows that do not meet the criterion; the comma (,), which means "and"; and |, which means "or". filter() can also use < for "less than", > for "greater than", <= for "less than or equal to", and >= for "greater than or equal to". We type these last two the same way we would typically say them: the inequality symbol first, then the equals sign.
!= example:
filter(surveys, year != 1999)
The code above keeps all rows where the year is not 1999.
, example:
filter(surveys, year == 1999, plot_id == 2)
The code above keeps all rows where the year is 1999 for plot id 2, i.e., year 1999 and plot 2. The rows meet both of these criteria.
| example:
filter(surveys, year == 1999 | plot_id == 2)
The code above keeps all rows where the year is 1999 or the plot id is 2, i.e., year 1999 or plot 2. The rows meet at least one of these criteria (possibly both).
< example:
filter(surveys, weight < 8)
The code above keeps all rows where weight is less than 8.
> example:
filter(surveys, hindfoot_length > 30)
The code above keeps all rows where hindfoot length is greater than 30.
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
surveys2 <- filter(surveys, weight < 6)
surveys_sml <- select(surveys2, species_id, sex, weight)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e., one function inside of another), like this:
surveys_sml <- select(filter(surveys, weight < 6), species_id, sex, weight)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, is a more recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, which is installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M on a PC or Cmd + Shift + M on a Mac.
surveys %>%
filter(weight < 6) %>%
select(species_id, sex, weight)
In the above code, we use the pipe to send the surveys dataset first through filter() to keep rows where weight is less than 6, then through select() to keep only the species_id, sex, and weight columns. Since %>% takes the object on its left and passes it as the first argument to the function on its right, we don't need to explicitly include the data frame as an argument to the filter() and select() functions any more.
Some may find it helpful to read the pipe like the word "then". For instance, in the above example, we took the data frame surveys, then we filtered for rows with weight < 6, then we selected the columns species_id, sex, and weight. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex wrangling of data frames.
If we want to create a new object with this smaller version of the data, we can assign it a new name:
surveys_sml <- surveys %>%
filter(weight < 6) %>%
select(species_id, sex, weight)
surveys_sml
Note that the final data frame is the leftmost part of this expression.
Challenge 1
Using pipes, subset the surveys data to include:
- animals collected on or after 2001 and
- retain only the columns year, sex, and weight.
## Pipes Challenge:
## Using pipes, subset the data to include animals collected
## on or after 2001, and retain the columns `year`, `sex`, and `weight.`
surveys %>%
filter(year >= 2001) %>%
select(year, sex, weight)
Mutate
Frequently you'll want to create new columns based on the values in existing columns, for example to do unit conversions, or to find the ratio of values in two columns. For this we'll use mutate().
To create a new column of weight in kg from weight in grams:
surveys %>%
mutate(weight_kg = weight / 1000)
You can also create a second new column based on the first new column within the same call of mutate():
surveys %>%
mutate(weight_kg = weight / 1000,
weight_lb = weight_kg * 2.2)
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data. (Pipes work with non-dplyr functions, too, as long as the dplyr or magrittr package is loaded.)
surveys %>%
mutate(weight_kg = weight / 1000) %>%
head()
The first few rows of the data set contain some missing observations (NAs). If we wanted to remove any observations where there were missing values on weight, we could insert a filter() in the chain:
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
is.na() is a function that determines whether something is an NA. The ! symbol negates the result, so in the code above we're asking for every row where weight is not an NA.
Challenge 2
Create a new data frame from the surveys data named surveys_hindfoot_cm that meets the following criteria:
- contains only the species_id column and
- a new column called hindfoot_cm containing the hindfoot_length values converted to centimeters (they are in mm).
- Make sure that you only retain values in the hindfoot_cm column that are not missing (not NA) and are less than 3 cm.
- Then print out the head() of the new data frame.

Hint: think about how the commands should be ordered to produce this data frame!
## Mutate Challenge:
## Create a new data frame from the `surveys` data named `surveys_hindfoot_cm`
## that meets the following criteria:
## * contains only the `species_id` column and
## * a new column called `hindfoot_cm` containing the `hindfoot_length` values
## converted to centimeters.
## * Make sure that you only retain values in the hindfoot_cm column that are
## not missing (not NA) and are less than 3 cm.
## Then print out the head of the new data frame.
## Hint: think about how the commands should be ordered to produce this data frame!
surveys_hindfoot_cm <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
mutate(hindfoot_cm = hindfoot_length/10) %>%
filter(hindfoot_cm < 3) %>%
select(species_id, hindfoot_cm)
surveys_hindfoot_cm %>%
head()
Using lubridate for Dates
Date-time data can be frustrating to work with in R, since R commands for date-times are generally un-intuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time-related quirks, and R lacks these capabilities in some situations. The lubridate package makes it easier to do the things R does with date-times and possible to do things that base R does not.
The lubridate package has functions that handle easy parsing of dates, such as:
- ymd()
- dmy()
- mdy()
library(lubridate)
today() # Today's date
now() # Today's date, with time and timezone!
surveys_w_days <- surveys %>%
mutate(date = ymd(paste(year,
month,
day,
sep = "-")
),
day_of_week = wday(date, label = TRUE)
## Creating a day of the week variable
## label = TRUE prints the day's name, not its number!
)
surveys_w_days %>%
head()
surveys_w_days %>%
select(day_of_week) %>%
summary()
surveys_w_days %>%
filter(is.na(date) == TRUE) %>%
select(month, day) %>%
table()
Challenge 3
- What dates were unable to be converted?
- Explore the results and objects in the previous sandbox to figure out why that happened.
We can pull off components of dates using a large array of lubridate functions, such as:
- year()
- month()
- mday()
- hour()
- minute()
- second()
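For instance, a short sketch applying a couple of these accessors to the date column created above (the new column names here are just for illustration):
## Pull components back out of the date column created earlier
surveys_w_days %>%
  mutate(obs_year = year(date),
         obs_month = month(date, label = TRUE)) %>% ## label = TRUE gives month names
  select(date, obs_year, obs_month, day_of_week) %>%
  head()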
For additional information about lubridate, visit the lubridate reference website or look over the lubridate cheatsheet.
Character Wrangling
If we inspect the day of week variable we created in the last code chunk, we’ll see that it is an ordered (<ord>) factor.
Challenge 4
What are the names of the days of the week taken from the dates?
levels(surveys_w_days$day_of_week)
The case_when() Function
We notice that the labels for the days of the week are not necessarily what we would like to have for a graphical display of our data. To reword the names of the days of the week, we can use the case_when() function from dplyr.
The case_when() function can be thought of as a "generalized form for multiple if_else() statements." We talked about ifelse() statements in the Intermediate R workshop, but let's break them down here to review.
For case_when(), the inputs are sequences of two-sided formulas. The left-hand side finds the values that match the case and the right-hand side says what should be done with these matches.
Let’s look at this in action!
surveys_days_full <- surveys_w_days %>%
mutate(day_of_week = case_when(day_of_week == "Mon" ~ "Monday",
day_of_week == "Tue" ~ "Tuesday",
day_of_week == "Wed" ~ "Wednesday",
day_of_week == "Thu" ~ "Thursday",
day_of_week == "Fri" ~ "Friday",
day_of_week == "Sat" ~ "Saturday",
day_of_week == "Sun" ~ "Sunday")
)
glimpse(surveys_days_full$day_of_week)
NOTE:
If you only want to recode a couple levels of a variable, you can still use case_when() without specifying the behavior for ALL levels. See the example below:
# Create a variable weekday that takes on a value of 0 for Saturday/Sunday
# and 1 otherwise and recodes Friday to missing
surveys_weekday <- surveys_w_days %>%
mutate(weekday = case_when(day_of_week == "Sat" ~ 0,
day_of_week == "Sun" ~ 0,
day_of_week == "Fri" ~ as.numeric(NA),
TRUE ~ 1))
surveys_weekday %>%
count(weekday)
But perhaps these days are not in the order we want them to be in.
Challenge 5
What order did R put the days of the week in? What data type is day_of_week now?
table(surveys_days_full$day_of_week)
typeof(surveys_days_full$day_of_week)
There are small differences between character data types and factor data types. Typically, R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. However, factors are often difficult to work with. Enter the forcats package, whose goal is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values.
The order of the levels chosen by R may not be what we wanted, but we can reorder them using the fct_relevel() function from the forcats package (see the forcats cheatsheet). The function takes three arguments:
- the data
- the factor to be reordered
- the order of the new levels separated by commas
This process looks like this:
surveys_edited <- surveys_days_full %>%
mutate(day_of_week = fct_relevel(day_of_week,
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday",
"Sunday")
)
glimpse(surveys_edited$day_of_week)
Challenge 6
Verify that R puts the days in the order that you specified!
levels(surveys_edited$day_of_week)
Split-Apply-Combine Data Analysis
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function.
The summarize() Function
group_by() is often used together with summarize(), which collapses each group into a single-row summary of that group. group_by() takes as arguments the column names that contain the categorical variables for which you want to calculate the summary statistics. So to compute the mean weight by sex:
surveys_edited %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
One of the advantages of tbl_df over data frame is that it provides more compact output, although the current format of these materials makes that hard to see.
You can also group by multiple columns:
surveys_edited %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
When grouping both by sex and species_id, the last row is for animals that escaped before their sex and body weights could be determined. You may notice that the last column does not contain NA but NaN (which refers to "Not a Number"). To avoid this, we can remove the missing values for weight before we attempt to calculate the summary statistics on weight. Because the missing values are removed first, we can omit na.rm = TRUE when computing the mean:
surveys_edited %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
If you want to display more data, you can use the print() function at the end of your chain with the argument n specifying the number of rows to display:
surveys_edited %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
surveys_edited %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
It is sometimes useful to rearrange the result of a query to inspect the values. For instance, we can sort on min_weight to put the lighter species first:
surveys_edited %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(min_weight)
To sort in descending order, we need to add the desc() function. If we want to sort the results by decreasing order of mean weight:
surveys_edited %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(desc(mean_weight))
Challenge 7 Part 1:
Use group_by() and summarize() to find the mean, min, and max hindfoot length for each species (using species_id). Also add the number of observations. (HINT: see ?n.)
surveys_edited %>%
filter(!is.na(hindfoot_length)) %>%
group_by(species_id) %>%
summarize(mean_hf_length = mean(hindfoot_length),
min_hf_length = min(hindfoot_length),
max_hf_length = max(hindfoot_length),
num_obs = n())
Challenge 7 Part 2:
What was the heaviest animal measured in each year?
Return the columns year and weight.
surveys_edited %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
summarize(max_weight = max(weight)) %>%
select(year, max_weight)
Counting
When working with data, we often want to know the number of observations found for each factor or combination of factors. For this task, dplyr provides count(). For example, if we wanted to count the number of rows of data for each sex, we would do:
surveys_edited %>%
count(sex)
The count() function is shorthand for something we've already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, surveys_edited %>% count(sex) is equivalent to:
surveys_edited %>%
group_by(sex) %>%
summarize(count = n())
For convenience, count() provides the sort argument:
surveys_edited %>%
count(sex, sort = TRUE)
The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., sex). If we wanted to count a combination of factors, such as sex and species, we would specify the first and the second factors as the arguments of count():
surveys_edited %>%
count(sex, species_id)
With the above code, we can proceed with arrange() to sort the table according to a number of criteria so that we have a better way to compare groups. For instance, we might want to arrange the table above (i) in alphabetical order of the levels of the species and (ii) in descending order of the count:
surveys_edited %>%
count(sex, species_id) %>%
arrange(species_id, desc(n))
From the table above, we may learn that, for instance, there are 72 observations of the albigula species (species_id = "NL") for males.
Challenge 8:
How many animals were caught in each plot (plot_id) surveyed?
## Count Challenge:
## How many animals were caught in each plot (`plot_id`) surveyed?
surveys_edited %>%
count(plot_id)
Relational Data with dplyr
It is rare that data analyses, especially with longitudinal measurements, involve only a single table of data. More typically, you have multiple tables of data describing different aspects of your study. When you embark on analyzing your data, these different data tables need to be combined. Collectively, multiple tables of data are called relational data, as the data tables are not independent; rather, they relate to each other.
Relations are defined between a pair of data tables. There are three families of joining operations: mutating joins, filtering joins, and set operations. Today we will focus on mutating joins.
The survey data have two other data tables they are related to: plots and species. Load in these data and inspect them to get an idea of how they relate to the survey data we've been working with.
plots <- read_csv("https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Data%20Wrangling/AY%202020-2021/data_wrangling_learnr/data/plots.csv")
head(plots)
Column | Description |
---|---|
plot_id | ID of a particular plot |
plot_type | type of plot |
species <- read_csv("https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Data%20Wrangling/AY%202020-2021/data_wrangling_learnr/data/species.csv")
head(species)
Column | Description |
---|---|
species_id | 2-letter code |
genus | genus of animal |
species | species of animal |
taxon | e.g. Rodent, Reptile, Bird, Rabbit |
The variables used to connect a pair of tables are called keys. A key is a variable that uniquely identifies an observation in that table. What are the keys for each of the three data tables? (hint: What combination of variables uniquely identifies a row in that data frame?)
There are two types of keys:
- A primary key uniquely identifies an observation in its own table.
- A foreign key uniquely identifies an observation in another table.
A primary key and the corresponding foreign key form a relation between the two data tables. These relations are typically many-to-one, though they can be 1-to-1. For example, there are many rodents captured that are of one species_id, hence a many-to-one relationship.
For me, the easiest way to think about the relationships between the different data tables is to draw a picture of which keys connect which tables.
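If you prefer code to pictures, you can also check a candidate key directly: a key should have as many distinct values as its table has rows. A minimal sketch using the tables loaded above (this check is just illustrative, not part of the workshop pipeline):
## A candidate key should uniquely identify each row of its own table
surveys_edited %>% summarize(n_rows = n(), distinct_record_ids = n_distinct(record_id))
plots %>% summarize(n_rows = n(), distinct_plot_ids = n_distinct(plot_id))
species %>% summarize(n_rows = n(), distinct_species_ids = n_distinct(species_id))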
Joining Relational Data
The tool that we will be using is called a mutating join. A mutating join is how we can combine variables from two tables. The join matches observations by their keys, and then copies variables from one table to the other. Similar to mutate(), these join functions add variables to the right of the existing data frame, hence their name. There are two types of mutating joins: the inner join and the outer join.
Inner Join
The simplest join is an inner join, which creates a pair of observations whenever their keys are equal. This join will output a new data frame that contains the key, the values of x, and the values of y. Importantly, this join deletes observations that do not have a match.
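As a brief illustration with the tables we just loaded (we will build the full combined dataset with left joins below), an inner join keeps only the survey records whose species_id also appears in the species table:
## Inner join: survey records without a matching species_id are dropped
surveys_edited %>%
  inner_join(species, by = "species_id") %>%
  glimpse()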
Outer Join
While an inner join only keeps observations with keys that appear in both tables, an outer join keeps observations that appear in at least one of the data tables. When joining x with y, there are three types of outer join:
- A left join keeps all of the observations in x.
- A right join keeps all of the observations in y.
- A full join keeps all of the observations in both x and y.
The left join is the most common, as you typically have a data frame (x) to which you wish to add additional information (the contents of y). This join will preserve the contents of x, even if there is not a match for them in y.
Joining surveys_edited Data
To join the surveys_edited data with the plots data and species data, we will need two join statements. As we are interested in adding this information to our already existing data frame, surveys_edited, a left join is the most appropriate.
combined <- surveys_edited %>%
left_join(plots, by = "plot_id") %>% # adding the type of plot
left_join(species, by = "species_id") # adding the genus, species, and taxa
glimpse(combined)
If the keys being used have different names in the data tables, you can use by = c("a" = "b"), where a is the key name in the x data set and b is the name in the y data set. Or you could rename the variables so that they do match prior to using left_join().
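For example, if the plots table had instead called its key plot (a hypothetical name, not the actual column in these data), the join could be written with differently named keys. A small sketch that builds that situation on purpose:
## Hypothetical: rename the key in a copy of `plots`, then join on keys with
## different names using by = c("x name" = "y name")
plots_renamed <- plots %>% rename(plot = plot_id)
surveys_edited %>%
  left_join(plots_renamed, by = c("plot_id" = "plot")) %>%
  glimpse()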
Reshaping Data
Data Carpentry's spreadsheet lesson discusses how to structure our data, leading to the four rules defining a tidy dataset:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
- Each type of observational unit forms a table
Here we examine the fourth rule: Each type of observational unit forms a table.
In surveys_edited, each row contains the values of variables associated with each record (the unit), values such as the weight or sex of the animal for that record. What if, instead of comparing records, we wanted to compare the different mean weights of each genus between plots? (Ignoring plot_type for simplicity.)
We'd need to create a new table where each row (the unit) is comprised of values of variables associated with each plot. In practical terms this means the values in genus would become the names of column variables and the cells would contain the values of the mean weight observed on each plot.
Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.
The opposite transformation would be to transform column names into values of a variable.
We can do both of these transformations with two tidyr functions, pivot_longer() and pivot_wider().
Pivoting to a Wider Table
pivot_wider() takes three principal arguments:
- the data
- the column whose values will become new column names.
- the column whose values will fill the new columns.
Further arguments include values_fill which, if set, fills in missing values with the value provided.
Let's use pivot_wider() to transform surveys to find the mean weight of each genus in each plot over the entire survey period. We use filter(), group_by(), and summarize() to filter our observations and variables of interest, and to create a new variable for the mean_weight. We use the pipe as before too.
surveys_gw <- combined %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarize(mean_weight = mean(weight))
glimpse(surveys_gw)
surveys_gw %>%
head()
This yields surveys_gw, where the observations for each plot are spread across multiple rows: 164 observations of 3 variables.
Using pivot_wider() to pivot on genus with values from mean_weight, this becomes 24 observations of 9 variables, one row for each plot. We again use pipes:
surveys_wide <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight)
glimpse(surveys_wide)
surveys_wide %>%
head()
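If some plot-by-genus combinations had no observations, the corresponding cells would be filled with NA. Here is a short sketch of using the values_fill argument to substitute 0 instead (whether 0 is a sensible stand-in depends on your analysis):
## Fill cells for missing plot-by-genus combinations with 0 instead of NA
surveys_gw %>%
  pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>%
  head()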
Challenge 9:
Pivot the combined data frame to a wide format, with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It's a powerful function! See ?n_distinct or go to https://dplyr.tidyverse.org/reference/n_distinct.html for more information.

Save the wide dataset as an object with an intuitive name! Then use glimpse() to take a look at the structure.
## Make a wide data frame by pivoting on year.
## Fill the values in these columns with the number of genera per plot.
## Make sure to save the new dataset with an intuitive name!
surveys_wide_genera <- combined %>%
group_by(plot_id, year) %>%
summarize(num_genera = n_distinct(genus)) %>%
pivot_wider(names_from = year, values_from = num_genera)
glimpse(surveys_wide_genera)
Pivoting to a Longer Table
The opposing situation could have occurred if we had been provided with data in the form of surveys_wide, where the genus names are column names, but we wish to treat them as values of a genus variable instead. This task is extremely common with longitudinal data, where the columns are the measurement events over time on the same variable and the rows are the locations or subjects, and we want to align all the responses in one long vector for plotting (e.g., with ggplot) or analyses.
In this situation we are gathering the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
pivot_longer() takes four principal arguments:
- the data
- the columns we wish to pivot into a single column
- the name of the new column to create to store the names of each selected column
- the name of the new column to create to store the data filled in each cell
To recreate surveys_gw from surveys_wide, we would create a names column called genus and a values column called mean_weight, and pivot all columns except plot_id. Here we exclude the plot_id column with a minus sign.
surveys_long <- surveys_wide %>%
pivot_longer(cols = -plot_id, names_to = "genus", values_to = "mean_weight")
glimpse(surveys_long)
Note that now the NA genera are included in the re-gathered format. Pivoting your data to a wide format and then pivoting to a long format can be a useful way to balance out a dataset so every replicate has the same composition and you can see where you could have obtained observations.
We could also have used a specification for what columns to include. This can be useful if you have a large number of identifying columns, and it's easier to specify what to pivot than what to leave alone. And if the columns are in a row, we don't even need to list them all out - just use the : operator!
surveys_wide %>%
pivot_longer(cols = Chaetodipus:Sigmodon, names_to = "genus",
values_to = "mean_weight") %>%
head()
Challenge 10
Take the surveys_wide_genera dataset and use pivot_longer() to pivot it to the long format it was in before, so that each row is a unique plot_id by year combination.

HINT: The year column names look like numbers, so you need to use backticks ("`") to indicate they are variable names instead of numbers.
## Now take the surveys_wide_genera dataset, and make it long again, by
## (re)pivoting on the year columns.
names(surveys_wide_genera)
## Now take the wide dataset, and make it long again, by (re)pivoting on the
## year columns.
names(surveys_wide_genera)
surveys_wide_genera %>%
pivot_longer(cols=`1996`:`2002`, names_to = "year",
values_to = "num_genera")
Challenge 11 Part 1:
The combined data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types.

Let's walk through a common solution for this type of problem.
First, use pivot_longer() to create a dataset called combined_longer where we have a names column called measurement and a values column that takes on the value of either hindfoot_length or weight.

HINT: You'll need to specify which columns to pivot into a longer format!
## Use pivot_longer() to create an even longer dataset.
## Create a column called measurement, containing the hindfoot and weight columns
## And a value column that takes on the value of either of these measurements
## Hint: You'll need to specify which columns are being used to pivot!
combined_longer <- combined %>%
pivot_longer(cols = c(hindfoot_length, weight), names_to = "measurement",
values_to = "values")
glimpse(combined_longer)
Challenge 11 Part 2:
With this new data set, combined_longer, calculate the average of each measurement in each year for each different plot_type.

Then pivot these summaries into a data set with a column for hindfoot_length and weight.

HINT: This sounds like you want to pivot the data to be a wider format!
## With this new very long data set, calculate the average of each
## measurement in each year for each different plot_type.
## Now pivot these summaries into a wide data set.
## With a columns for hindfoot_length and weight.
## Filled with the summary values you calculated.
measurement_averages <- combined_longer %>%
group_by(year, plot_type, measurement) %>%
summarize(avg_measure = mean(values, na.rm = TRUE)) ## remove NAs so averages are not NA
measurement_avg_wide <- measurement_averages %>%
pivot_wider(names_from = measurement, values_from = avg_measure)
Exporting Data
Now that you have learned how to use dplyr to extract information from or summarize your raw data, you may want to export these new data sets to share them with your collaborators or to archive them.
Similar to the read_csv() function used for reading CSV files into R, there is a write_csv() function that generates CSV files from data frames.
Before using write_csv(), it is good to create a new folder, data, in our working directory that will store the generated datasets. It is best to avoid writing generated datasets to the same directory as our raw data, as that may create confusion later about which data set was the source and which was the "wrangled" version, so it is good practice to keep them separate. The data_raw folder should only contain the raw, unaltered data, and should be left alone to make sure we don't delete or modify it. In contrast, our script will generate the contents of the data directory, so even if the files it contains are deleted, we can always re-generate them. If you were working in RStudio, the following code would create a data directory inside the folder where your .Rmd exists.
if(!dir.exists("data")){dir.create("data")}
For future use, we might want to prepare a cleaned up version of the dataset that doesn’t include any missing data.
Let's start by removing observations of animals for which weight or hindfoot_length is missing, or the sex has not been determined:
surveys_complete <- surveys_edited %>%
filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
glimpse(surveys_complete)
If we were interested in plotting how species abundances have changed through time, we might also want to remove observations for rare species (i.e., that have been observed less than 50 times). We will do this in two steps: first we are going to create a data set that counts how often each species has been observed, and filter out the rare species; then, we will extract only the observations for these more common species:
## Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>%
filter(n >= 50)
## Only keep the most common species
surveys_complete_subset <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
## using the relational operator %in%
glimpse(surveys_complete_subset)
We can check that surveys_complete_subset has 11266 rows and 11 columns by typing dim(surveys_complete_subset) in the previous sandbox.
Now that our data set is ready, we can save it as a CSV file in our data folder.
write_csv(surveys_complete_subset, file = "data/surveys_complete_subset.csv")
Happy wrangling!
Additional Practice
Challenge 11 Bonus:
If you attended the Data Visualization workshop, make a plot of average hindfoot_lengths and weights with colors for the points based on the plot_type.
measurement_avg_wide %>%
ggplot(aes(x=hindfoot_length, y=weight, colour=plot_type)) +
geom_jitter() +
theme_bw()
Other Workshops in the Series
The goal of this workshop was to teach you to write code in R to perform data wrangling in a reproducible fashion. Recordings of the previous workshops are available at http://www.montana.edu/datascience/training/. We plan to offer this same series in the spring (dates to be determined), so these web-interface "Shiny" apps will continue to evolve and links might change. If they are not available when you try to revisit them, please contact one of the authors and we can point you to the current versions of them.
The first workshop in our series contains more information on how to get started working in R using RStudio. The second workshop contains information on how to write code to visualize data using ggplot2. The third workshop, Intermediate R, explored more sophisticated R code involving logicals, loops, and functions. The code chunks in this interactive document mimic the code chunks you can use on your own projects in RMarkdown, but you will need to download and install both R and RStudio on your own computer.
Montana State University R Workshops Team
These materials were adapted from materials generated by the Data Carpentries (https://datacarpentry.org/) and were originally developed at MSU by Dr. Allison Theobold. The workshop series is co-organized by the Montana State University Library and Social Data Collection and Analysis Services (Social Data) which is an MSU Core Facility and also part of the Data Science Core for Montana INBRE. Social Data is supported by Montana INBRE (National Institutes of Health, Institute of General Medical Sciences Grant Number P20GM103474).
Research related to the development of these workshops appeared in:
- Allison S. Theobold, Stacey A. Hancock & Sara Mannheimer (2021) Designing Data Science Workshops for Data-Intensive Environmental Science Research, Journal of Statistics and Data Science Education, 29:sup1, S83-S94, DOI: 10.1080/10691898.2020.1854636
The workshops for 2024-2025 involve modifications of materials and are licensed CC-BY. This work is licensed under a Creative Commons Attribution 4.0 International License.
The workshops for 2024-2025 involve modifications of materials and are being taught by:
Ava Yazdian
- Ava Yazdian is a current sophomore studying conservation biology and ecology. She enjoys helping others learn the language of statistics and data analysis through her experience in her statistics coursework. Her interests include plant ecology, skiing, and ceramics.
Greta Linse
- Greta Linse is the Facility Manager of Social Data Collection and Analysis Services (https://www.montana.edu/socialdata/) among other on campus roles. Greta has been teaching, documenting, and working with statistical software including R and RStudio for over 10 years.
Sally Slipher
- Sally Slipher is a research statistician for Social Data. She has taught statistics in the past and uses R extensively (and sometimes other coding languages) to explore data and put together analyses.
Sara Mannheimer
- Sara Mannheimer is an Associate Professor and Data Librarian at Montana State University, where she helps shape practices and theories for curation, publication, and preservation of data. Her research examines the social, ethical, and technical issues of a data-driven world. She is the project lead for the MSU Dataset Search and the Responsible AI in Libraries and Archives project. Her 2024 book, Scaling Up, explores how data curation can address epistemological, ethical, and legal issues in qualitative data reuse and big social research.
The materials have also been modified and improved by:
- Dr. Mark Greenwood
- Harley Clifton
- Eliot Liucci
- Dr. Allison Theobold