Learning Objectives
- Use relational operators
- Join sequences of relational operations together with logical operators
- Create conditional statements (ifelse())
- Create for loops
- Create functions
Orientation of/for the workshop
This workshop assumes some basic familiarity with working in R, such as what you might obtain in the “Introduction to R” workshop or in a statistics course that uses R heavily, such as STAT 217 or STAT 411/511. If you have not interacted with R previously, some of the assumptions about your background for this workshop might be a barrier. We recommend getting what you can from this workshop; you can always revisit the materials at a later date after filling in some of those basic R skills. We all often revisit materials and discover new and deeper aspects of the content that we were not in a position to appreciate in a first exposure.
In order to focus this workshop on coding, we developed this interactive website for you to play in a set of “sandboxes” and try your hand at implementing the methods we are discussing. When each code chunk is ready to run (all can be edited, many have the code prepared for you), you can click on “Run Code”. This will run R in the background on a server. For the “Challenges”, you can get your answer graded, although many have multiple “correct” answers, so don’t be surprised if our “correct” answer differs from yours. The “Solution” is also provided in some cases so you can see a solution, but you will learn more by trying the challenge first before seeing the answer. Each sandbox functions independently, which means that you can pick up working at any place in the documents and re-set your work without impacting other work (this is VERY different from how R usually works!). Hopefully this allows you to focus on the code and what it does. The “Start over” button can be used on any individual sandbox, or you can use the one on the left tile, which will re-set all the code chunks to their original status.
These workshops are taught by Greta Linse and Esther Birch and co-organized by the MSU Library, Statistical Consulting and Research Services (SCRS), and the Department of Mathematical Sciences. More details on us and other workshops are available at the end of the session or via https://www.montana.edu/datascience/training/#workshop-recordings.
First, we will start with the building blocks of conditional statements, the relational operators. Next, we will join sequences of relational operations together with logical operators. We will then use these relational operators inside conditional statements (ifelse()).
Finally, we will dive into two methods to avoid copying and pasting your code numerous times to accomplish the same task. The for loop will be introduced for procedures done multiple times in one location in your code, and functions will be introduced for procedures done throughout your code.
Let’s get started!
Refresher
This workshop covers content that will require that we remember how to extract elements from vectors and dataframes. Let’s work through the following warm-ups to refresh how we can use these operations.
Extracting Elements from a Vector
To extract an element from a vector we use the bracket ([]) notation. Recall, a vector has only one dimension, so inside the brackets goes a single index or a single vector of indices. To extract elements of a vector, we can give their corresponding index, starting from the number one. (R uses a one-based numbering system.)
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
x[1] ## Extracts the first element of x
x[1:3] ## Extracts a slice of x, from the first index to the third index
x[-2] ## Extracts everything in x BUT the second index
Challenge 1
Why does R produce an error for the following code?
x[-1:3]
Extracting Elements from a Dataframe
A dataframe is a list of vectors, where each vector is permitted to have a different data type. Because dataframes have two dimensions (columns and rows), we are able to extract two types of elements.
To extract a column from a dataframe we can use the familiar $ operator:
example_df <- data.frame(A = c(1, 5, 9, 13),
B = c(2, 6, 10, 14),
C = c(3, 7, 11, 15),
D = c(4, 8, 12, 16)
)
example_df$A ## Extracts the A column from the dataframe
This is useful if we only wish to extract a single vector from our dataframe. If we want multiple vectors, then using matrix notation is simpler.
To extract elements from a dataframe using brackets ([row, column]):
example_df[1, 1] ## Extracts the first row, first column entry
example_df[, 1] ## Extracts EVERY row in the first column
example_df[1, ] ## Extracts EVERY column in the first row
Challenge 2 (Part 1)
I want to extract the 3rd and 4th columns. What is wrong with my code?
## Change the code below to extract the 3rd and 4th columns
example_df[, 3, 4]
example_df[, 3:4]
Challenge 2 (Part 2)
How would you change it to extract these columns? What if I wanted to extract the 2nd and 4th columns?
## Write code to extract the 2nd and 4th columns
example_df[, c(2,4)]
Relational Operators
Relational operators tell how one object relates to another, where the output is a logical value. There are a variety of relational operators that you have used before in mathematics but take on a slightly different feel in data science. We will walk through a few examples of different types of relational operators and the rest will be left as exercises.
w <- 10.2
x <- 1.3
y <- 2.8
z <- 17.5
dna1 <- "attattaggaccaca"
dna2 <- "attattaggaacaca"
Equality & Inequality
This type of operator tells whether an object is equivalent to another object (==), or its negation (!=).
dna1 == dna2
dna1 != dna2
Greater & Less
These statements should be familiar, with a bit of a notation twist. To write a strict greater than or less than statement you would use > or < in your statement. To add an equality constraint, you would add a = sign to the inequality statement (>= or <=).
w > 10
x > y
Inclusion
This type of statement checks if a character, number, or factor is included in a vector. The %in% operator tells you whether a value is included in an object. Several values can be linked together into a vector, and the output will be a vector of logical values, one for each value checked.
colors <- c("green", "pink", "red")
"blue" %in% colors
numbers <- c(1, 2, 3, 4, 5)
5 %in% numbers
some_letters <- c("a", "b", "c", "d", "e")
c("a", "b") %in% some_letters
Challenge 3
Write R code to see if:
- 2 * x + 0.2 is equal to y
- “hello” is greater than or equal to “goodbye”
- TRUE is greater than FALSE
- dna1 is longer than 5 bases (use nchar() to figure out how long a string is)
# Challenge 3 R code goes here
2 * x + 0.2 == y
"hello" >= "goodbye"
TRUE > FALSE
nchar(dna1) > 5
Comparing Decimal Valued Numbers
Why does the output from 2 * x + 0.2 == y not seem right? What do you think might be going on?
General R advice is that the == operator should only be used for comparisons of integer and Boolean data types. To store decimal valued data types (doubles), R, like other programming systems, uses a format (binary floating-point) that cannot exactly represent a number like 1.3. When the code is interpreted by R, the “1.3” is rounded to the nearest number in floating-point format, which results in a small rounding error even before the calculation happens.
This is why general R advice is to use the all.equal(x, y) function to check for equality of two doubles. This function, applied to the two arguments, “is a utility to compare R objects x and y testing near equality.” This means that the function allows for a little bit of wiggle room in the rounding errors that are associated with doubles.
#Verify the result using all.equal(..., ...)
all.equal(2 * x + 0.2, y)
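One practical wrinkle is worth a short sketch: when two values are not nearly equal, all.equal() does not return FALSE but a character string describing the difference. When a single TRUE/FALSE is needed, it is commonly wrapped in isTRUE(). The names a, b, and near_equal below are only illustrative.

```r
a <- 2 * 1.3 + 0.2
b <- 2.8

a == b                     # exact comparison of two doubles
all.equal(a, b)            # comparison that allows a small tolerance
all.equal(a, 3)            # not FALSE: a string describing the difference
isTRUE(all.equal(a, 3))    # isTRUE() turns the result into a single logical

near_equal <- isTRUE(all.equal(a, b))
near_equal
```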
Comparing Characters
How did R know how to compare the words hello and goodbye?
"hello" > "goodbye"
# What does this output suggest?
Sys.getlocale(category = "LC_COLLATE")
Challenge 4
What is going on in the code below? Why is R giving a warning?
some_letters != c("a", "c")
Why does R give TRUE as the third element of this vector, when some_letters[3] != "c" is obviously false?
Recycling
When you use != or ==, R tries to compare each element of the left argument with the corresponding element of its right argument. Then what happens when you compare vectors of different lengths?
When one vector is shorter than the other, it gets recycled.
In this case, R repeats c("a", "c") as many times as necessary to match the length of the some_letters vector. So, we get c("a", "c", "a", "c", "a"). Since the recycled "a" doesn’t match the third element of some_letters, the value of != is in fact TRUE.
Note: We got lucky here! R output a warning because the length of the longer vector was not a whole-number multiple of the length of the shorter one (e.g., 2 times as long). If it had been, R would have carried out the same recycling procedure, but would not have output a warning!
This is why the inclusion (%in%) operator is so great! It checks each value on the left against every value on the right, without any silly recycling! To exclude values you place a ! in front of the entire statement, not directly in front of the %in%.
!c("a", "b") %in% some_letters
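To make the silent case concrete, here is a small sketch using a hypothetical four-letter vector (four_letters is not part of the workshop data). Because 4 is a multiple of 2, the shorter vector is recycled without any message, while %in% checks membership instead.

```r
four_letters <- c("a", "b", "c", "d")

# Length 4 is a multiple of length 2, so c("a", "c") is silently
# recycled to c("a", "c", "a", "c") before the comparison
recycled <- four_letters != c("a", "c")
recycled

# %in% asks, for each element on the left, "is it anywhere on the right?"
included <- four_letters %in% c("a", "c")
included

# Negate the whole statement to keep the values that are NOT included
excluded <- !four_letters %in% c("a", "c")
four_letters[excluded]
```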
Logical Values
You can also use logical values (TRUE, FALSE) to extract elements of a vector.
some_letters[c(TRUE, TRUE, FALSE, FALSE, FALSE)]
## Extracts the first two elements of the some_letters vector
This should give you an intuition as to how we might be able to use the results of relational statements to extract the elements of the vector that satisfy the relation.
Which
The which statement returns the indices of a vector where a relational statement is satisfied (evaluates to TRUE).
x_2 <- c(3, 5, 7, 9, 11, 13, 15)
which(x_2 > 8)
which(x_2 == 7)
## Mini-challenge: How would you use the indices from these "which" statements to
## extract the elements of x_2 that meet the criteria?
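One possible answer to the mini-challenge, sketched with illustrative names (idx, by_index, by_logical): the indices returned by which() can go straight into the brackets, and the logical vector itself works the same way.

```r
x_2 <- c(3, 5, 7, 9, 11, 13, 15)

idx <- which(x_2 > 8)        # positions where the condition is TRUE
by_index <- x_2[idx]         # extract using those positions

by_logical <- x_2[x_2 > 8]   # extract using the logical vector directly

by_index
by_logical
```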
Matrices
The above relational and which statements can be applied to a matrix, or to a subset of the matrix (e.g., a vector). The relational and which statements are applied element-wise (a step-by-step progression through the matrix/vector entries).
# The following are dating data, of one person's messages received per day for one week
messages <- data.frame(okcupid = c(16, 9, 13, 5, 2, 17, 14),
match = c(17, 7, 5, 16, 8, 13, 14))
# This makes the data from OkCupid the first column and the data from Match the second column
row.names(messages) <- c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")
Challenge 5
Use the messages data frame to return a matrix of logical values which answers the following question:
- For what days were the number of messages at either site greater than 12?
# Create a logical matrix to provide information that would allow you to answer the question:
# For what days were the number of messages at either site greater than 12?
messages > 12
Challenge 6
Use the messages data frame to return the rows of messages which answer the following questions (use rows of the data frame to answer):
- When were the messages at OkCupid equal to 13?
- When were the messages at OkCupid greater than Match?
# Use rows of the matrix to answer:
# when were the messages at OkCupid equal to 13?
# when were the messages at OkCupid greater than Match?
messages[which(messages[,1] == 13),]
messages[which(messages[,1] > messages[,2]),]
Subset
The subset command takes in an object (vector, matrix, data frame) and returns the subset of that object where the entries meet the relational conditions specified (evaluate to TRUE).
x_3 <- c(3, 5, 7, 9, 11, 13, 15)
subset(x_3, x_3 > 6)
Challenge 7
Using the okcupid “data” from above, answer the following question:
- Change the which() statement code to a subset() statement, extracting the days that the number of messages at OkCupid was greater than the messages at Match.
# Change the which() statement code to a subset() statement, extracting the days that the number of messages at OkCupid was greater than the messages at Match
subset(messages, messages[,1] > messages[,2])
Logicals
These statements allow us to change or combine the results of the relational statements we discussed before, using and, or, or not.
“and” statements (&)
These statements evaluate to TRUE only if every relational statement evaluates to TRUE.
- (3 < 5) & (9 > 7) would evaluate to TRUE because both relational statements are TRUE
- (3 > 5) & (9 > 7) would evaluate to FALSE because one of the relational statements is FALSE (the first one)
“or” statements (|)
These statements evaluate to TRUE if at least one relational statement evaluates to TRUE.
- (3 > 5) | (9 > 7) would evaluate to TRUE because one of the relational statements is TRUE (the second one)
- (3 > 5) | (9 < 7) would evaluate to FALSE as both relational statements evaluate to FALSE
“not” statements (!)
These statements negate the statement they precede, changing TRUE to FALSE and FALSE to TRUE.
- is.numeric(5) would evaluate to TRUE because 5 is a number
- !is.numeric(5) would evaluate to FALSE as it negates the statement it precedes (!TRUE is FALSE)
NOTE: The && and || logical operators do not evaluate the same as their single counterparts. They are meant for single (length-one) logical values, for example inside an if condition. In older versions of R they evaluated to TRUE or FALSE based only on the first element of each vector or matrix; since R 4.3.0, giving them an input with more than one element is an error.
Below are examples of using logicals to evaluate vectors. For example, in the first line of code three statements are checked: TRUE & TRUE, TRUE & FALSE, and FALSE & FALSE. The first statement is “TRUE” since both elements were “TRUE”. The second and third statements are “FALSE” since at least one element was “FALSE”.
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
## In R 4.3.0 and later the next line is an error, because && only accepts single values
c(TRUE, TRUE, FALSE) && c(TRUE, FALSE, FALSE)
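As a hedged sketch of where the double operators fit: && and || are intended for single TRUE/FALSE values, and all() or any() can reduce a whole logical vector to one value first. The vector values and the result names below are made up for illustration.

```r
values <- c(4, 8, 15)

# Single values on both sides: && stops as soon as the answer is known
has_values <- length(values) > 0 && values[1] > 3

# To reduce a whole logical vector to one value, use all() or any()
all_over_five <- all(values > 5)   # is every element greater than 5?
any_over_ten <- any(values > 10)   # is at least one element greater than 10?

has_values
all_over_five
any_over_ten
```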
Challenge 8
Using the okcupid vector from above, answer the following questions:
1. Is the last day of the week under 5 messages or above 10 messages? (HINT: last <- tail(messages$okcupid, n = 1) could be helpful)
2. Is the last day of the week between 15 and 20 messages, excluding 15 but including 20?
# Is the last day of the week under 5 messages or above 10 messages?
# (hint: last <- tail(messages$okcupid, n=1) could be helpful)
# Is the last day of the week between 15 and 20 messages, excluding 15 but including 20?
# Make sure you test your code with some other values
last <- tail(messages$okcupid, 1)
last < 5 | last >10
last > 15 & last <= 20 #Or last > 15 & (last < 20 | last == 20)
The subset command (from before) can also accept more than one relational condition if joined by logicals.
bad_days <- subset(messages, okcupid < 6 | match < 6)
bad_days
good_days <- subset(messages, okcupid > 10 & match > 10)
good_days
Conditional Statements
Conditional statements utilize relational and logical statements to change the results of your R code. You may have encountered an if else statement before (or not), but let’s break down exactly what R is doing when it evaluates them.
If Statements
First, let’s start with an if statement, the often overlooked building block of the if else statement. The if statement is structured as follows:
if(condition){
statement
}
- the condition inside the parentheses (a relational statement) is what the computer executes to check its logical value,
- if the condition evaluates to TRUE then the statement inside the curly braces ({}) is output, and
- if the condition is FALSE nothing is output.
NOTE: In R the if statement, as described above, will only accept a single value (not a vector or matrix).
y <- -3 ## Try changing this number, for example y <- 5
if(y < 0){
"y is a negative number"
}
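Because the if condition must be a single value, a vector of comparisons needs to be collapsed first. The sketch below, with a hypothetical vector y_values, uses any() and all() for that purpose.

```r
y_values <- c(-3, 5, 0)

# y_values < 0 has three elements, so it cannot go into if() directly;
# any() collapses it to a single TRUE/FALSE
msg_any <- "no values are negative"
if (any(y_values < 0)) {
  msg_any <- "at least one value is negative"
}

# all() is TRUE only when every element satisfies the condition
msg_all <- "not every value is negative"
if (all(y_values < 0)) {
  msg_all <- "every value is negative"
}

msg_any
msg_all
```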
Challenge 9
Using the last number from Challenge 8, write an if statement that prints “You’re popular!” if the number of messages from okcupid exceeds 10.
# hint: use the last day of the week for okcupid that you made above
last <- tail(messages$okcupid, 1)
if (last > 10){
"You're popular!"
}
If/Else Statements
Since nothing is output whenever an if statement evaluates to FALSE, you might see why an else statement could be beneficial! An else statement allows for a different statement to be output whenever the if condition evaluates to FALSE. The if else statement is structured as follows:
if(condition){
statement1
} else{
statement2
}
- again, the if condition is executed first,
- if it evaluates to TRUE then the first statement (statement1) is output,
- if the condition is FALSE the computer moves on to the else statement, and the second statement (statement2) is output.
NOTE: In R the if else statement, as described above, will only accept a single value (not a vector or matrix).
y <- -3 ## Try changing this number, for example y <- 5
if(y < 0){
"y is a negative number"
} else{
"y is either positive or zero"
}
NOTE: R accepts both if else statements structured as outlined above and the built-in ifelse() function. This function accepts both singular and vector inputs and is structured as follows:
ifelse(condition, statement1, statement2)
where the first argument is the conditional (relational) statement, the second argument is the value returned when the condition is TRUE (statement1), and the third argument is the value returned when the condition is FALSE (statement2).
y <- -3 ## Try changing this number, for example y <- 5
ifelse(y < 0, "y is a negative number", "y is either positive or zero")
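Since ifelse() also accepts vectors, the same call can label several values at once. A minimal sketch, using a made-up vector y_vec:

```r
y_vec <- c(-3, 5, 0)

# The condition is checked element by element, and one label is
# returned for each element of y_vec
labels <- ifelse(y_vec < 0, "negative", "positive or zero")
labels
```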
Challenge 10
Using the “if” statement from Challenge 9, add the following else statement: When the “if”-condition on messages is not met, R prints out “Send more messages!”.
# Challenge: Rewrite the if, else function from above using R's built in ifelse() function.
last <- tail(messages$okcupid, 1)
ifelse(last > 10, "You're popular!", "Send more messages!")
Else/If Statements
On occasion, you may want to add a third (or fourth, or fifth, …) condition to your if else statement, which is where the else if statement comes in. The else if statement is added to the if else as follows:
if(condition1){
statement1
} else if(condition2){
statement2
} else{
statement3
}
- The if condition (condition1) is executed first,
  - if it evaluates to TRUE then the first statement (statement1) is output,
  - if the condition is FALSE the computer moves on to the else if condition,
- Now the second condition (condition2) is executed,
  - if it evaluates to TRUE then the second statement (statement2) is output,
  - if the condition is FALSE the computer moves on to the else statement, and
- the third statement (statement3) is output.
y <- -3 ## Try changing this number, for example y <- 5 and y <- 0
if(y < 0){
"y is a negative number"
} else if(y == 0){
"y is zero"
} else{
"y is positive"
}
NOTE: Conditional statements should be written so that the components are mutually exclusive (an observation can only belong to one piece).
The modulo operator (%%) returns the remainder of the division of the number on the left by the number on the right; for example, 5 modulo 3, or 5 %% 3, is 2.
x <- 6
if(x %% 2 == 0){
"x is divisible by 2"
} else if(x %% 3 == 0){
"x is divisible by 3"
} else{
"x is not divisible by 2 or 3"
}
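The example above is a case where the conditions are not mutually exclusive: 6 is divisible by both 2 and 3, but only the first branch whose condition is TRUE runs. One way to handle the overlap, sketched here with an illustrative result variable, is to check the combined case first.

```r
x <- 6

# Check the most specific condition first, so that each value of x
# falls into exactly one branch
if (x %% 2 == 0 && x %% 3 == 0) {
  result <- "x is divisible by 2 and 3"
} else if (x %% 2 == 0) {
  result <- "x is divisible by 2 only"
} else if (x %% 3 == 0) {
  result <- "x is divisible by 3 only"
} else {
  result <- "x is not divisible by 2 or 3"
}
result
```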
Loops
Loops are a popular way to iterate or replicate the same set of instructions many times. A loop is no more than an automated process, created by organizing a sequence of steps in your code that need to be repeated. In general, the advice of many R users would be to learn about loops, but once you have a clear understanding of them, to instead use vectorization when your current iteration doesn’t depend on the previous iteration.
Loops will give you a detailed view of the process that is happening and how it relates to the data you are manipulating. Once you have this understanding, you should put your effort into learning about vectorized alternatives, as they pay off in efficiency. These loop alternatives (the apply and purrr families) will be covered in a later workshop.
Typically in data science, we separate loops into two types. The loops that execute a process a specified number of times, where the “index” or “counter” is incremented after each cycle, are part of the for loop family. Other loops that only repeat themselves until a conditional statement is evaluated to be FALSE are part of the while loop family.
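This workshop focuses on for loops, but a tiny sketch of a while loop may help show the difference: the number of repetitions is not fixed in advance, and the loop stops as soon as its condition is FALSE. The names count and total are illustrative.

```r
count <- 0
total <- 0

# Keep adding the next whole number until the running total reaches 20
while (total < 20) {
  count <- count + 1
  total <- total + count
}

count   # how many numbers were added
total   # the running total when the loop stopped
```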
For Loops
In the for loop figure above:
- The starting point (black dot) represents the initialization of the object(s) being used in the for loop (i.e., the variables). R requires you to initialize the objects you will use in your loop before you use them; it does not automatically create the objects for you!
- The diamond shapes represent the repeat condition the computer is required to evaluate. The computer evaluates the conditional statement (i in sequence) as either TRUE or FALSE. In other terms, you are testing if the current value of i (the index/counter) is within the specified range of values (sequence), where this range is either defined in the initialization or directly within the condition statement (like 1:100).
- The rectangle represents the set of instructions to execute for every iteration. This could be a simple statement, a block of instructions, or another loop (nested loops).
The computer marches through this process until the repeat decision (condition) evaluates to FALSE (i is not in sequence).
Notation
The for loop repeat statement is placed between parentheses after the for statement. Directly after the repeat statement, in curly braces, are the instructions that are to be repeated. You will find the formatting that you like best for things such as loops and functions, but here is our preferred syntax:
num <- 100
for(i in 1:num){
statement
}
Alternatively, you could use:
sequence <- 1:100
for(i in sequence){
statement
}
Example:
for(i in 1:10) {
print(i)
}
i ## should be 10 (the last number of the sequence 1:10), unless the loop had issues!
Now, let’s add the corresponding entries of two vectors together!
x <- seq(1, 10, by = 1)
y <- seq(10, 28, by = 2)
## create an empty vector to store added quantities
z <- rep(NA, length(x))
for(i in 1:10) {
# write a loop to store the entries of x added to the entries of y in z
}
z ## examine the resulting vector
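One possible way to fill in the loop body is sketched below. It uses seq_along(x) rather than 1:10 so that the loop follows the length of x; that choice is ours and not required by the exercise.

```r
x <- seq(1, 10, by = 1)
y <- seq(10, 28, by = 2)

## create an empty vector to store added quantities
z <- rep(NA, length(x))

for (i in seq_along(x)) {
  # store the i-th entry of x added to the i-th entry of y
  z[i] <- x[i] + y[i]
}
z
```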
Challenge 11
How could the above calculation be done outside of a loop?
# code to execute adding of the x and y vectors from the example above,
# NOT in a loop
z <- x + y
z
NOTE: As we did in the above for loop, it is often useful to add a print() statement inside the loop. This allows you to verify that the process is executing correctly and can save you some major headaches!
Challenge 12
Recursive for-loop! Read in the BlackfootFish dataset and modify the code to write a for-loop to find the indices needed to sample every \(7^{th}\) row from the dataset, starting with the \(1^{st}\) row, until you’ve sampled 1200 rows. This would allow us to split the data set into “testing” and “training” samples, possibly using the “testing” data for assessing a model we develop using the “training” data.
BlackfootFish <- read.csv("https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Intermediate%20R/AY%202020-2021/data/BlackfootFish2.csv", header = TRUE)
n_testing <- 1200
samps <- rep(NA, n_testing)
## Initializing the samps vector, for storing indices
samps[1] <- 1
## Setting first sample index to 1 (first row)
# Code snippet:
# Create the code for the process that will be executed at each step
# e.g., How will you get the next sample after 1?
for(i in 2:#ending index here){
samps[i] <- # process you execute at every index
}
testing <- BlackfootFish[samps, ]
training <- BlackfootFish[-samps, ]
BlackfootFish <- read.csv("https://raw.githubusercontent.com/saramannheimer/data-science-r-workshops/master/Intermediate%20R/AY%202020-2021/data/BlackfootFish2.csv", header = TRUE)
n_testing <- 1200
samps <- rep(NA, n_testing)
## Initializing the samps vector, for storing indices
samps[1] <- 1
## Setting first sample index to 1 (first row)
# Code snippet:
# Create the code for the process that will be executed at each step
# e.g., How will you get the next sample after 1?
for(i in 2:n_testing){
samps[i] <- samps[i-1] + 7
}
testing <- BlackfootFish[samps, ]
training <- BlackfootFish[-samps, ]
Before we move on, is there a better way to split a data set into training and testing versions than taking every 7th observation?
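One common alternative, sketched here under our own assumptions, is to choose the testing rows at random with sample(). The small fish data frame below is a made-up stand-in so the code runs without downloading anything, and the seed value is arbitrary.

```r
set.seed(217)  # arbitrary seed so the random split is reproducible

# Hypothetical stand-in for a real dataset such as BlackfootFish
fish <- data.frame(id = 1:100, length = seq(150, 447, by = 3))

n_testing <- 20
# Draw n_testing distinct row indices at random (without replacement)
samps <- sample(nrow(fish), size = n_testing)

testing <- fish[samps, ]
training <- fish[-samps, ]

nrow(testing)
nrow(training)
```

A random split avoids any pattern tied to the order in which the rows were recorded, which a fixed "every 7th row" rule could pick up.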
Nesting For Loops
The section title suggests that for loops can also be nested, so you’re probably wondering how this would look.
In nested for loops, you have two indices keeping track of where you are in the loop. The outer loop will have an index, say i, and the inner loop will have an index, say j. Let’s look at the nested for loop below and see if we can figure out how the loop proceeds through the indices.
for(i in 1:5){
for(j in c("a", "b", "c", "d", "e")){
print(paste(i, j))
}
}
QUESTION:
When would you possibly use this in your code?
Well, suppose you wish to manipulate a matrix by setting its elements to specific values, based on their row and column position. You will need to use nested for loops in order to assign each of the matrix’s entries a value.
- The index i runs over the rows of the matrix
- The index j runs over the columns of the matrix
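A minimal sketch of that idea follows. The matrix size and the rule used for each entry (row number times column number) are made up purely for illustration.

```r
n_rows <- 3
n_cols <- 4

## initialize an empty matrix before the loop
mat <- matrix(NA, nrow = n_rows, ncol = n_cols)

for (i in 1:n_rows) {        # i runs over the rows
  for (j in 1:n_cols) {      # j runs over the columns
    mat[i, j] <- i * j       # value depends on the row and column position
  }
}
mat
```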
But when should you use a loop? Couldn’t you just repeat the set of instructions the number of times you want to do it?
A rule of thumb is that if you need to perform the same action in one place of your code three or more times, then you are better served by using a loop. A loop makes your code more compact, readable, maintainable, and saves you some typing! If you have the same process multiple places in your code, that’s where functions come in handy!
Functions
What is a function? In rough terms, a function is a set of instructions that you would like repeated (similar to a loop), but that is more efficient to keep self-contained in a sub-program and call upon when needed. A function is a piece of code that carries out a specific task, accepting arguments (inputs), and returning an output.
Functions gather a sequence of operations into a whole, preserving it for ongoing use. Functions provide:
- a name we can remember and invoke it by,
- relief from the need to remember individual operations,
- a defined set of inputs and expected outputs, and
- rich connections to the larger programming environment.
As the basic building block of most programming languages, user-defined functions constitute “programming” as much as any single abstraction can. If you have written a function, you are a computer programmer!
At this point, you may be familiar with a smattering of R functions in a few different packages (e.g., mean, table, lm, glm, str, tapply, etc.). Some of these functions take multiple arguments and return a single output, but the best way to learn about the inner workings of functions is to write your own!
User Defined Functions
A function allows us to repeat several operations with a single command. Take for instance a function which converts heights from feet and inches to centimeters, named feet_inch_to_cm().
feet_inch_to_cm <- function(feet, inches) {
inches <- feet*12 + inches
cm <- inches * 2.54
return(cm)
}
Challenge 13
Use the feet_inch_to_cm function to convert your height to cm!
## Your code goes here. NOTE, the solution will use a height of 5' 7"
feet_inch_to_cm(5,7)
We define feet_inch_to_cm() by assigning the output of function to it. The list of argument names is contained within parentheses. Next, the body of the function (the statements that are executed when it runs) is contained within the curly braces ({}). The statements in the body are indented by two spaces. This makes the code easier to read but does not affect how the code operates.
It is useful to think of creating functions like writing a cookbook.
- Define the “ingredients” that your function needs. In this case, we only need two ingredients to use our function: “feet” and “inches”.
- State what we want to do with the ingredients. In this case, we are taking our ingredients and applying a set of mathematical operators to them.
- Declare what the final product (output) of your function will be. This typically lives in a return() statement, but we will see other methods to use.
When we call the function, the values we pass to it as arguments are assigned to those variables so that we can use them inside the function. Inside the function, we use a return statement to send a result back to whoever asked for it.
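As a small sketch of how arguments are matched, the function can be called with values in order or with the argument names spelled out. The function is repeated here so the block runs on its own; the intermediate name total_inches is our own choice.

```r
feet_inch_to_cm <- function(feet, inches) {
  total_inches <- feet * 12 + inches   # convert everything to inches
  cm <- total_inches * 2.54            # 1 inch is 2.54 cm
  return(cm)
}

# Arguments matched by position: first feet, then inches
by_position <- feet_inch_to_cm(5, 7)

# Arguments matched by name: the order no longer matters
by_name <- feet_inch_to_cm(inches = 7, feet = 5)

by_position
by_name
```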
Function Scoping
When writing and using functions, you should be careful to consider where each of the variables you are using are defined. Variables that are defined within a function belong to a local environment, and are not accessible from outside of the function. Take this example:
x <- 10
## globally defined variable
f <- function(){
## locally defined x and y
x <- 1
y <- 2
c(x, y)
}
f()
x
We notice that the globally defined value of x is not called inside the function, as the function defines x within itself, so it does not search for a value of x! Now look at another example, where x is not defined within the function.
g <- function(){
## locally defined y
y <- 2
c(x, y)
}
x <- 15
g()
x <- 20
g()
Summary
- When you call a function, a new environment is created for the function to do its work in.
- The new environment is populated with the argument values passed in to the function.
- Objects are looked for first in the function environment.
- If objects are not found in the function environment, they are then looked for in the global environment.
- This scoping issue emphasizes the need for using good global variable and data frame names that are informative and explicit, not reused, and also not function or typical argument names.
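To see these lookup rules side by side, the sketch below compares a function that relies on a global x with a hypothetical variant h() that takes x as an argument. Passing values in as arguments makes the result independent of whatever happens to be in the global environment.

```r
x <- 15

g <- function() {
  y <- 2
  c(x, y)      # x is not defined locally, so it is found in the global environment
}

h <- function(x) {
  y <- 2
  c(x, y)      # x is the argument passed in, not the global x
}

first <- g()       # uses the global x, which is 15
x <- 20
second <- g()      # same call, different result, because the global x changed
explicit <- h(15)  # result depends only on the argument

first
second
explicit
```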
Why Write a Function?
Similar to loops, you will often find that you have performed the same task multiple times. A good rule of thumb is that if you’ve copied and pasted the same code twice (so you have three copies), you should write a function instead!
Example of Scaling a Variable without a Function
If we wanted to rescale every quantitative variable in a dataset so that the variables have values between 0 and 1, we could use the following formula:
\[y_{i, scaled} = \frac{y_i - \min\{y_1, y_2, \ldots, y_n\}}{\max\{y_1, y_2, \ldots, y_n\} - \min\{y_1, y_2, \ldots, y_n\}}\]
The following R code would carry out this rescaling procedure for the length and weight columns of the BlackfootFish data:
BlackfootFish$length_scaled <- (BlackfootFish$length - min(BlackfootFish$length, na.rm = TRUE)) /
(max(BlackfootFish$length, na.rm = TRUE) - min(BlackfootFish$length, na.rm = TRUE))
BlackfootFish$weight_scaled <- (BlackfootFish$weight - min(BlackfootFish$weight, na.rm = TRUE)) /
(max(BlackfootFish$weight, na.rm = TRUE) - min(BlackfootFish$length, na.rm = TRUE))
We could continue to copy and paste for other columns of the data, but you can probably get the idea. This process of duplicating an action multiple times makes it difficult to understand the intent of the process. Additionally, it makes it very difficult to spot the mistakes.
QUESTION: Did you spot the mistake in the weight rescaling?
Often you will find yourself in the position of needing to find a function that performs a specific task, but you do not know of a function or a library that would help you. You could spend time Googling for a solution, but in the amount of time it takes you to find something, it is possible that you could have already written your own function!
Example of Scaling a Variable with a Function
Let’s transform the repeated process above into a rescaling function.
The following snippet of code rescales the length column to be between 0 and 1:
(BlackfootFish$length - min(BlackfootFish$length, na.rm = TRUE)) /
(max(BlackfootFish$length, na.rm = TRUE) - min(BlackfootFish$length, na.rm = TRUE))
Our goal is to turn this snippet of code into a general rescale function that we can apply to any numeric vector.
- The first step is to examine the process to determine how many inputs there are.
- The second step is to change the code snippet to instead refer to these inputs using temporary variables.
For this process there is one input, the numeric vector to be rescaled (currently BlackfootFish$length). We instead want to refer to this input using a temporary variable. It is common (in R) to refer to a generic vector of data as x.
Challenge 14
- Modify the code snippet above to instead refer to a temporary variable x. Make sure your code does not depend on a specific dataset.
- Save the rescaled vector in a variable called x_rescaled.
# code modification with temporary variable "x"
x <- BlackfootFish$length
x_rescaled <- (x - min(x, na.rm = TRUE)) /
(max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
But now take a closer look at our process. Is there any duplication? An obvious duplicated statement is min(x, na.rm = TRUE). It would make more sense to calculate the minimum of the data once, store the result, and then refer to it when needed. We also notice that we use the maximum value of x, so we could instead calculate the range of x and refer to the first (minimum) and second (maximum) elements when they are needed.
Note: The range function in R takes a numeric vector and returns its minimum and maximum (so this is not the “range” statistic but the information needed to calculate it):
range(seq(1, 5))
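As a small illustration of what range() returns and how the na.rm argument changes the result, consider the made-up vector below. The object names vals and vals_range are hypothetical and used only for this sketch.

```r
# Made-up values with one missing observation
vals <- c(4, NA, 9, 1)

range(vals)                # NA NA, because the missing value is not ignored

vals_range <- range(vals, na.rm = TRUE)
vals_range[1]                    # first element: the minimum, 1
vals_range[2]                    # second element: the maximum, 9
vals_range[2] - vals_range[1]    # the "range" statistic, 8
```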
What should we call this variable containing the range of x? Some important advice on naming variables in R:
- Make sure that the name you choose for your variable is not already the name of an existing R function (range, mean, sd, hist, etc.). True reserved words such as if, for, and TRUE cannot be used as names at all.
  - A way to avoid this is to use the help system to see if that name already exists (?functionName).
- It is possible to reuse the name of an existing function or variable, but it is not recommended!
- Variable names cannot begin with a number.
- Keep in mind that capitalization matters in R!
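The naming advice above can also be checked from code rather than through the help pages. The sketch below is one possible approach using base R's exists() and make.names(); the candidate names are made up for illustration.

```r
# Is the name already taken by something R can see?
exists("range")            # TRUE: base R defines a function called range
exists("fish_len_range")   # FALSE unless you have created this object

# make.names() returns a syntactically valid version of each name,
# so a name that comes back unchanged is safe to use as written
candidates <- c("x_range", "2fish", "if")
valid <- make.names(candidates)
valid                      # "x_range" "X2fish" "if."
candidates == valid        # TRUE FALSE FALSE

# Capitalization matters: these are two different names
fish_len <- 250
exists("Fish_Len")         # FALSE
```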
We will call this intermediate variable x_range (for range of the x variable).
Challenge 15
Create another intermediate variable called x_range that contains the range of the intermediate variable x, using the range() function. Make sure that you specify the na.rm option to ignore any NAs in the input vector.
x <- BlackfootFish$length
# Create x_range = the range of x
# Rewrite the code snippet from Challenge 14 to now
# refer to x_range instead of min(x) and max(x)
x <- BlackfootFish$length
x_range <- range(x, na.rm = TRUE)
x_rescaled <- (x - x_range[1]) /
(x_range[2] - x_range[1])
How does this process end up with us writing our own function? What do you need in order to write a function?
- The task the function will solve (what you’ve been copying and pasting) and
- The inputs of the function (the intermediate variables)
We now have all these pieces ready to put together. It’s time to write the function!
In R the function you define will have the following construction:
myFunction <- function(argument1, argument2,...){
body
}
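To make the template concrete before applying it to the rescaling problem, here is a minimal sketch of a different, deliberately simple function. The name convert_length and its arguments are hypothetical, chosen only to show where the name, the arguments, and the body go.

```r
# name           arguments (the second one has a default value)
convert_length <- function(x, divisor = 10){
  # body: the work the function does with its arguments
  x_converted <- x / divisor
  return(x_converted)
}

convert_length(c(250, 300))         # uses the default divisor: 25 30
convert_length(c(250, 300), 1000)   # overrides the default: 0.25 0.30
```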
Challenge 16
Using the template above, write a function named rescale that rescales a vector to be between 0 and 1. The function should take a single argument, x.
# Use the function template to create a function named rescale that performs the process outlined above, using the intermediate variables you previously defined
myfunction <- function(arg1,...){
# body
}
rescale <- function(x){
x_range <- range(x, na.rm = TRUE)
x_rescaled <- (x - x_range[1]) / (x_range[2] - x_range[1])
return(x_rescaled)
}
Once you have written your function, the next step is to test it out. The simplest place to start is to set value(s) for the function’s argument(s) (e.g., x <- c(1, 2, 3, 4, 5)). You can then run the code inside the function, line by line, to make sure that each line successfully executes for this variable. This is what we call unit testing in data science. This is a similar process to testing out a for loop, by setting the value of the index (i <- 1) and running the inside of the loop.
Challenge 17
Unit test your function on a simple vector, with the same name as your function’s input. What do you expect the values of the test to be after you input it into your function?
# Copy the body code from the function to unit test the code in your function on a simple vector named x <- c(1, 2, 3, 4, 5)
x <- c(1, 2, 3, 4, 5)
x_range <- range(x, na.rm = TRUE)
x_rescaled <- (x - x_range[1]) / (x_range[2] - x_range[1])
x_rescaled
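Once the line-by-line run looks right, the same check can be written so that R compares the result to the expected values for you. The sketch below repeats the rescale function from Challenge 16 so that it runs on its own, and uses base R's stopifnot() and all.equal(); the two extra inputs are made-up edge cases worth knowing about.

```r
rescale <- function(x){
  x_range <- range(x, na.rm = TRUE)
  x_rescaled <- (x - x_range[1]) / (x_range[2] - x_range[1])
  return(x_rescaled)
}

# Five evenly spaced values should map to 0, 0.25, 0.5, 0.75, 1
x <- c(1, 2, 3, 4, 5)
expected <- c(0, 0.25, 0.5, 0.75, 1)
stopifnot(isTRUE(all.equal(rescale(x), expected)))  # silent when the check passes

# Edge cases
rescale(c(1, NA, 5))   # 0 NA 1: missing values stay missing
rescale(c(2, 2, 2))    # NaN NaN NaN: a constant vector gives 0 / 0
```

The constant-vector case is not an error in the code, but it is a reminder to look at the output before using it further.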
If your unit test was successful, now you can move on to testing your function on the test data and then with your real data.
Challenge 18
Test your function on the length column of the BlackfootFish dataset, saving the results as BlackfootFish$length_scaled. Inspect the rescaled values and run summary() on the new variable. Do they look correct?
# Test your function out on the BlackfootFish data's length variable
BlackfootFish$length_scaled <- rescale(BlackfootFish$length)
summary(BlackfootFish$length_scaled)
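With a working function, the copy-and-paste problem from the start of this section can be handled with a short for loop instead. The sketch below assumes a made-up toy_fish data frame in place of BlackfootFish so that it runs on its own; the column names and the "_scaled" suffix are illustrative choices.

```r
rescale <- function(x){
  x_range <- range(x, na.rm = TRUE)
  x_rescaled <- (x - x_range[1]) / (x_range[2] - x_range[1])
  return(x_rescaled)
}

# Toy data, invented for illustration only
toy_fish <- data.frame(species = c("A", "B", "C"),
                       length  = c(100, 200, 300),
                       weight  = c(10, NA, 90))

# Rescale each listed column and store the result in a new column
for (col in c("length", "weight")) {
  toy_fish[[paste0(col, "_scaled")]] <- rescale(toy_fish[[col]])
}

toy_fish
```

Because the calculation lives in one place, a mistake like mixing up length and weight in the denominator can no longer happen column by column.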
Process of Creating a Function
You should never start writing a function with the function template. Instead, you start with a problem that you need to solve and work through the following steps:
1. Define the problem you need to solve.
2. Get a working snippet of code that solves the simple problem. (Where you start when you’re copying and pasting)
3. Rewrite the code snippet to use temporary variables.
4. Rewrite the code for clarity (remove redundancy or multiple calculations of the same value).
5. Turn everything into a function! The code snippet forms the body of the function, the temporary variables are the arguments, and you choose the name of your function.
6. Test your function! Start with a simple example and then go big!
Challenge 19 (Part 1)
Putting it All Together! Start by creating a data subset with just the species, length, and weight variables from the BlackfootFish dataset. Call the data subset BlackfootFish_subset.
BlackfootFish_subset <- data.frame(species = BlackfootFish$species, length = BlackfootFish$length, weight = BlackfootFish$weight)
Challenge 19 (Part 2)
Now, given the data frame of species, lengths, and weights above, use the tools from the workshop (for loop, function, conditional, and/or relational statements) to:
- compute Fulton’s condition factor of each fish (\(condition = \frac{100000 \ast weight}{length^3}\)), setting the condition factor to NA for any fish with a weight or length of 0, since those are obvious data entry errors.
- then remove the fish from the dataset whose condition index is NA or more than 2 (values larger than 2 are not typical and might suggest further data entry errors).
- If you attended our data visualization workshop, try to make a plot of condition factors based on the fish species with jittered points and violins.
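# x = weight, y = length; a weight or length of 0 is treated as a data entry error
condition_index <- function(x, y) {
  # ifelse() returns the new vector, so assign its result rather than assigning inside it
  c_in <- ifelse(x == 0 | y == 0, NA, 100000 * (x / (y^3)))
  return(c_in)
}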
BlackfootFish_subset$condind <- condition_index(BlackfootFish_subset$weight, BlackfootFish_subset$length)
summary(BlackfootFish_subset$condind)
D1 <- subset(BlackfootFish_subset, !is.na(condind) & condind <= 2)
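library(ggplot2)
# Pass the data directly to ggplot(); the %>% pipe would require loading magrittr or dplyr as well
ggplot(D1, aes(x = species, y = condind)) +
  # Draw the violins first so they do not cover the jittered points
  geom_violin() +
  geom_jitter() +
  theme_bw()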
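A quick way to sanity-check a condition-factor calculation is to run it on a few made-up fish whose answers can be worked out by hand. The sketch below uses a stand-alone helper, hypothetically named fulton_condition, and invented weights and lengths; it is an illustration of the checking step, not a replacement for the solution above.

```r
# Hypothetical helper: Fulton's condition factor, NA when weight or length is 0
fulton_condition <- function(weight, len){
  ifelse(weight == 0 | len == 0, NA, 100000 * weight / len^3)
}

# Made-up fish: one typical, one with weight 0, one with length 0, one too large
toy_weight <- c(100, 0, 50, 900)
toy_length <- c(200, 150, 0, 300)

cond <- fulton_condition(toy_weight, toy_length)
cond    # 1.25 NA NA 3.33 (rounded)

# Same filter as in the solution: keep non-missing values of at most 2
keep <- !is.na(cond) & cond <= 2
keep    # TRUE FALSE FALSE FALSE
```

Only the first fish survives the filter: the second and third are flagged as data entry errors, and the fourth has a condition factor above 2.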
References
R for Data Science (Grolemund & Wickham, 2017)
- Functions: https://r4ds.had.co.nz/functions.html
- Iteration: https://r4ds.had.co.nz/iteration.html
Advanced R (Wickham, 2015)
- Subsetting: https://adv-r.hadley.nz/subsetting.html
- Control Flow: https://adv-r.hadley.nz/control-flow.html
- Functions: https://adv-r.hadley.nz/functions.html
- Conditions: https://adv-r.hadley.nz/conditions.html
- Environments: https://adv-r.hadley.nz/environments.html
Suggestions for your own work
The goal of this workshop was to teach you how to use relational operators, create conditional statements, and write loops and functions in R. The first workshop in our series contains more information on how to get started working in R using RStudio (see http://www.montana.edu/datascience/training/). The second workshop in our series contains information on how to write code to visualize data using ggplot2. The code chunks in this interactive document mimic the code chunks you can use on your own projects in RMarkdown, but you will need to download and install both R and RStudio on your own computer.
Montana State University R Workshops Team
These materials were adapted from materials generated by the Data Carpentries (https://datacarpentry.org/) and were originally developed at MSU by Dr. Allison Theobold. The workshop series is co-organized by the Montana State University Library, Department of Mathematical Sciences, and Statistical Consulting and Research Services (SCRS, https://www.montana.edu/statisticalconsulting/). SCRS is supported by Montana INBRE (National Institutes of Health, Institute of General Medical Sciences Grant Number P20GM103474). The workshops for 2021-2022 are supported by Faculty Excellence Grants from MSU’s Center for Faculty Excellence.
Research related to the development of these workshops appeared in:
- Allison S. Theobold, Stacey A. Hancock & Sara Mannheimer (2021) Designing Data Science Workshops for Data-Intensive Environmental Science Research, Journal of Statistics and Data Science Education, 29:sup1, S83-S94, DOI: 10.1080/10691898.2020.1854636
The workshops for 2023-2024 involve modifications of materials and are licensed CC-BY. This work is licensed under a Creative Commons Attribution 4.0 International License.
The workshops for 2023-2024 involve modifications of materials and are being taught by:
Greta Linse
- Greta Linse is the Interim Director of Statistical Consulting and Research Services (https://www.montana.edu/statisticalconsulting/) and the Project Manager for the Human Ecology Learning and Problem Solving (HELPS) Lab (https://helpslab.montana.edu). Greta has been teaching, documenting and working with statistical software including R and RStudio for over 10 years.
Sara Mannheimer
- Sara Mannheimer is an Associate Professor and Data Librarian at Montana State University, where she helps shape practices and theories for curation, publication, and preservation of data. Her research examines the social, ethical, and technical issues of a data-driven world. She is the project lead for the MSU Dataset Search, and she is working on a book about data curation to support responsible qualitative data reuse and big social research.
The materials have also been modified and improved by:
- Dr. Mark Greenwood
- Harley Clifton
- Eliot Liucci
- Dr. Allison Theobold