Ben Stenhaug
A friend of mine was recently complaining that R is slow. It’s true that R isn’t the fastest language, but it’s also true that the gap between someone’s slow R code and the optimal code in another language is usually mostly bridged by writing better R code.
Let’s take a look by diving into his situation and the code that he was finding frustratingly slow.
Part 1: The Example
My friend gave me the following description of his problem:
“One file is one line per student demographic data in the entire district. the other is a transcript file that’s one line per course taken by each student in the district over their entire time in the district. So there are ~7500 students, but the transcript file has ~325,000 entries. my goal is to make a dichotomous variable for each math (~850 individual classes) and ELA class (~50 individual classes) offered in the district, add them as columns to the demographic dataset and give a student a 1 or a 0 if they took each class.”
He also sent me his current code:
# note this function takes two data frames, demographics and transcript, as inputs which I don't have
get_student_courses_with_loops <- function(demographics, transcript){
math.courses <- unique(transcript$class_name[transcript$credit_type=="math"])
ela.courses <- unique(transcript$class_name[transcript$credit_type=="english"])
courses <- unlist(list(math.courses,ela.courses))
classes <- matrix(NA, length(demographics$student_number), length(courses))
colnames(classes) <- courses
for (i in 1:length(demographics$student_number)){
class.list <- transcript$class_name[transcript$student_number == demographics$student_number[i]]
for (j in 1:ncol(classes)){
classes[i,j] <- ifelse(sum(class.list == colnames(classes)[j]) >= 1, 1, 0)
}
}
credits.class <- cbind(demographics, classes) %>% as_tibble()
credits.class
}
Part 2: Let’s try to write faster code
The first thing I notice when looking at his code is that he approached the problem — with nested loops and the necessary logic — in a way that works well for other languagues, but isn’t the best for R.
In general, R is best if you can write vectorized code as opposed to loops. Vectorized code means that the operation happens in parallel element-wise. For example, c(1, 2) + c(3, 4) = c(4, 6) — you can read more about vectorization here.
It’s also beneficial to keep data in rectangular data frames as opposed to more complicated nested structures. This is because rectangular data frames are easy to reason about (it’s just like a spreadsheet!), it allows for vectorization because each column of a data frame is a vector, and the beloved tidyverse packages are designed to work with rectangular data frames.
Creating small example data
I don’t have access to his data. So I create the simplest possible data with the necessary characteristics that I can play around with when I write code. This is similar to the idea of a reprex:
library(tidyverse)
demographics <-
tribble(
~student_number, ~age,
1, 17,
2, 18,
3, 17
)
transcript <-
tribble(
~student_number, ~class_name, ~credit_type,
1, "algebra", "math",
1, "poetry", "english",
2, "poetry", "english",
3, "algebra", "math"
)
Writing a tidy function
Now I can work to solve the problem. I’ll first write code that works, then use that code to make a function.
My approach is to first make a single variable for class which is the class name (algebra for example) followed by the credit type (math or english). For example, an algebra class which is a math credit turns into algebra_math. I also add an indicator column with every value of 1 to encode that the student took that class:
transcript_single_class_var <-
transcript %>%
mutate(
class = paste0(class_name, "_", credit_type),
indicator = 1
) %>%
print() # this is a little trick to both print and assign results
## # A tibble: 4 x 5
## student_number class_name credit_type class indicator
## <dbl> <chr> <chr> <chr> <dbl>
## 1 1 algebra math algebra_math 1
## 2 1 poetry english poetry_english 1
## 3 2 poetry english poetry_english 1
## 4 3 algebra math algebra_math 1
Now I select just the columns I need and spread the data to be “wide” instead of “long” so that each class is its own column. You can read more about the spread function here. I use fill = 0 to encode that students who didn’t have an indicator of 1 for a class did not take that class.
transcript_wide <-
transcript_single_class_var %>%
select(student_number, class, indicator) %>%
spread(class, indicator, fill = 0) %>%
print()
## # A tibble: 3 x 3
## student_number algebra_math poetry_english
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 2 0 1
## 3 3 1 0
Now that transcript_wide
has a row for each student_number
, all that’s left to do is to join it with demographics
. You can read more about joins in R here.
demographics %>% left_join(transcript_wide, by = "student_number")
## # A tibble: 3 x 4
## student_number age algebra_math poetry_english
## <dbl> <dbl> <dbl> <dbl>
## 1 1 17 1 1
## 2 2 18 0 1
## 3 3 17 1 0
Great! The code works! Now let’s use the code to create a get_student_courses_tidy
function.
get_student_courses_tidy <- function(demographics, transcript){
transcript_single_class_var <-
transcript %>%
mutate(
class = paste0(class_name, "_", credit_type),
indicator = 1
)
transcript_wide <-
transcript_single_class_var %>%
select(student_number, class, indicator) %>%
spread(class, indicator, fill = 0)
demographics %>% left_join(transcript_wide, by = "student_number")
}
Part 3: Comparing speeds
Creating large example data
My friend’s data has about 7,500 students and 325,000 transcript entries across about 850 math classes and 50 english classes. I’ll make example data with only 2 classes, but I’ll crank the number of students up to 150,000 to help see the speed difference.
n_students <- 150000
class_combinations <- list("none", "algebra", "poetry", c("algebra", "poetry"))
demographics <-
data_frame(
student_number = 1:n_students,
age = sample(17:18, size = n_students, replace = TRUE)
)
transcript <-
data_frame(
student_number = 1:n_students,
class_combination = sample(1:4, size = n_students, replace = TRUE)
) %>%
mutate(
class_name = class_combinations[class_combination]
) %>%
unnest() %>%
select(-class_combination) %>%
filter(class_name != "none") %>%
mutate(
credit_type = class_name %>% recode("algebra" = "math", "poetry" = "english")
)
Timing the code
I use the tictoc
package to compare the speeds. Indeed, the tidy function is significantly faster. And at larger scales, the difference probably gets even bigger. For tidy methods, more data often doesn’t add much time. However, for looping methods, more data might require many, many more loops which could take a while.
# the looping function takes a while
tictoc::tic()
get_student_courses_with_loops(demographics, transcript)
## # A tibble: 150,000 x 4
## student_number age algebra poetry
## <int> <int> <dbl> <dbl>
## 1 1 18 1 0
## 2 2 17 1 1
## 3 3 18 0 0
## 4 4 18 0 0
## 5 5 17 0 1
## 6 6 18 1 0
## 7 7 17 0 0
## 8 8 17 0 1
## 9 9 17 1 0
## 10 10 17 0 0
## # ... with 149,990 more rows
tictoc::toc()
## 465.395 sec elapsed
# the tidy function is much faster!
tictoc::tic()
get_student_courses_tidy(demographics, transcript)
## # A tibble: 150,000 x 4
## student_number age algebra_math poetry_english
## <int> <int> <dbl> <dbl>
## 1 1 18 1 0
## 2 2 17 1 1
## 3 3 18 NA NA
## 4 4 18 NA NA
## 5 5 17 0 1
## 6 6 18 1 0
## 7 7 17 NA NA
## 8 8 17 0 1
## 9 9 17 1 0
## 10 10 17 NA NA
## # ... with 149,990 more rows
tictoc::toc()
## 0.321 sec elapsed
If you’d like to get better at doing data science in R, the book R for Data Science is a great resource. And if you’re at Stanford, you can always stop by walk-in consulting.