Using dplyr with variable column names

The group_by() function from the dplyr package is particularly helpful for viewing, counting, and mutating data based on a given column. Sometimes it’s useful to investigate the data based on multiple columns that each describe the data - ideally using a variable defined as each column of interest. In dplyr, we can leverage both base R and tidyeval to make this work.

To try it, we create a sample dataframe that pairs flower types and colors. We’ll randomly assign each of the plants as “affected” (1) or “unaffected” (0) by some condition.

# Set seed for reproducibility 
set.seed(123)

# Create lists of random data
flowers <- c("Rose", "Tulip", "Daisy", "Lily", "Sunflower")
colors <- c("Red", "Yellow", "White", "Pink", "Orange")

# Choose size of dataframe 
num_rows <- 100  

# Create dataframe with flower-color pairings 
# Note whether each is affected or not 
sample_data <- data.frame(
    flower = sample(flowers, num_rows, replace = TRUE),
    color = sample(colors, num_rows, replace = TRUE),
    affected = sample(0:1, num_rows, replace = TRUE)
)

If we inspect our data with head(sample_data), the console should display:

> head(sample_data)
     flower  color affected
1     Daisy Orange        1
2     Daisy  White        0
3     Tulip    Red        0
4     Tulip Yellow        0
5     Daisy    Red        0
6 Sunflower Yellow        1

There are multiple colors of each flower: the first two rows are daisies, but one is orange and one is white. There are also multiple flowers of each color: rows 4 and 6 include yellow flowers, one of which is a tulip and one of which is a sunflower. Some of these flowers are affected (1), while others are not (0).

We might be interested in how many flowers of each color are affected. To do this, we can use functions from the dplyr package to group by color and display how many of each color is affected:

# Install package if you haven't before
install.packages("dplyr")

# Load package once installed
library(dplyr)

# Count affected flowers, by color 
by_color <- sample_data %>%
    group_by(color) %>%
    summarise(num_affected = sum(affected == 1))

The by_color table includes the number of affected flowers per color:

> by_color
# A tibble: 5 × 2
  color  num_affected
  <chr>         <int>
1 Orange           10
2 Pink              4
3 Red               8
4 White             7
5 Yellow            9

10 orange flowers are affected, 4 pink, etc. A similar result is produced by investigating the flower column:

by_flower <- sample_data %>%
    group_by(flower) %>%
    summarise(num_affected = sum(affected == 1))

This time the number of affected plants is shown by type of flower (7 daisies were affected, 7 lillies, etc.):

> by_flower
# A tibble: 5 × 2
  flower    num_affected
  <chr>            <int>
1 Daisy                7
2 Lily                 7
3 Rose                 9
4 Sunflower            8
5 Tulip                7

If there are multiple columns of descriptive data, it can help to define a function that takes the column of interest as an argument:

# Define function where our arguments are the input data 
# And the column of interest
affected_by_column <- function(data, column_name) {
    output <- data %>% 
        group_by(column_name) %>%
        summarise(num_affected = sum(affected == 1))
}

# Call the function on a given column name
output <- affected_by_column(sample_data, "flower")

However, running the above yields an error:

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `column_name` is not found.

The function is treating the variable column_name as if it were the name of the actual column in the data. But there is no column called “column_name”!

To instead use the variable column_name (i.e., the string value that is assigned to it), you can redefine the function, placing column_name within curly brackets to use the tidyverse {{ operator:

affected_by_column <- function(data, column_name) {
    output <- data %>% 
        group_by({{column_name}}) %>%
        summarise(num_affected = sum(affected == 1))
}

Here, the group_by function executes on the value assigned to the variable column_name, rather than the string “column_name” itself. This syntax is a tidyeval helper; check out the details here.

We anticipate calling the function as usual (output <- affected_by_column(sample_data, "color")) will produce the same output as the first example above: 10 orange, 4 pink, etc. However, output instead contains:

> output
# A tibble: 1 × 2
  `"color"` num_affected
  <chr>            <int>
1 color               38

The function counted the affected flowers through all rows. We can confirm this by manually counting the number of affected flowers, without using our function: table(sample_data$affected) will show 62 total flowers not affected and 38 affected.

To instead make the function use the name of the column, we modify how we call our function, using as.name():

output <- affected_by_column(sample_data, !!as.name("color"))

The base R as.name() function creates a symbol named “color” to represent the variable name. The unquote operator, !!, captures the value of “color” and passes it to the function call, allowing it to be treated as a column name, rather than a string.

This will also work if we define our variable column_name before we call the function:

column_name <- "color"
output <- affected_by_column(sample_data, !!as.name(column_name))

Beyond manually defining the variable column_name, there are multiple options for identifying the column of interest. It can be passed to the script from the command line:

# Get command line arguments
args <- commandArgs(trailingOnly = TRUE)

# Assign column name as the first argument passed to the script
column_name <- args[1]

The column names can also be identified directly from the dataframe itself. In this example, we know that affected is the last column. To grab all column names except the last one:

cols_of_interest <- names(sample_data)[-length(names(sample_data))]

The function can then be called on each of the columns:

# Create empty list for results 
output_list <- list()

# Get number of affected flowers, by each column  
for (i in cols_of_interest) {
    output <- affected_by_column(sample_data, !!as.name(i))
    output_list[[i]] <- output
}

The output will include one tibble for each column in the original data:

> output_list
$flower
# A tibble: 5 × 2
  flower    num_affected
  <chr>            <int>
1 Daisy                7
2 Lily                 7
3 Rose                 9
4 Sunflower            8
5 Tulip                7

$color
# A tibble: 5 × 2
  color  num_affected
  <chr>         <int>
1 Orange           10
2 Pink              4
3 Red               8
4 White             7
5 Yellow            9

This combination of the tidyeval curly-curly {{ operator along with the !!as.name(variable) syntax is useful for considering multiple columns of data while minimizing code redundancy. Happy coding!