Will Hipson, PhD

Musings on data science, statistics, and programming

Why I love dplyr's across

Very often I find myself in a situation where I need to perform the same operation on multiple columns in a data set. For this, I turn to none other than dplyr’s across function. But as we’ll see, not only does across help when we are interactively wrangling data, it also operates seamlessly within R functions. Here, I’ll showcase a few simple use cases for across.

How to use across

Let’s look at the most basic usage of across. For this post I’ll use the animal crossing items data set featured on TidyTuesday week 19 of 2020. We get the packages we

library(dplyr) # for across and other data wrangling functions
library(readr) # for read_csv

ac_items <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/items.csv')

ac_items
## # A tibble: 4,565 x 16
##    num_id id       name    category orderable sell_value sell_currency buy_value
##     <dbl> <chr>    <chr>   <chr>    <lgl>          <dbl> <chr>             <dbl>
##  1     12 3d-glas… 3D Gla… Accesso… NA               122 bells               490
##  2     14 a-tee    A Tee   Tops     NA               140 bells               560
##  3     17 abstrac… Abstra… Wallpap… TRUE             390 bells              1560
##  4     19 academy… Academ… Dresses  NA               520 bells              2080
##  5     20 acantho… Acanth… Fossils  FALSE           2000 bells                NA
##  6     21 accesso… Access… Furnitu… TRUE             375 bells              1500
##  7     23 acid-wa… Acid-w… Tops     TRUE             420 bells              1680
##  8     24 acid-wa… Acid-w… Bottoms  TRUE             330 bells              1320
##  9     26 acnh-ni… Acnh N… Furnitu… TRUE            8990 bells             35960
## 10     26 acnh-ni… Acnh N… Furnitu… TRUE            8990 bells             35960
## # … with 4,555 more rows, and 8 more variables: buy_currency <chr>,
## #   sources <chr>, customizable <lgl>, recipe <dbl>, recipe_id <chr>,
## #   games_id <chr>, id_full <chr>, image_url <chr>

There are two columns related to currency: sell_value and buy_value. I want to quickly get the mean of these columns for each category. First, here’s how I might do this without across:

ac_items %>%
  group_by(category) %>%
  summarise(sell_value = mean(sell_value, na.rm = TRUE),
            buy_value = mean(buy_value, na.rm = TRUE))
## # A tibble: 21 x 3
##    category    sell_value buy_value
##    <chr>            <dbl>     <dbl>
##  1 Accessories       637.     1912.
##  2 Bottoms           298.     1198.
##  3 Bugs             2220.    10511.
##  4 Dresses          1114.     3804.
##  5 Fish             3809.      NaN 
##  6 Flooring         1275.     3833.
##  7 Flowers           121.      160 
##  8 Fossils          3503.      NaN 
##  9 Fruit             125       400 
## 10 Furniture        5387.    20256.
## # … with 11 more rows

Which works fine. But imagine if instead of two columns there were 10 or 20 or 100! It would quickly get tedious to add a new line for each column. Here’s where across comes in:

ac_items %>%
  group_by(category) %>%
  summarise(across(c(sell_value, buy_value), mean, na.rm = TRUE))
## # A tibble: 21 x 3
##    category    sell_value buy_value
##    <chr>            <dbl>     <dbl>
##  1 Accessories       637.     1912.
##  2 Bottoms           298.     1198.
##  3 Bugs             2220.    10511.
##  4 Dresses          1114.     3804.
##  5 Fish             3809.      NaN 
##  6 Flooring         1275.     3833.
##  7 Flowers           121.      160 
##  8 Fossils          3503.      NaN 
##  9 Fruit             125       400 
## 10 Furniture        5387.    20256.
## # … with 11 more rows

Much more efficient. We give across a vector of column names followed by the function (in this case mean) followed by any other arguments we want to apply to the function.

Using ~ and .x

If we want to provide a lambda function, we use what’s called purrr syntax. In this example, let’s say we want to divide each price by the maximum price in each category. We’ll use ~ to indicate that we’re supplying a lambda function and use .x to indicate where the variable in across is used.

ac_items %>%
  group_by(category) %>%
  mutate(across(c(sell_value, buy_value), ~ .x / max(.x, na.rm = TRUE),
                .names = "{col}_prop")) %>%
  select(category, ends_with("prop")) # to show new cols
## # A tibble: 4,565 x 3
## # Groups:   category [21]
##    category    sell_value_prop buy_value_prop
##    <chr>                 <dbl>          <dbl>
##  1 Accessories         0.0113         0.0113 
##  2 Tops                0.035          0.0833 
##  3 Wallpaper           0.00886        0.00886
##  4 Dresses             0.0065         0.0065 
##  5 Fossils             0.333         NA      
##  6 Furniture           0.0015         0.00469
##  7 Tops                0.105          0.25   
##  8 Bottoms             0.307          0.307  
##  9 Furniture           0.0360         0.112  
## 10 Furniture           0.0360         0.112  
## # … with 4,555 more rows

Changing column names

What if we want to change the names of our new columns? We can do this using the .names argument.

ac_items %>%
  group_by(category) %>%
  summarise(across(c(sell_value, buy_value), mean, na.rm = TRUE, .names = "{col}_mean"))
## # A tibble: 21 x 3
##    category    sell_value_mean buy_value_mean
##    <chr>                 <dbl>          <dbl>
##  1 Accessories            637.          1912.
##  2 Bottoms                298.          1198.
##  3 Bugs                  2220.         10511.
##  4 Dresses               1114.          3804.
##  5 Fish                  3809.           NaN 
##  6 Flooring              1275.          3833.
##  7 Flowers                121.           160 
##  8 Fossils               3503.           NaN 
##  9 Fruit                  125            400 
## 10 Furniture             5387.         20256.
## # … with 11 more rows

.names uses glue syntax, whereby anything inside the curly braces is a variable. In the case, the name of each column we provide is substituted in {col}.

Multiple functions

What if we want not just the mean, but the standard deviation? across can take multiple functions as a list. See here:

ac_items %>%
  group_by(category) %>%
  summarise(across(c(sell_value, buy_value),
                   list(mean = mean, sd = sd), na.rm = TRUE, .names = "{col}_{fn}"))
## # A tibble: 21 x 5
##    category    sell_value_mean sell_value_sd buy_value_mean buy_value_sd
##    <chr>                 <dbl>         <dbl>          <dbl>        <dbl>
##  1 Accessories            637.        1192.           1912.        4384.
##  2 Bottoms                298.         116.           1198.         463.
##  3 Bugs                  2220.        3209.          10511.       14240.
##  4 Dresses               1114.        5247.           3804.       21159.
##  5 Fish                  3809.        4586.            NaN           NA 
##  6 Flooring              1275.        3506.           3833.       10975.
##  7 Flowers                121.         188.            160            0 
##  8 Fossils               3503.        1413.            NaN           NA 
##  9 Fruit                  125           58.4           400            0 
## 10 Furniture             5387.       17604.          20256.       40411.
## # … with 11 more rows

Match column names with tidyselect

If we have lots of columns to operate over, it can be cumbersome to spell out each name. We can leverage tidyselect helpers to match columns by name or type. Continuing with our example, let’s again calculate the mean sell and buy value by category, but we’ll use contains to fetch columns containing value.

ac_items %>%
  group_by(category) %>%
  summarise(across(contains("value"), mean, na.rm = TRUE, .names = "{col}_"))
## # A tibble: 21 x 3
##    category    sell_value_ buy_value_
##    <chr>             <dbl>      <dbl>
##  1 Accessories        637.      1912.
##  2 Bottoms            298.      1198.
##  3 Bugs              2220.     10511.
##  4 Dresses           1114.      3804.
##  5 Fish              3809.       NaN 
##  6 Flooring          1275.      3833.
##  7 Flowers            121.       160 
##  8 Fossils           3503.       NaN 
##  9 Fruit              125        400 
## 10 Furniture         5387.     20256.
## # … with 11 more rows

Similarly, if we wanted to do this for any numeric column in the data, we use where(is.numeric).

ac_items %>%
  group_by(category) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE, .names = "{col}_"))
## # A tibble: 21 x 5
##    category    num_id_ sell_value_ buy_value_ recipe_
##    <chr>         <dbl>       <dbl>      <dbl>   <dbl>
##  1 Accessories   3520.        637.      1912.    4.25
##  2 Bottoms       3541.        298.      1198.    7   
##  3 Bugs          3592.       2220.     10511.  NaN   
##  4 Dresses       3583.       1114.      3804.    4.47
##  5 Fish          3024.       3809.       NaN   NaN   
##  6 Flooring      3799.       1275.      3833.    4.41
##  7 Flowers       4946.        121.       160   NaN   
##  8 Fossils       3468.       3503.       NaN   NaN   
##  9 Fruit         2809.        125        400   NaN   
## 10 Furniture     3686.       5387.     20256.    5.01
## # … with 11 more rows

Using across programmatically

The above examples highlight why across is so useful in day-to-day analytic work flows. But the real reason I love across is so it makes programming with dplyr so much easier. Here we’re going to quite literally embrace across - and by ‘embrace’ I mean use {{}}.

In this example, we’ll create a function that asks the user to supply any number of numeric columns in their data, and the function will calculate the mean, standard deviation, and 0.05%-95% quantiles. We’ll also allow the user to supply a grouping variable if they want.

summarizer <- function(data, numeric_cols = NULL, ...) {
  data %>%
    group_by(...) %>%
    summarise(across({{numeric_cols}}, list(
      mean = ~mean(.x, na.rm = TRUE),
      sd = ~sd(.x, na.rm = TRUE),
      q05 = ~quantile(.x, 0.05, na.rm = TRUE),
      q95 = ~quantile(.x, 0.95, na.rm = TRUE)
    ), .names = "{col}_{fn}"))
}

Now we test the function.

summarizer(ac_items, numeric_cols = c(sell_value, buy_value),
           category)
## # A tibble: 21 x 9
##    category    sell_value_mean sell_value_sd sell_value_q05 sell_value_q95
##    <chr>                 <dbl>         <dbl>          <dbl>          <dbl>
##  1 Accessories            637.        1192.            122           2400 
##  2 Bottoms                298.         116.            175            455 
##  3 Bugs                  2220.        3209.            118.         10000 
##  4 Dresses               1114.        5247.            300           1600 
##  5 Fish                  3809.        4586.            180          15000 
##  6 Flooring              1275.        3506.            200           4800 
##  7 Flowers                121.         188.             40            240 
##  8 Fossils               3503.        1413.           1100           5500 
##  9 Fruit                  125           58.4           100            250 
## 10 Furniture             5387.       17604.            130          23775.
## # … with 11 more rows, and 4 more variables: buy_value_mean <dbl>,
## #   buy_value_sd <dbl>, buy_value_q05 <dbl>, buy_value_q95 <dbl>

There’s our quick look at across. I hope you can appreciate the versatility it offers. Not only does it cut back on typing, but it makes for a more principled approach to data wrangling and can make programming much easier. Also check out c_across which is useful for performing an operation that involves multiple columns.