There are two ways to chain data.table operations: data.table pipes and magrittr pipes. For a data.table dt, the data.table pipes take the form dt[][][]... and the magrittr pipes take the form dt %>% .[] %>% .[] %>% ...
Let's first compare the readability of the two pipes in the following examples. Hadley Wickham criticized the readability of data.table pipes in a Stack Overflow post. The data.table pipes, however, are not that hard to follow for anyone familiar with data.table. To my eyes, the magrittr pipes improve readability, but the data.table pipes are still acceptable. It is largely a matter of personal choice.
library(data.table)
library(magrittr)
# the data.table pipes
data.table(iris)[
  # add a new column "is_setosa": 1 if yes and 0 if no; two lines of code
  # are used to show the values clearly
  Species == "setosa", is_setosa := 1
][
  Species != "setosa", is_setosa := 0
][
  # change the Petal.Length of species "versicolor"; other species are not affected
  Species == "versicolor", Petal.Length := 999
][
  # calculate the sepal area when length > 5; the area is NA if length <= 5
  Sepal.Length > 5, Sepal.Area := Sepal.Length * Sepal.Width
][
  # select columns
  , .(Species, is_setosa, Petal.Length, Sepal.Area)
][
  # average of each column grouped by species
  , lapply(.SD, mean, na.rm = TRUE), by = Species
]
##       Species is_setosa Petal.Length Sepal.Area
## 1:     setosa         1        1.462   19.76364
## 2: versicolor         0      999.000   16.87340
## 3:  virginica         0        5.552   19.83633
# the magrittr pipes
data.table(iris) %>%
  # add a new column "is_setosa": 1 if yes and 0 if no; two lines of code
  # are used to show the values clearly
  .[Species == "setosa", is_setosa := 1] %>%
  .[Species != "setosa", is_setosa := 0] %>%
  # change the Petal.Length of species "versicolor"; other species are not affected
  .[Species == "versicolor", Petal.Length := 999] %>%
  # calculate the sepal area when length > 5; the area is NA if length <= 5
  .[Sepal.Length > 5, Sepal.Area := Sepal.Length * Sepal.Width] %>%
  # select columns
  .[, .(Species, is_setosa, Petal.Length, Sepal.Area)] %>%
  # average of each column grouped by species
  .[, lapply(.SD, mean, na.rm = TRUE), by = Species]
##       Species is_setosa Petal.Length Sepal.Area
## 1:     setosa         1        1.462   19.76364
## 2: versicolor         0      999.000   16.87340
## 3:  virginica         0        5.552   19.83633
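The two printed results above look the same, and this can also be checked programmatically. The following is a minimal sketch, not part of the original comparison: it runs a shortened version of the chain in both styles and compares the results with all.equal(). The names res_datatable, res_magrittr and same_result are made up for illustration.

```r
library(data.table)
library(magrittr)

# Shortened version of the chain, written with data.table pipes
res_datatable <- data.table(iris)[
  Species == "versicolor", Petal.Length := 999
][
  , .(Species, Petal.Length)
][
  , lapply(.SD, mean), by = Species
]

# The same steps written with magrittr pipes
res_magrittr <- data.table(iris) %>%
  .[Species == "versicolor", Petal.Length := 999] %>%
  .[, .(Species, Petal.Length)] %>%
  .[, lapply(.SD, mean), by = Species]

# all.equal() returns TRUE when the two tables hold the same data
same_result <- isTRUE(all.equal(res_datatable, res_magrittr))
print(same_result)
```

Each chain starts from its own data.table(iris), so the in-place update made by := in one chain cannot leak into the other.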
Does using the %>% pipe slow down the computation? In the following code, we add three new columns to a made-up data table, once with data.table pipes and once with magrittr pipes. The two run at almost the same speed.
set.seed(123)
dt <- data.table(a = sample(letters, 1e5, replace = TRUE),
                 b = abs(rnorm(1e5)))
datatable_pipe <- function(){
  dt[, x := sqrt(b)][
    , y := b^2
  ][
    , z := paste0(a, b)
  ]
}
magrittr_pipe <- function(){
  dt[, x := sqrt(b)] %>%
    .[, y := b^2] %>%
    .[, z := paste0(a, b)]
}
rbenchmark::benchmark(datatable_pipe(), magrittr_pipe(), replications = 20)
##               test replications elapsed relative user.self sys.self user.child sys.child
## 1 datatable_pipe()           20   3.883    1.059     3.869    0.012          0         0
## 2  magrittr_pipe()           20   3.667    1.000     3.657    0.008          0         0
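One detail of this benchmark is worth keeping in mind: := modifies dt by reference, so after the first call the columns x, y and z already exist and every later replication overwrites them. The sketch below is a hypothetical variant, not the benchmark used above, in which each call works on copy(dt) and therefore starts from the same two-column table. The function names ending in _copy are made up for illustration, and base R's system.time() is used only to keep the sketch free of extra dependencies. No timings are shown because they depend on the machine.

```r
library(data.table)
library(magrittr)

set.seed(123)
dt <- data.table(a = sample(letters, 1e5, replace = TRUE),
                 b = abs(rnorm(1e5)))

# Each function works on its own copy, so dt itself is never modified
datatable_pipe_copy <- function() {
  copy(dt)[, x := sqrt(b)][
    , y := b^2
  ][
    , z := paste0(a, b)
  ]
}

magrittr_pipe_copy <- function() {
  copy(dt) %>%
    .[, x := sqrt(b)] %>%
    .[, y := b^2] %>%
    .[, z := paste0(a, b)]
}

# Both styles should produce the same table
res_datatable <- datatable_pipe_copy()
res_magrittr <- magrittr_pipe_copy()
print(isTRUE(all.equal(res_datatable, res_magrittr)))

# Time a few replications of each style
time_datatable <- system.time(for (i in 1:5) datatable_pipe_copy())
time_magrittr <- system.time(for (i in 1:5) magrittr_pipe_copy())
print(time_datatable[["elapsed"]])
print(time_magrittr[["elapsed"]])
```

The copy adds the same overhead to both functions, so the comparison between the two styles stays fair.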