The Problem
I recently encountered a problem where I needed to extract many individual regex matches from a .txt input file. With a regex problem at hand, I naturally wrote everything up with extensive help from stringr::str_extract.
My code executed slowly and I wondered why. A jaunt through profvis output showed that my many str_extract calls with complicated regexes were the culprit. I was left with no choice but to build a faster mousetrap. I give you … base_str_extract :).
The Solution
Without further ado, here is the solution.
base_str_extract <- function(txt, pattern, perl = TRUE, ...) {
  # locate the first match in each element (-1 signals no match)
  x <- regexpr(pattern, txt, perl = perl, ...)
  # convert non-matches to NA
  x[which(x == -1)] <- NA
  # overwrite the remaining positions with the matched text; regmatches()
  # drops NA/non-match elements, so the lengths line up exactly
  x[which(x != -1)] <- regmatches(txt, x)
  as.character(x)
}
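As a quick sanity check (a toy example, not part of the benchmark below), the function mirrors str_extract's contract: one result per input element, with NA where there is no match or the input itself is NA.
x <- c("permit 12-345-678", "no number here", NA)
base_str_extract(x, "\\d+-\\d+-\\d+")
## [1] "12-345-678" NA           NA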
This bad boy is indeed a bit faster (more on that later), roughly 2x faster than str_extract. The primary reason is that regexpr calls C internal functions directly and so rips. Meanwhile, str_extract calls down to its underlying stringi function and so lags. But even when calling stringi directly, the base solution wins! Interestingly, stringi is roughly as fast as stringr in this domain.
In the course of searching for solutions, I found the widely underimported stringb package! No surprise that Mr. Wickham was first to the punch on a translation of stringr to base R (full-throated and mature, especially compared to this pittance of a single function). I am pleased to say, though, that base_str_extract still wins (and handily)!
Another interesting side note: when first writing the function, I tried to assign the NA values in an ifelse statement…something like this:
ifelse(x == -1, NA, regmatches(txt, x))
However, this was insufficient to overcome the regmatches reality that non-matches are dropped: ifelse recycles the shorter match vector, so the total vector length was preserved but the matches landed out of order.
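To see the recycling go wrong, consider a toy example:
txt <- c("a1", "b", "c3")
x <- regexpr("\\d", txt)
regmatches(txt, x)
## [1] "1" "3"
ifelse(x == -1, NA, regmatches(txt, x))
## [1] "1" NA  "1"
The middle element is correctly NA, but the match for "c3" comes back as a recycled "1" instead of "3".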
Additionally, adding useNames = FALSE to which didn't help speed, and calling .Internal(regexpr()) increased speed by less than 1%; given the lack of safety, the version presented here seemed best. You'll note your lazy author has not bothered to extend this to extract all matches, but perhaps one day.
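If you want an all-matches version today, a minimal sketch (leaning on gregexpr, and untested against str_extract_all's every edge case) could be:
base_str_extract_all <- function(txt, pattern, perl = TRUE, ...) {
  # gregexpr() returns a list of match positions, one element per input;
  # regmatches() then yields a list of character vectors (character(0)
  # where there is no match)
  m <- gregexpr(pattern, txt, perl = perl, ...)
  regmatches(txt, m)
}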
Benchmarking
To demonstrate the speed, let's do a little experiment. We will pull in a free-text field from the City of Los Angeles Building Permits dataset.
Specifically, we will look at the “Work Description” field and extract, if it exists, the permit number to which the given permit is a supplement.
Because this dataset is a monster, I've used some Unix helpers to reduce the size of the call. The 29th column is “Work Description”, so here I filter for cases where it is != "".
library(data.table)
library(stringr)
library(stringi)
library(microbenchmark)
# devtools::install_github("hadley/stringb")
library(stringb)
l <- "https://data.lacity.org/api/views/nbyu-2ha9/rows.csv?accessType=DOWNLOAD"
labp <- fread(cmd = sprintf("curl -s %s | head -n 100000 | awk -F, '$29!=\"\"' ", l))
# load our function
base_str_extract <- function(txt, pattern, perl = TRUE, ...) {
  x <- regexpr(pattern, txt, perl = perl, ...)
  x[which(x == -1)] <- NA
  x[which(x != -1)] <- regmatches(txt, x)
  as.character(x)
}
# specify regex for sake of ease
re <- "(SUPPLEMENTA?L?.+?)(\\d+-\\d+-\\d+)"
# function to extract the permit number & compare approaches
supp_permit_no <- function(txt,
                           regex,
                           .fun = c("base_str_extract", "str_extract", "stri_extract", "stringb")) {
  # collapse the default vector to a single (validated) choice
  .fun <- match.arg(.fun)
  switch(
    .fun,
    base_str_extract = {
      cap <- base_str_extract(txt, regex, ignore.case = TRUE)
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    },
    str_extract = {
      cap <- stringr::str_extract(txt, stringr::regex(pattern = regex, ignore_case = TRUE))
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    },
    stri_extract = {
      cap <- stri_extract_first_regex(txt, pattern = regex, opts_regex = stri_opts_regex(case_insensitive = TRUE))
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    },
    stringb = {
      cap <- stringb::str_extract(txt, pattern = stringb::regex(regex, ignore_case = TRUE))
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    }
  )
}
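A quick spot check on a made-up work description (a hypothetical string, not drawn from the data) shows the extraction in action:
supp_permit_no("SUPPLEMENTAL TO PERMIT 16-010-12345 FOR REROOF", re, .fun = "base_str_extract")
## [1] "16-010-12345"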
# have a gander at the results
microbenchmark(
bse = supp_permit_no(labp$`Work Description`, regex = re, .fun = "base_str_extract"),
se = supp_permit_no(labp$`Work Description`, regex = re, .fun = "str_extract"),
si = supp_permit_no(labp$`Work Description`, regex = re, .fun = "stri_extract"),
sb = supp_permit_no(labp$`Work Description`, regex = re, .fun = "stringb")
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## bse 29.0739 29.8371 32.80006 30.28550 37.06175 46.7939 100
## se 62.0423 63.3626 68.13381 64.75335 66.61440 88.5607 100
## si 61.9773 63.2101 68.67736 64.14975 66.26750 93.5834 100
## sb 85.6714 89.7574 95.01516 91.46615 98.28690 142.5528 100
res <- lapply(
  c("base_str_extract", "str_extract", "stri_extract", "stringb"),
  function(f) supp_permit_no(labp$`Work Description`, regex = re, .fun = f)
)
# all.equal() only compares its first two arguments (the rest fall into ...),
# so check each implementation against the base version explicitly
all(vapply(res[-1], identical, logical(1), res[[1]]))
## [1] TRUE
There it is, folks! The vaunted 2x improvement. It isn't a monster gain, but for larger projects making many calls to extract text, one hopes your code benefits from adopting base_str_extract.
In the event you believe you can do better, or if I've missed edge cases, please hit me up on Twitter! I spent far too much time optimizing this, so I would be very keen to learn if I missed something.