Faster str_extract in R

The Problem

I recently ran into a problem where I needed to extract many individual regex matches from a .txt input file.

With a regex problem at hand, I naturally wrote everything up with extensive help from stringr::str_extract.

My code executed slowly and I wondered why. A jaunt through profvis output showed my many str_extract calls with complicated regexes were the culprit. I was left with no choice but to build a faster mousetrap.

I give you … base_str_extract :).

The Solution

Without further ado, here is the solution.

base_str_extract <- function(txt, pattern, perl = TRUE, ...) {
  # locate the first match in each element (-1 where there is no match)
  x <- regexpr(pattern, txt, perl = perl, ...)
  # turn the non-matches into NA ...
  x[which(x == -1)] <- NA
  # ... then fill the matched positions with the matched substrings;
  # regmatches() skips the NA elements, so the lengths line up
  x[which(x != -1)] <- regmatches(txt, x)
  as.character(x)
}
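As a quick sanity check, it behaves like str_extract on a couple of made-up strings (shown here purely for illustration): matched elements return the matched substring and non-matches return NA.

x <- c("see permit 21016-10000-03333", "no permit mentioned")
base_str_extract(x, "\\d+-\\d+-\\d+")        # "21016-10000-03333" NA
stringr::str_extract(x, "\\d+-\\d+-\\d+")    # "21016-10000-03333" NA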

This bad boy is indeed a bit faster (more on that later), roughly 2x faster than str_extract. The primary reason is that regexpr calls C internals directly and so rips. Meanwhile, str_extract goes through its underlying stringi function and so lags. But even when calling stringi directly, the base solution wins! Interestingly, stringi is roughly as fast as stringr in this domain.

In the course of searching for solutions, I also stumbled on the widely under-imported stringb package!! No surprise that Mr. Wickham was first to the punch with a (full-throated and mature, especially compared to this pittance of a single function) translation of stringr to base R. I am pleased to say, though, that base_str_extract still wins (and handily)!

Another interesting side note: when first writing the function, I tried to assign the NA values in an ifelse statement, something like this:

ifelse(x == -1, NA, regmatches(txt, x))

However, this wasn't enough to get around the fact that regmatches drops non-matching elements entirely: the total vector length was preserved, but ifelse recycled the shorter vector of matches, so the matches no longer lined up with the elements they came from. Additionally, adding useNames = FALSE to the which() calls didn't help speed, and calling .Internal(regexpr()) directly increased speed by less than 1%; given the lack of safety, the version presented here seemed best. You'll note your lazy author has not bothered to extend this to extract all matches (a str_extract_all analogue), but perhaps one day.
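To make the recycling problem concrete, here is a toy illustration (the input vector below is made up purely for demonstration):

txt <- c("no match", "id 12", "still nothing", "id 34")
m <- regexpr("\\d+", txt)

regmatches(txt, m)
# "12" "34" -- the non-matching elements are dropped entirely

ifelse(m == -1, NA, regmatches(txt, m))
# NA "34" NA "34" -- the length is right, but the two matches get
# recycled, so "12" is lost and "34" lands in the wrong slot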

Benchmarking

To demonstrate the speed, let's do a little experiment. We will pull in a free-text field from the Los Angeles Building Permits dataset.

Specifically, we will look at the “Work Description” field and extract, if it exists, the permit number to which the given permit is a supplement.

Because this dataset is a monster, I've used some Unix helpers to reduce the size of the call. The 29th column is "Work Description", so here I filter for rows where it is not empty (!= "").

library(data.table)
library(stringr)
library(stringi)
library(microbenchmark)
# devtools::install_github("hadley/stringb")
library(stringb)

l <- "https://data.lacity.org/api/views/nbyu-2ha9/rows.csv?accessType=DOWNLOAD"

# stream the csv, keep only the first 100,000 lines, and drop rows
# where the 29th field ("Work Description") is empty
labp <- fread(cmd = sprintf("curl -s %s | head -n 100000 | awk -F, '$29!=\"\"' ", l))

# load our function

base_str_extract <- function(txt, pattern, perl = TRUE, ...) {
  x <- regexpr(pattern, txt, perl = perl, ...)
  x[which(x == -1)] <- NA
  x[which(x != -1)] <- regmatches(txt, x)
  as.character(x)
}

# specify regex for sake of ease

re <- "(SUPPLEMENTA?L?.+?)(\\d+-\\d+-\\d+)"
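# quick illustration on a made-up string (not drawn from the dataset):
# group 1 grabs the "SUPPLEMENTAL ..." preamble, group 2 the permit number
ex <- "SUPPLEMENTAL TO PERMIT NO. 16016-10000-12345"
base_str_extract(ex, re, ignore.case = TRUE)
# "SUPPLEMENTAL TO PERMIT NO. 16016-10000-12345"
gsub(re, "\\2", base_str_extract(ex, re, ignore.case = TRUE), ignore.case = TRUE)
# "16016-10000-12345"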

# function to extract the permit number & compare approaches

supp_permit_no <- function(txt, 
                           regex, 
                           .fun = c("base_str_extract", "str_extract", "stri_extract", "stringb")) {
  
  # validate/resolve the chosen backend
  .fun <- match.arg(.fun)
  
  switch(
    .fun,
    base_str_extract = {
      cap <- base_str_extract(txt, regex, ignore.case = TRUE)
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    },
    str_extract = {
      cap <- stringr::str_extract(txt, stringr::regex(pattern = regex, ignore_case = TRUE))
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    },
    stri_extract = {
      cap <- stri_extract_first_regex(txt, pattern = regex, opts_regex = stri_opts_regex(case_insensitive = TRUE))
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    },
    stringb = {
      cap <- stringb::str_extract(txt, pattern = stringb::regex(regex, ignore_case = TRUE))
      gsub(regex, "\\2", cap, ignore.case = TRUE)
    }
  )
}

# have a gander at the results

microbenchmark(
  bse = supp_permit_no(labp$`Work Description`, regex = re, .fun = "base_str_extract"),
  se = supp_permit_no(labp$`Work Description`, regex = re, .fun = "str_extract"),
  si = supp_permit_no(labp$`Work Description`, regex = re, .fun = "stri_extract"),
  sb = supp_permit_no(labp$`Work Description`, regex = re, .fun = "stringb")
)
## Unit: milliseconds
##  expr     min      lq     mean   median       uq      max neval
##   bse 29.0739 29.8371 32.80006 30.28550 37.06175  46.7939   100
##    se 62.0423 63.3626 68.13381 64.75335 66.61440  88.5607   100
##    si 61.9773 63.2101 68.67736 64.14975 66.26750  93.5834   100
##    sb 85.6714 89.7574 95.01516 91.46615 98.28690 142.5528   100
outs <- lapply(
  c("base_str_extract", "str_extract", "stri_extract", "stringb"),
  function(f) supp_permit_no(labp$`Work Description`, regex = re, .fun = f)
)
all(vapply(outs[-1], function(o) isTRUE(all.equal(o, outs[[1]])), logical(1)))
## [1] TRUE

There it is folks! The vaunted 2x improvement. It isn’t a monster gain, but for larger projects making many calls to extract text, one hopes your code benefits from adopting base_str_extract.

In the event you believe you can do better, or if I've missed edge cases, please hit me up on Twitter! I spent far too much time optimizing this and would be very keen to learn if I missed something.