I recently embarked on a web scraping project. The results will follow on this blog in the weeks to come. In the meantime, though, I want to share one little lesson I learned that may be helpful to others web scraping in R.
Imagine you have a vector of article links. You might start by scraping each with purrr and rvest. But say you wanted only to extract the body of the article (using the most probable HTML tag for such a thing across publishers). You’d probably run something like this code.
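library(purrr)  # map(), possibly()
library(rvest)  # read_html(), html_nodes(), html_text(), %>%
# news article dummy links
articles <- c("https://www.mylocalpaper.com/scandal-pg-1")  # add further article links here

# scraping function
get_article_text <- function(article) {
  read_html(article) %>%
    html_nodes("p") %>%  # grab every paragraph node on the page
    html_text()          # strip the html, keep the text
}

# purrrfect
article_text <- map(articles, ~ get_article_text(.x))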
And that wouldn’t be so bad. But you’d notice one particularly pesky inclusion across most newspaper articles: comments.
For my purposes, I had to strip out comments. Below is a function which allows the intrepid webscraper to do just that. For that matter, you could exclude any undesirable node the same way by swapping in its CSS selector. In case a link can’t be reached, you might wrap the whole thing in purrr::possibly and set otherwise to NA_character_, so a failed page returns NA instead of stopping the whole run.
# exclude comments with xml_remove
text_no_comment <- possibly(
  function(article) {
    art_html <- read_html(article)
    # drop the comment section from the parsed page (modifies art_html in place)
    xml2::xml_remove(art_html %>% html_nodes("#comments"))
    # with comments removed, now read body section as character & strip html
    art_html %>%
      html_nodes("p") %>%
      html_text()
  },
  otherwise = NA_character_
)

article_text <- map(articles, ~ text_no_comment(.x))
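To see what xml_remove is doing without hitting a live site, here is a minimal sketch on an inline HTML snippet. The page content and the "article" id are made up for illustration; only the "#comments" selector comes from the function above, and real publishers may well use a different id or class.

```r
library(rvest)  # html_nodes(), html_text(); also re-exports read_html() and %>%
library(xml2)   # xml_remove()

# Hypothetical page: one article paragraph plus a comment section
page <- read_html('
  <html><body>
    <div id="article"><p>Article body.</p></div>
    <div id="comments"><p>First!</p></div>
  </body></html>')

# Before removal, selecting "p" picks up the comment paragraph too
before <- page %>% html_nodes("p") %>% html_text()

# xml_remove() changes the parsed document in place,
# so there is no need to reassign `page`
xml_remove(html_nodes(page, "#comments"))

# After removal, only the article paragraph is left
after <- page %>% html_nodes("p") %>% html_text()

print(before)
print(after)
```

Because the removal happens in place, any later selection on the same object simply no longer sees the comment nodes. Swapping "#comments" for another selector should drop other unwanted sections the same way.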
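I hope this little trick helps those looking to clean up their webscraping functions. You could use purrr::map_chr instead of map, depending on your desired output; note that map_chr expects a single string per article, so the paragraphs would need to be collapsed first (for example with paste).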
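As a rough sketch of the map_chr variant, assuming you want one string per article: the inline pages, the missing file name, and the collapse_paragraphs helper below are all hypothetical stand-ins, used so the example runs without a network connection.

```r
library(purrr)  # map_chr(), possibly()
library(rvest)  # read_html(), html_nodes(), html_text(), %>%

# Hypothetical inputs: two inline pages and one path that does not exist
pages <- c(
  "<html><body><p>One.</p><p>Two.</p></body></html>",
  "<html><body><p>Three.</p></body></html>",
  "no-such-file.html"
)

# Hypothetical helper: collapse all paragraphs of a page into one string,
# returning NA when the page cannot be read
collapse_paragraphs <- possibly(
  function(page) {
    read_html(page) %>%
      html_nodes("p") %>%
      html_text() %>%
      paste(collapse = "\n")
  },
  otherwise = NA_character_
)

# map_chr() needs exactly one string per element, hence the paste() above
article_chr <- map_chr(pages, collapse_paragraphs)
print(article_chr)
```

The result is a plain character vector with one entry per input, which slots straight into a data frame column; the unreadable entry comes back as NA rather than stopping the run.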