Best NFL Teams? - Verbum Data

This quick post was inspired by an Aaron Schatz tweet today which remarked on the quantitatively “best” team in the recent past:

He links to pro-football-reference.com. I took a gander and saw immediately that the website makes yards per play easily available.

As luck would have it, your author was lately preoccupied with just this statistic. 2018 NFL data I analyzed in the recent past show that a positive yards per play differential (at the game level) translates into a win about 70% of the time.

So why not look at the best teams of the last 30 years according to yards per play differential?!?

Here are the results (before the code):

How we got here…

First up are formalities of loading libraries and saving plot settings.

library(rvest)
library(tidyverse)
library(janitor)
library(extrafont)

# set theme

theme_set(
  theme_minimal(base_family = "Gill Sans MT") +
    theme(
      axis.line = element_line(color = "black"),
      axis.text = element_text(color = "black"),
      panel.grid.minor = element_blank(),
      plot.caption = element_text(hjust = 0)
    )
)

Now can proceed to the good stuff: web-scraping. To start the process, we will:

Use rvest::read_html to read in the relevant base website
Create our own links with each team’s abbreviation and the years from 1990-2019
Celebrate

# get team abbreviations

team_abbs <- read_html("https://www.pro-football-reference.com/teams/")

links <- paste0(
  "https://www.pro-football-reference.com",
  team_abbs %>%
    html_nodes("a") %>% 
    html_attr("href") %>% 
    .[grepl("teams\\/[a-z]{3}\\/$", .)] %>% 
    rep(2019-1990+1),
  sort(rep(1990:2019, 32)),
  ".htm"
)

Grab the data

We have the links. Now we need to write a function to take those links, scrape the data, and clean it up for us to evaluate.

# get best ypp differential w custom function

yppDiff <- possibly(
  
  function(link) {
  
  # read in html
  pg <- read_html(link)
  
  # extract stats of interest
  stats <- pg %>% 
    html_nodes("table") %>% 
    html_table() %>% 
    .[[1]]
  
  # fix names and clean up df
  names(stats) <- unlist(ifelse(names(stats) == "", stats[1, ], paste(names(stats), stats[1, ], sep = "_")))
  stats <- clean_names(stats)
  names(stats)[names(stats) %in% c("tot_yds_to_ply", "tot_yds_to_y_p")] <- c("plays", "ypp")
  stats <- stats[-1, ]
  
  # add in team name & year
  
  stats$team <- str_extract(link, "(?<=teams\\/).+?(?=\\/)")
  stats$year <- str_extract(link, "\\d{4}")
  stats <- stats %>% select(team, year, everything())

  return(stats)
  
  },
  otherwise = NA_character_
)

With the function written, we can pipe it into execution. Now, before anyone complains about my for loop, I will tell you that from the violence of historical web-scraping, the loop is advantageous over base::lapply or purrr::map in only one respect: if it fails, you easily keep your progress. Yes, purrr trickery can save you, but this is so much easier…I also like the facility with which I can add a “counter” to check progress (I deleted it so it wouldn’t show on the post).

# get data

hist_data <- list()

for(i in seq_along(links)){
  hist_data[[i]] <- yppDiff(links[i])
}

# drop na & bind rows

hist_clean <- hist_data[!is.na(hist_data)] %>% 
  bind_rows() %>% 
  as_tibble()

Now, you’ll no doubt already know, but your author is lazy. Thus, I needed a quick mapping from the teams’ url abbreviations to human names. Thus the following chunk.

# now get team mapping

team_links <- links[grepl("2019", links)]

team_name <- possibly(
  function(link) {
    team_pg <- read_html(link)
    name <- team_pg %>% 
      html_nodes("h1") %>% 
      html_text() %>% 
      str_remove_all("\n|\\s{3,}|\\d{4,}|Statistics & Players")
    
    df_out <- tibble(
      abbr = str_extract(link, "(?<=teams\\/).+?(?=\\/)"),
      name = name
    )
    
  },
  otherwise = NA_character_
)

team_lookup <- map_dfr(team_links, team_name)

YPP FTW

Now all the pieces can come together. We can positively determine the best teams of the past 30 years!

# find best ypp diff

ypp_rank <- hist_clean %>% 
  filter(player %in% c("Team Stats", "Opp. Stats"),
         year != "2019") %>% 
  select(team, year, player, ypp) %>% 
  pivot_wider(
    names_from = "player", 
    values_from = "ypp", 
    names_repair = "universal"
  ) %>% 
  mutate_at(vars(-team), list(as.numeric)) %>%
  mutate(ypp_diff = Team.Stats - Opp..Stats) %>% 
  arrange(-ypp_diff) %>% 
  slice(1:25) %>% 
  left_join(team_lookup, by = c(team = "abbr")) %>% 
  mutate(ypp_unique = paste0(name, " (", year, ")"))

And let’s plot.

ggplot(ypp_rank, aes(fct_reorder(ypp_unique, ypp_diff), ypp_diff)) +
  geom_col(show.legend = FALSE, aes(fill = team)) +
  coord_flip() +
  scale_y_continuous(expand = c(0,0)) +
  labs(x = "",
       y = "Yards Per Play Differential",
       title = "Best Teams in the Past 30 Years By Yards/Play Differential",
       caption = "Source: verbumdata.netlify.com\npro-football-reference.com")

The 2001 Rams of Kurt " don’t forget Jesus " fame top our list. What a fabulous team. Aaron’s 1996 Green Bay Packers arrive at 22nd on the list. Not too shabby, but not the best. Oh, and, PS, I dropped all 2019 data for sake of small sample.