NCAA football model building part I

With the 2018-2019 NCAA football season almost upon us, the time is ripe for model building. This post is part one of $n$ posts describing the process. The goal of the process is to build a dynamically updating model for the analysis of college football game spreads.

This is part I: importing data for building and backtesting the model.

Building links for webscraping NCAA data

The input data will come from two sources: the NCAA website and Football Outsiders. We will programmatically create links to the pages we’d like to scrape. With any luck, these links will remain valid as the underlying pages are updated throughout the 2018-2019 season.

library(tidyverse)
library(data.table)
library(rvest)

# create links for NCAA team data

divs <- c("fbs","fcs")
ncaa_link_root <- "https://www.ncaa.com/stats/football/"
ncaa_dir <- "/current/team/"
pgs <- c("/p1", "/p2", "/p3")
ncaa_cat <- data.table(num = c("25", "23", "695", "457", "24"),
                       category = c("pass_o","run_o","pass_d_yds","pass_d_int","run_d"))

ncaa_links <- map(seq_along(ncaa_cat$num), 
                      ~ paste0(rep(ncaa_link_root, 6),
                               rep(divs, 3),
                               rep(ncaa_dir, 6),
                               ncaa_cat$num[.x],
                               rep(pgs, 2))) %>% 
  unlist()

# create links for Football Outsiders data

fo_links <- c("https://www.footballoutsiders.com/stats/ncaaoff",
              "https://www.footballoutsiders.com/stats/ncaadef")
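The rep() calls above pair divisions with pages by recycling, which covers every division/page combination only because the two vector lengths (2 and 3) share no common factor. As a sanity check, here is a minimal sketch that enumerates the combinations explicitly with expand.grid() and compares the result to the recycled version. The names cat_nums, links_grid, and links_rep are ours for illustration and are not used elsewhere in the post.

```r
# Same building blocks as above, redefined so this chunk runs on its own
divs <- c("fbs", "fcs")
pgs <- c("/p1", "/p2", "/p3")
cat_nums <- c("25", "23", "695", "457", "24")
ncaa_link_root <- "https://www.ncaa.com/stats/football/"
ncaa_dir <- "/current/team/"

# Explicit cross product: one row per (page, division, category) combination
combos <- expand.grid(pg = pgs, div = divs, num = cat_nums,
                      stringsAsFactors = FALSE)
links_grid <- with(combos, paste0(ncaa_link_root, div, ncaa_dir, num, pg))

# Recycling-based construction, as in the post, written with base lapply()
links_rep <- unlist(lapply(cat_nums, function(num) {
  paste0(ncaa_link_root, rep(divs, 3), ncaa_dir, num, rep(pgs, 2))
}))

length(links_grid)               # 5 categories x 2 divisions x 3 pages
anyDuplicated(links_grid)        # 0 means no repeated links
setequal(links_grid, links_rep)  # same set of links, possibly in another order
```

If a division ever needed a fourth page, the explicit grid would keep working unchanged, whereas the recycled version would need its rep() counts reworked.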

Writing the scraping functions and storing the data

The links are built. We will now make use of the handy purrr and rvest tidyverse packages to scrape the data from the links.
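Before pointing the scraper at live pages, the table-extraction and name-cleaning steps can be tried offline. The sketch below parses a small, made-up HTML string (toy_html is hypothetical and much simpler than a real stats page) and applies the same cleaning rules used in the import that follows.

```r
library(rvest)

# Hypothetical HTML standing in for a stats page; teams and numbers are made up
toy_html <- '<html><body><table>
  <tr><th>Rank</th><th>Team</th><th>G</th><th>Yds/G</th></tr>
  <tr><td>1</td><td>Team A</td><td>13</td><td>389.2</td></tr>
  <tr><td>2</td><td>Team B</td><td>13</td><td>366.8</td></tr>
</table></body></html>'

# read_html() accepts a literal HTML string as well as a URL
tables <- read_html(toy_html) %>%
  html_nodes("table") %>%
  html_table()

toy_table <- tables[[1]] # html_table() returns one data frame per table

# Same cleaning rules as the import: lower case, spaces to "_", "/" to "_per_"
names(toy_table) <- gsub("\\/", "_per_", gsub(" ", "_", tolower(names(toy_table))))

names(toy_table)
```

Working against a string like this makes it easy to check the cleaning rules without depending on the network or on the current layout of the site.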

# import NCAA data

ncaa_import <- map_df(seq_along(ncaa_links), ~
                        read_html(ncaa_links[.x]) %>% 
                        html_nodes("table") %>% 
                        html_table() %>% 
                        .[[1]] %>% # select first list element
                        setNames(gsub(" ", "_", tolower(names(.)))) %>% # clean names
                        setNames(gsub("\\/", "_per_", names(.))) %>% 
                        mutate(div = str_extract(ncaa_links[.x], "fbs|fcs")) %>% # code division
                        mutate(div = if_else(div == "fbs", 1, 1.5)) %>% 
                        mutate(team = tolower(team)) %>% 
                        mutate(category = str_extract(ncaa_links[.x], "[0-9]+")) %>% # code category
                        select(-rank) %>% 
                        select(team, category, everything()) %>% 
                        gather(stat, value, 3:ncol(.))) %>% # make tidy
  setDT() %>% # setDT for matching
  .[ncaa_cat, on = c("category" = "num"), category := i.category] %>% # match code category to name
  .[, value := as.numeric(value)]

# show output

ncaa_import %>% as_tibble()

## # A tibble: 9,822 x 4
##    team           category stat  value
##    <chr>          <chr>    <chr> <dbl>
##  1 oklahoma st.   pass_o   g        13
##  2 washington st. pass_o   g        13
##  3 oklahoma       pass_o   g        14
##  4 ucla           pass_o   g        13
##  5 arkansas st.   pass_o   g        12
##  6 new mexico st. pass_o   g        13
##  7 western ky.    pass_o   g        13
##  8 memphis        pass_o   g        13
##  9 texas tech     pass_o   g        13
## 10 ucf            pass_o   g        13
## # ... with 9,812 more rows

Looks pretty clean. We will validate and manipulate the data in a later post. For now, we are pleased with the NCAA data.
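The least familiar step in the import is probably the data.table line that swaps the numeric category codes for readable names. It is an update join: rows are matched against a lookup table and a column is overwritten in place. Here is a minimal sketch on toy data; the tables stats and lookup are made up for illustration.

```r
library(data.table)

# Toy stand-ins for the scraped data and the category lookup table
stats <- data.table(team = c("a", "b", "c"),
                    category = c("25", "23", "25"),
                    value = c(300, 210, 275))
lookup <- data.table(num = c("25", "23"),
                     category = c("pass_o", "run_o"))

# Update join: match stats$category against lookup$num, then overwrite
# stats$category by reference with the lookup's category column (i.category)
stats[lookup, on = c("category" = "num"), category := i.category]

stats$category
# "pass_o" "run_o" "pass_o"
```

Because the assignment happens by reference, the row order of the original table is preserved and no copy is made.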

Next up is the Football Outsiders data.

# import Football Outsiders data

fo_import <- map_df(seq_along(fo_links), ~
                      read_html(fo_links[.x]) %>% 
                      html_nodes("table") %>% 
                      html_table(header = TRUE) %>% 
                      # looks good so far but the column names are heinous...let's fix that
                      map(., ~ setNames(.x, make.names(tolower(names(.x)), unique = TRUE))) %>% 
                      bind_cols() %>% 
                      .[!.$team %in% "Team",] %>% # drop repeated header rows, where the team col holds the literal "Team"
                      mutate(category = str_sub(fo_links[.x], -3)) %>% # identify category
                      select(-contains("rk"), -team1) %>% # drop all the rank cols
                      select(team, category, everything()) %>% 
                      gather(stat, value, 3:ncol(.))) %>% 
  mutate(team  = tolower(team)) %>% 
  setDT()

# show output

fo_import %>% as_tibble()

## # A tibble: 3,380 x 4
##    team             category stat      value
##    <chr>            <chr>    <chr>     <chr>
##  1 oklahoma         off      off..s.p. 47.2 
##  2 central florida  off      off..s.p. 44.1 
##  3 oklahoma state   off      off..s.p. 41.7 
##  4 memphis          off      off..s.p. 41.0 
##  5 louisville       off      off..s.p. 38.8 
##  6 florida atlantic off      off..s.p. 38.8 
##  7 ohio state       off      off..s.p. 38.7 
##  8 arizona          off      off..s.p. 38.1 
##  9 ole miss         off      off..s.p. 37.8 
## 10 penn state       off      off..s.p. 37.6 
## # ... with 3,370 more rows

Not so bad. However, beneath the attractive surface there are considerable defects (a problem known all too well to your author). Since we know in advance that we only want a certain subset of the Football Outsiders data, let’s select that and clean from there.

# clean football outsiders data

fo_data <- fo_import %>% 
  filter(stat %in% c("rushings.p.", "passings.p.", "pds.p.", "adj..pace")) %>% 
  mutate(stat = case_when(
    stat == "rushings.p." ~ "rush_sp_plus",
    stat == "passings.p." ~ "pass_sp_plus",
    stat == "pds.p."      ~ "pass_down_sp_plus",
    stat == "adj..pace"   ~ "adj_pace")) %>% 
  mutate(value = as.numeric(value))

# show output

fo_data %>% as_tibble()

## # A tibble: 910 x 4
##    team             category stat         value
##    <chr>            <chr>    <chr>        <dbl>
##  1 oklahoma         off      rush_sp_plus  152.
##  2 central florida  off      rush_sp_plus  111.
##  3 oklahoma state   off      rush_sp_plus  112.
##  4 memphis          off      rush_sp_plus   95 
##  5 louisville       off      rush_sp_plus  134.
##  6 florida atlantic off      rush_sp_plus  125.
##  7 ohio state       off      rush_sp_plus  138.
##  8 arizona          off      rush_sp_plus  113.
##  9 ole miss         off      rush_sp_plus  109.
## 10 penn state       off      rush_sp_plus  133.
## # ... with 900 more rows

There we have our clean data.
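The odd-looking stat names we filtered on, such as "adj..pace", are a side effect of make.names(), which replaces every character that is not valid in an R name with a dot. The sketch below shows the effect on a few headers. The raw headers are our guess at what the scraped table contains and may not match the site exactly; the underscore-based alternative is only an illustration, not what the import above uses.

```r
# Hypothetical raw headers; the actual scraped headers may differ
raw_headers <- c("Team", "Off. S&P+", "Adj. Pace")

# What the import does: lower-case, then make syntactically valid names
clean <- make.names(tolower(raw_headers), unique = TRUE)
clean
# "team" "off..s.p." "adj..pace"

# An alternative sketch: collapse each run of non-alphanumerics to a single
# underscore, then strip any trailing underscore
readable <- gsub("_+$", "", gsub("[^a-z0-9]+", "_", tolower(raw_headers)))
readable
# "team" "off_s_p" "adj_pace"
```

Either way, the case_when() step above is what finally maps these machine-made names to the ones we want to keep.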

Conclusion

In part I of our NCAA football model building, we demonstrated how to:

programmatically create links for webscraping
implement webscraping
clean the webscraped data

With this done, we can proceed to the next step: validation and model prepping. We are sure the anticipation is nearly too much to bear, but rest assured, updates are coming.