With the 2018-2019 NCAA football season almost upon us, the time is ripe for model building. This post is part one of \({n}\) posts describing the process. The goal of the process is to build a dynamically updating model for the analysis of college football game spreads.
This is part I: importing data for building and backtesting the model.
Building links for webscraping NCAA data
The input data will come from two sources: the NCAA website and Football Outsiders. We will programmatically create links to the pages we’d like to scrape. With any luck, these links will keep working as the pages behind them are updated throughout the 2018-2019 season.
library(tidyverse)
library(data.table)
library(rvest)
# create links for NCAA team data
divs <- c("fbs","fcs")
ncaa_link_root <- "https://www.ncaa.com/stats/football/"
ncaa_dir <- "/current/team/"
pgs <- c("/p1", "/p2", "/p3")
ncaa_cat <- data.table(num = c("25", "23", "695", "457", "24"),
                       category = c("pass_o", "run_o", "pass_d_yds", "pass_d_int", "run_d"))
ncaa_links <- map(seq_along(ncaa_cat$num),
                  ~ paste0(rep(ncaa_link_root, 6),
                           rep(divs, 3),
                           rep(ncaa_dir, 6),
                           ncaa_cat$num[.x],
                           rep(pgs, 2))) %>%
  unlist()
# create links for Football Outsiders data
fo_links <- c("https://www.footballoutsiders.com/stats/ncaaoff",
              "https://www.footballoutsiders.com/stats/ncaadef")
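The NCAA link builder above leans on vector recycling: rep(divs, 3) and rep(pgs, 2) happen to pair up into all six division/page combinations because 2 and 3 share no common factor. If a fourth page were ever added, that trick would quietly skip combinations. Here is a minimal sketch of a more explicit alternative using base R's expand.grid(). The helper name build_ncaa_links is our own invention for illustration, not part of any package.

```r
# Hypothetical helper: spell out every category/division/page combination
build_ncaa_links <- function(cat_nums,
                             divs = c("fbs", "fcs"),
                             pgs = c("/p1", "/p2", "/p3"),
                             root = "https://www.ncaa.com/stats/football/",
                             dir = "/current/team/") {
  # expand.grid() returns one row per combination of its arguments
  grid <- expand.grid(pg = pgs, div = divs, num = cat_nums,
                      stringsAsFactors = FALSE)
  paste0(root, grid$div, dir, grid$num, grid$pg)
}

links <- build_ncaa_links(c("25", "23", "695", "457", "24"))
length(links)         # 5 categories x 2 divisions x 3 pages
anyDuplicated(links)  # 0 means every link is unique
```

This should produce the same set of 30 links as the recycling version, just in a different order, and it keeps working if the number of pages or divisions changes.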
Writing the scraping functions and storing the data
The links are built. We will now make use of the handy purrr and rvest tidyverse packages to scrape the data from the links.
# import NCAA data
ncaa_import <- map_df(seq_along(ncaa_links), ~
  read_html(ncaa_links[.x]) %>%
    html_nodes("table") %>%
    html_table() %>%
    .[[1]] %>% # select first list element
    setNames(gsub(" ", "_", tolower(names(.)))) %>% # clean names
    setNames(gsub("\\/", "_per_", names(.))) %>%
    mutate(div = str_extract(ncaa_links[.x], "fbs|fcs")) %>% # code division
    mutate(div = if_else(div == "fbs", 1, 1.5)) %>%
    mutate(team = tolower(team)) %>%
    mutate(category = str_extract(ncaa_links[.x], "[0-9]+")) %>% # code category
    select(-rank) %>%
    select(team, category, everything()) %>%
    gather(stat, value, 3:ncol(.))) %>% # make tidy
  setDT() %>% # setDT for matching
  .[ncaa_cat, on = c("category" = "num"), category := i.category] %>% # match code category to name
  .[, value := as.numeric(value)]
# show output
ncaa_import %>% as_tibble()
## # A tibble: 9,822 x 4
## team category stat value
## <chr> <chr> <chr> <dbl>
## 1 oklahoma st. pass_o g 13
## 2 washington st. pass_o g 13
## 3 oklahoma pass_o g 14
## 4 ucla pass_o g 13
## 5 arkansas st. pass_o g 12
## 6 new mexico st. pass_o g 13
## 7 western ky. pass_o g 13
## 8 memphis pass_o g 13
## 9 texas tech pass_o g 13
## 10 ucf pass_o g 13
## # ... with 9,812 more rows
Looks pretty clean. We will validate and manipulate the data in a later post. For now, we are pleased with the NCAA data.
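One caveat worth sketching: the import loops over 30 live pages, and a single failed request stops the whole map_df() call. Below is a minimal base R sketch of guarding each request with tryCatch(), so a dead link is skipped rather than fatal. Both safe_fetch and fake_fetch are hypothetical names of our own. fake_fetch merely stands in for the real read_html() pipeline so the example runs offline.

```r
# Hypothetical wrapper: return NULL instead of stopping when a page fails
safe_fetch <- function(url, fetch) {
  tryCatch(
    fetch(url),
    error = function(e) {
      message("skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for the real scraping pipeline (assumption: the p3 page is missing)
fake_fetch <- function(url) {
  if (grepl("p3$", url)) stop("page not found")
  data.frame(team = "oklahoma", url = url, stringsAsFactors = FALSE)
}

urls <- c("https://www.ncaa.com/stats/football/fbs/current/team/25/p1",
          "https://www.ncaa.com/stats/football/fbs/current/team/25/p3")

pages <- lapply(urls, safe_fetch, fetch = fake_fetch)
ok <- Filter(Negate(is.null), pages)  # keep only the pages that loaded
result <- do.call(rbind, ok)
nrow(result)  # one row: the p1 page survived, p3 was skipped
```

In the real scrape, the fetch argument would be a function wrapping the read_html() chain above. The trade-off is that skipped pages fail quietly, so it is worth checking the messages after each run.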
Next up is the Football Outsiders data.
# import Football Outsiders data
fo_import <- map_df(seq_along(fo_links), ~
  read_html(fo_links[.x]) %>%
    html_nodes("table") %>%
    html_table(header = TRUE) %>%
    # looks good so far but the column names are heinous...let's fix that
    map(., ~ setNames(.x, make.names(tolower(names(.x)), unique = TRUE))) %>%
    bind_cols() %>%
    .[!.$team %in% "Team",] %>% # drop repeated header rows, where the team col reads "Team"
    mutate(category = str_sub(fo_links[.x], -3)) %>% # identify category ("off" or "def")
    select(-contains("rk"), -team1) %>% # drop all the rank cols and the extra team1 col
    select(team, category, everything()) %>%
    gather(stat, value, 3:ncol(.))) %>%
  mutate(team = tolower(team)) %>%
  setDT()
# show output
fo_import %>% as_tibble()
## # A tibble: 3,380 x 4
## team category stat value
## <chr> <chr> <chr> <chr>
## 1 oklahoma off off..s.p. 47.2
## 2 central florida off off..s.p. 44.1
## 3 oklahoma state off off..s.p. 41.7
## 4 memphis off off..s.p. 41.0
## 5 louisville off off..s.p. 38.8
## 6 florida atlantic off off..s.p. 38.8
## 7 ohio state off off..s.p. 38.7
## 8 arizona off off..s.p. 38.1
## 9 ole miss off off..s.p. 37.8
## 10 penn state off off..s.p. 37.6
## # ... with 3,370 more rows
Not so bad. However, beneath the attractive surface there are considerable defects (a problem known all too well to your author). Since we know in advance that we only want a certain subset of the Football Outsiders data, let’s select that and clean from there.
# clean football outsiders data
fo_data <- fo_import %>%
  filter(stat %in% c("rushings.p.", "passings.p.", "pds.p.", "adj..pace")) %>%
  mutate(stat = case_when(
    stat == "rushings.p." ~ "rush_sp_plus",
    stat == "passings.p." ~ "pass_sp_plus",
    stat == "pds.p." ~ "pass_down_sp_plus",
    stat == "adj..pace" ~ "adj_pace")) %>%
  mutate(value = as.numeric(value))
# show output
fo_data %>% as_tibble()
## # A tibble: 910 x 4
## team category stat value
## <chr> <chr> <chr> <dbl>
## 1 oklahoma off rush_sp_plus 152.
## 2 central florida off rush_sp_plus 111.
## 3 oklahoma state off rush_sp_plus 112.
## 4 memphis off rush_sp_plus 95
## 5 louisville off rush_sp_plus 134.
## 6 florida atlantic off rush_sp_plus 125.
## 7 ohio state off rush_sp_plus 138.
## 8 arizona off rush_sp_plus 113.
## 9 ole miss off rush_sp_plus 109.
## 10 penn state off rush_sp_plus 133.
## # ... with 900 more rows
There we have our clean data.
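For anyone puzzled by stat names like off..s.p. in the raw import, they are a side effect of make.names(), which swaps every character that is not valid in an R name for a dot. A small offline illustration follows. The header strings here are assumed for the demo, guessed from the output above rather than copied from the site.

```r
# Assumed raw headers, for illustration only
raw_headers <- c("Team", "Off. S&P+", "Rk", "Team")

# Same cleaning step as in the import: lowercase, then make syntactic and unique
clean <- make.names(tolower(raw_headers), unique = TRUE)
print(clean)
# "team" "off..s.p." "rk" "team.1"
```

The space, the ampersand and the plus sign each become a dot, and unique = TRUE adds a suffix to the repeated name. That is why we recode the handful of stats we keep into readable names such as rush_sp_plus.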
Conclusion
In part I of our NCAA football model building, we demonstrated how to do three things:
- programmatically create links for webscraping
- implement webscraping
- clean the webscraped data
With this done, we can proceed to the next step: validation and model prepping. We are sure the anticipation is nearly too much to bear, but rest assured, updates are coming.