This is part two of our NCAA football model building. You can find part one here. Now is when it gets a little more exciting. The steps for today are quite simple:
- read in data scraped in part 1
- add a few variables
- clean up the data
- prepare for the dirty, dirty matching and modeling to come
Read in the old
Your author took liberties to save down .rds
files of the data scraped in part one. We must import these files to begin.
ncaa_import <- readRDS("../../static/ncaa_import.rds")
fo_data <- readRDS("../../static/fo_data.rds")
(By the way, if you have questions about saving down files for import on blogdown
, see here for helpful tips.)
Passer rating time
To build our models, we plan to capture offensive and defense team statistics, for the running and passing game. We might also use a pace statistic but don’t rush me here. The theory goes that we will use a combination of these statistics to predict lines, compare those lines to the sportsbooks and examine the differences where we see outliers.
To capture the passing game efficiently we use the old passer rating formula. We have taken the coefficients and weights from Wikipedia, venerable sabermetric almanac that it is.
Since this applies to both offense and defense, let’s write a function to calculate the passer rating for us.
passer_rating <- function(df, pass_com, pass_att, yds_per_att, pass_td, int) {
a <- (df[[pass_com]] / df[[pass_att]] - 0.3) * 5
b <- (df[[yds_per_att]] - 3) * 0.25
c <- (df[[pass_td]] / df[[pass_att]]) * 20
d <- 2.375 - (df[[int]] / df[[pass_att]] * 25)
passer_rating <- ((a + b + c + d) / 6) * 100
rm(a, b, c, d)
return(passer_rating)
}
Ain’t she a beaut!? Not really. In any case, we will now take our tidy data, make it wide, calculate offensive and defensive passer rating stats, select variables of interest from the NCAA data and massage our quads in ebullient celebration.
# load, spread, calc
library(tidyverse)
ncaa_df <- ncaa_import %>%
mutate(cat_stat = paste0(category, "_", stat)) %>% # 'cause we can't have duplicate col names
select(-category, -stat) %>%
spread(cat_stat, value) %>%
mutate(pass_o_pass_rt = passer_rating(.,
"pass_o_pass_com",
"pass_o_pass_att",
"pass_o_yds_per_att",
"pass_o_pass_td",
"pass_o_int"),
pass_d_pass_rt = passer_rating(.,
"pass_d_yds_opp_cpl",
"pass_d_yds_opp_pass",
"pass_d_yds_yds_per_att",
"pass_d_yds_opp_pass_tds",
"pass_d_int_opp_int"))
Now we’ve got to select variables of interest: no mean feat on 42 variables. Here are the options.
names(ncaa_df)
## [1] "team" "pass_d_int_div"
## [3] "pass_d_int_g" "pass_d_int_int_ret_tds"
## [5] "pass_d_int_int_ret_yds" "pass_d_int_opp_int"
## [7] "pass_d_int_opp_pass" "pass_d_yds_div"
## [9] "pass_d_yds_g" "pass_d_yds_opp_cpl"
## [11] "pass_d_yds_opp_pass" "pass_d_yds_opp_pass_tds"
## [13] "pass_d_yds_opp_pass_yds" "pass_d_yds_yds_per_att"
## [15] "pass_d_yds_yds_per_comp" "pass_d_yds_ypg"
## [17] "pass_o_div" "pass_o_g"
## [19] "pass_o_int" "pass_o_pass_att"
## [21] "pass_o_pass_com" "pass_o_pass_td"
## [23] "pass_o_pass_yds" "pass_o_yds_per_att"
## [25] "pass_o_yds_per_comp" "pass_o_ypg"
## [27] "run_d_div" "run_d_g"
## [29] "run_d_opp_rush" "run_d_opp_rush_tds"
## [31] "run_d_opp_rush_yds" "run_d_yds_per_rush"
## [33] "run_d_ypg" "run_o_div"
## [35] "run_o_g" "run_o_rush"
## [37] "run_o_rush_td" "run_o_rush_yds"
## [39] "run_o_yds_per_rush" "run_o_ypg"
## [41] "pass_o_pass_rt" "pass_d_pass_rt"
Alright. Time to kill the weak and irrelevant.
# lock up the good columns and re-tidy the mighty
ncaa_df <- ncaa_df %>%
select(team,
pass_o_pass_rt,
pass_o_pass_att,
run_o_yds_per_rush,
run_o_rush,
pass_d_pass_rt,
pass_d_int_opp_pass,
run_d_yds_per_rush,
run_d_opp_rush,
pass_d_int_div) %>%
gather(cat_stat, value, -team) %>%
separate(cat_stat, into = c("category", "stat"), sep = "(?<=o|d)_") %>%
as_tibble()
# show what we've got
ncaa_df
## # A tibble: 2,268 x 4
## team category stat value
## <chr> <chr> <chr> <dbl>
## 1 abilene christian pass_o pass_rt 78.3
## 2 air force pass_o pass_rt 86.6
## 3 akron pass_o pass_rt 82.3
## 4 alabama pass_o pass_rt 112.
## 5 alabama a&m pass_o pass_rt 55.8
## 6 alabama st. pass_o pass_rt 56.5
## 7 albany (ny) pass_o pass_rt 70.2
## 8 alcorn pass_o pass_rt 88.4
## 9 appalachian st. pass_o pass_rt 105.
## 10 arizona pass_o pass_rt 90.0
## # ... with 2,258 more rows
We are now positioned to begin the dirty dirty dirty work.
(The regex here was actually the most laborious process of the post. Your author took his seat court side at the regex games and learned a thing or two about look arounds. Regex is a power player.)
The name game
One of the endless delights in data cleaning is name matching. Pulling as we are from two sources, the NCAA and Football Outsiders, it comes as no surprise that there isn’t pure consistency in the names of universities. We must rectify this issue before moving on.
A perusal of both datasets shows that the “state” vs. “st.” spelling might cause a lot of problems. After fixing that, an anti-join will reveal the remaining labor.
# create a team name key from NCAA division 1 programs as FO only has D1 data
ncaa_names <- ncaa_df %>%
filter(stat %in% "int_div" & value == 1) %>%
select(team) %>%
distinct()
# find the missing pieces
missing_names <- ncaa_names %>%
anti_join(fo_data %>% mutate(team = gsub("state", "st.", team)))
missing_names %>% print(n = Inf)
## # A tibble: 20 x 1
## team
## <chr>
## 1 army west point
## 2 central mich.
## 3 eastern mich.
## 4 fiu
## 5 fla. atlantic
## 6 ga. southern
## 7 la.-monroe
## 8 louisiana
## 9 miami (fl)
## 10 miami (oh)
## 11 middle tenn.
## 12 nc state
## 13 northern ill.
## 14 south fla.
## 15 southern california
## 16 southern miss.
## 17 ucf
## 18 uconn
## 19 western ky.
## 20 western mich.
# manual correction for missing names
fo_data <- fo_data %>%
mutate(team = gsub("state", "st.", team),
team = gsub("army", "army west point", team),
team = gsub("central michigan", "central mich.", team),
team = gsub("eastern michigan", "eastern mich.", team),
team = gsub("florida international", "fiu", team),
team = gsub("florida atlantic", "fla. atlantic", team),
team = gsub("georgia southern", "ga. southern", team),
team = gsub("ul-monroe", "la.-monroe", team),
team = gsub("ul-lafayette", "louisiana", team),
team = gsub("miami-fl", "miami (fl)", team),
team = gsub("miami-oh", "miami (oh)", team),
team = gsub("middle tennessee", "middle tenn.", team),
team = gsub("nc st.", "nc state", team),
team = gsub("northern illinois", "northern ill.", team),
team = gsub("south florida", "south fla.", team),
team = gsub("usc", "southern california", team),
team = gsub("southern miss", "southern miss.", team),
team = gsub("central florida", "ucf", team),
team = gsub("connecticut", "uconn", team),
team = gsub("western kentucky", "western ky.", team),
team = gsub("western michigan", "western mich.", team))
# check again
missing_names_fix <- ncaa_names %>%
anti_join(fo_data)
dim(missing_names_fix)
## [1] 0 1
The deed is done. We can now bind to produce our final stats data frame. Gloria in excelsis deo.
# I see you binding out on the floor
fbs_data <- bind_rows(ncaa_df, fo_data) %>%
filter(!stat %in% "int_div") # drop division stat
# A final look at the categories
fbs_data %>%
group_by(category) %>%
tally(sort = TRUE)
## # A tibble: 6 x 2
## category n
## <chr> <int>
## 1 off 520
## 2 pass_d 504
## 3 pass_o 504
## 4 run_d 504
## 5 run_o 504
## 6 def 390
# A final look at all the stats
fbs_data %>%
group_by(stat) %>%
tally(sort = TRUE)
## # A tibble: 10 x 2
## stat n
## <chr> <int>
## 1 pass_rt 504
## 2 yds_per_rush 504
## 3 pass_down_sp_plus 260
## 4 pass_sp_plus 260
## 5 rush_sp_plus 260
## 6 int_opp_pass 252
## 7 opp_rush 252
## 8 pass_att 252
## 9 rush 252
## 10 adj_pace 130
Thanks for sticking around. Up next, in the much-vaunted part 3 (?), we integrate schedule data and kick off the modeling process. Until then…