Alright y’all. Welcome to part 3. You can find parts one and two at the links. Let us stay excited for the fitness you’ve come to witness.
Today’s plan proceeds thus:
- import last year’s NCAA FBS schedule
- bring in FBS data from part two and normalize data
- determine match ups and add in team data
Please note that this isn’t final as we still lack market information. However, this is a terrific beginning. In case any readers have a better implementation of the final look up, please hit me on Twitter.
Importing the NCAA schedule
library(tidyverse)
# set links to download and create tibbles
divs <- c("FBS", "FCS")
links <- paste0("http://fs.ncaa.org/Docs/stats/football_records/sched/2017/", divs, ".csv")
# import schedule for all divisions
schedule <- map_df(seq_along(links), ~
read_csv(links[.x]) %>%
mutate(div = divs[.x]) %>%
select_all(tolower) %>%
mutate(game_date = lubridate::mdy(game_date)) %>%
mutate_if(is.character, tolower))
# create matchup df
schedule_matchup <- schedule %>%
mutate(matchup = ifelse(loc == "h",
paste0(org_name, "_", opponent_name),
paste0(opponent_name, "_", org_name))) %>%
separate(matchup, into = c("home", "away"), sep = "_") %>%
select(date = game_date, home, away) %>%
distinct()
schedule_matchup
## # A tibble: 1,510 x 3
## date home away
## <date> <chr> <chr>
## 1 2017-09-02 air force vmi
## 2 2017-09-16 michigan air force
## 3 2017-09-23 air force san diego st.
## 4 2017-09-30 new mexico air force
## 5 2017-10-07 navy air force
## 6 2017-10-14 air force unlv
## 7 2017-10-20 nevada air force
## 8 2017-10-28 colorado st. air force
## 9 2017-11-04 air force army west point
## 10 2017-11-11 air force wyoming
## # ... with 1,500 more rows
Bring in FBS data
We’ve got the schedule pulled and cleaned. On to the next one, which, in this case, means importing the FBS statistics we pulled in part two. Your author took the liberty to save down the needed data from part two, so a simple readRDS
should do the trick.
This gets a little nasty as we apply some functionals to create new columns. Once again, astute readers please let me know if you see opportunities for improving this code.
# get opponent stats, first nest
sched_nest <- schedule %>% nest(-org_name)
# read in fbs_data
fbs_data <- readRDS("../../static/fbs_data.rds")
# make wide
fbs_wide <- fbs_data %>%
mutate(cat_stat = paste0(category, "_", stat)) %>%
select(team, cat_stat, value) %>%
spread(cat_stat, value)
# add in opponent stats
fbs_wide$opp_pass_d_pass_rt <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "pass_d_pass_rt"][[1]], na.rm = TRUE))
fbs_wide$opp_run_d_yds_per_rush <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "run_d_yds_per_rush"][[1]], na.rm = TRUE))
fbs_wide$opp_pass_o_pass_rt <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "pass_o_pass_rt"][[1]], na.rm = TRUE))
fbs_wide$opp_run_o_yds_per_rush <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "run_o_yds_per_rush"][[1]], na.rm = TRUE))
# calc normalized stats
fbs_wide_ncaa <- fbs_wide %>%
mutate(norm_pass_o_pass_rt = pass_o_pass_rt - opp_pass_d_pass_rt,
norm_run_o_yds_per_rush = run_o_yds_per_rush - opp_run_d_yds_per_rush,
norm_pass_d_pass_rt = pass_d_pass_rt - opp_pass_o_pass_rt,
norm_run_d_yds_per_rush = run_d_yds_per_rush - run_o_yds_per_rush) %>%
select(team, norm_pass_o_pass_rt, norm_run_o_yds_per_rush,
norm_pass_d_pass_rt, norm_run_d_yds_per_rush) %>%
gather(stat, value, -team) %>%
group_by(stat) %>%
mutate(value = case_when(
is.na(value) ~ quantile(value, 0.05, na.rm = TRUE),
TRUE ~ value)) %>%
spread(stat, value)
fbs_wide_ncaa
## # A tibble: 253 x 5
## team norm_pass_d_pas~ norm_pass_o_pas~ norm_run_d_yds_~
## <chr> <dbl> <dbl> <dbl>
## 1 abil~ 8.29 -8.72 1.73
## 2 air ~ 10.6 -1.61 1.11
## 3 akron -6.14 -2.22 1.82
## 4 alab~ -31.6 22.3 -3.01
## 5 alab~ -6.30 -28.5 1.28
## 6 alab~ -27.5 -27.8 0.210
## 7 alba~ -9.88 -9.35 0.440
## 8 alco~ -7.83 -4.39 -2.4
## 9 appa~ -9.97 14.6 -1.59
## 10 ariz~ -6.75 5.56 -1.95
## # ... with 243 more rows, and 1 more variable:
## # norm_run_o_yds_per_rush <dbl>
Weekly Matchups with data
Now we will compute our game statistics for each match up and add to the data frame. This function isn’t the cleanest, and any readers with ideas for improvement please let me know on Twitter.
# write function for normalizing
team_stat <- function(df, team_one = NULL, team_two = NULL,
stat_one = NULL, stat_two = NULL) {
team_one <- enquo(team_one)
team_two <- enquo(team_two)
stat_one <- enquo(stat_one)
stat_two <- enquo(stat_two)
# offensive stats
team_one_stat_sample <- df %>%
filter(team %in% !!team_one) %>%
pull(!!stat_one)
home_stats_pop <- df %>%
select(!!stat_one) %>%
summarize(mean = mean(.[[1]], na.rm = TRUE),
sd = sd(.[[1]], na.rm = TRUE))
# away stats
team_two_stat_sample <- df %>%
filter(team %in% !!team_two) %>%
pull(!!stat_two)
away_stats_pop <- df %>%
select(!!stat_two) %>%
summarize(mean = mean(.[[1]], na.rm = TRUE),
sd = sd(.[[1]], na.rm = TRUE))
# calcuate statistic
calc_stat <- ((team_one_stat_sample + team_two_stat_sample) -
(home_stats_pop$mean + away_stats_pop$mean)) /
sqrt(home_stats_pop$sd ^ 2 + away_stats_pop$sd ^ 2)
rm(team_one_stat_sample, home_stats_pop, team_two_stat_sample, away_stats_pop)
return(calc_stat)
}
Not the prettiest, but it gets the job done. Speaking of not pretty, wait until you feast your eyes on the code below. The difficulty encountered here is applying the function to each team, home and away, for each week. It takes an eternity to run but seems to work.
# create df with data by matchup for each game in 2017
fbs_full <- schedule_matchup %>%
mutate(home_pass_norm = map2(home, away, ~
team_stat(fbs_wide_ncaa, .x, .y, "norm_pass_o_pass_rt", "norm_pass_d_pass_rt")) %>%
lapply(., `length<-`, max(lengths(.))) %>%
unlist(),
away_pass_norm = map2(away, home, ~
team_stat(fbs_wide_ncaa, .x, .y, "norm_pass_o_pass_rt", "norm_pass_d_pass_rt")) %>%
lapply(., `length<-`, max(lengths(.))) %>%
unlist(),
home_run_norm = map2(home, away, ~
team_stat(fbs_wide_ncaa, .x, .y, "norm_run_o_yds_per_rush", "norm_run_d_yds_per_rush")) %>%
lapply(., `length<-`, max(lengths(.))) %>%
unlist(),
away_run_norm = map2(away, home, ~
team_stat(fbs_wide_ncaa, .x, .y, "norm_run_o_yds_per_rush", "norm_run_d_yds_per_rush")) %>%
lapply(., `length<-`, max(lengths(.))) %>%
unlist())
fbs_full
## # A tibble: 1,510 x 7
## date home away home_pass_norm away_pass_norm home_run_norm
## <date> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2017-09-02 air ~ vmi 1.06 -0.959 1.31
## 2 2017-09-16 mich~ air ~ -0.357 -0.899 0.564
## 3 2017-09-23 air ~ san ~ -0.0181 1.10 -0.847
## 4 2017-09-30 new ~ air ~ -0.693 0.977 1.34
## 5 2017-10-07 navy air ~ -0.0341 1.18 1.54
## 6 2017-10-14 air ~ unlv 0.556 0.161 0.312
## 7 2017-10-20 neva~ air ~ 0.965 1.11 0.907
## 8 2017-10-28 colo~ air ~ 0.991 -0.559 1.27
## 9 2017-11-04 air ~ army~ 0.127 -2.11 -0.0802
## 10 2017-11-11 air ~ wyom~ -0.833 1.01 0.523
## # ... with 1,500 more rows, and 1 more variable: away_run_norm <dbl>
For now, this is all. The next steps are, 1) pulling in market information and 2) translating our normalized scores into expected lines to judge against the market information. With that done we can look for the outliers and plan accordingly.