A Few Bars Blues: Part II - Whence the Volta?

Some time ago I wondered about the Blues 2018-2019 Stanley Cup-Winning season (see part I here. In particular, WTF? The incredible chemistry, the heroic turn around story, hometown heroes…what’s not to love!?

This post, the belated part II, finally nails down the reasons behind the turnaround. In short, it’s ya boy Jordan Binnington. As always, code at the bottom for the curious. Big thanks to Jozef Hajnala and his beautiful no frills NHL API package.

Background & Set Up

You’ll recall this plot from part I, which shows the Blues ’18-’19 relative point-wise progress through their Stanley Cup bid:

Waves of astonishment flood the mind even in this latter-day recollection. A team climbing from near the bottom to the tip top!

To conduct this review I had to decide at what point the turnaround (the volta, for those given to poetics and/or sonnetteering) occurred. I settled on game 40, a 3-0 victory on 2019-01-07 @ the Flyers.

Why game 40? In a less than scientific effort, I used a 5 game rolling win-equivalent point measure. It was after this game the club began their ascent, including their incredible 11 game winning streak in Feb 2019. Beyond game 40, the Blues were averaging at least 2.5 wins per every 5 games (save for one short spell at the end of the season, probably the result of some utilization management).

So What Explains the Turn?

To answer this question, I fit a few classifier models to predict whether a game was pre or post turnaround. A little experimentation showed a random forest model could hit 100% accuracy when fit on the whole dataset. Yes, there is over fitting, but as this was an explanatory adventure, that was cool w/ me. For those keeping score at home, the best CV model hit 96% accuracy out of sample.

In terms of variables for inclusion, I used play by play data to aggregate game-level event counts–e.g. hits, shots, saves, etc.

Without further ado, here were the results:

No one–NOBODY–should be shocked to learn that the advent of Jordan Binnington is the main driver of difference between the pre and post Blues experience in 2018-2019! The guy was a monumental force and changed the complexion of the Blues.

A Closer Look At The Key Drivers of The Volta

Ok, beyond JB, there are a few really interesting results here. The plot below shows the distribution of the top 6 (after that variable importance tails off) variables pre & post turnaround.

For anyone who has read Brian McDonald’s 2012 expected goals paper (the paper is here for those who haven’t), a few cool things pop out!

Opponent hits is the second most important classifier variable. But, as you can see, the distribution of opponent hits went up after the turn. McDonald notes this exact item in his paper, especially the sign in front of the coefficient. He speculates that hits are a proxy for possession, that is, more possession by the Blues in the post-volta period gave rise to more opponent hits. Related to this is the rise in opponent takeaways in the post-turn period. It’s easier for opponents to register more takeaways if the Blues have the puck more.
Also interesting is the change in Blues faceoff wins. McDonald notes this too. I don’t have a great explanation for this item … :)
Finally, the Blues Corsi shot distance is pretty cool. Not sure I have the calcs 100% correct (I modified this Evolving Wild script for my calcs), but assuming I do you’ll see that while the median shot distance remains constant post-turn, the distribution is wider and the IQR shifts up very slightly. The upward shift wasn’t statistically significant, but the wider left tail skew couldn’t have hurt!!

Conclusion & code

Forgive my belated scratching of the itch, but it was a pleasure to review one of the coolest NHL team-seasons on record. Great team, great story, and all just in time for the 2021 Playoffs.

In summary, Jordan Binnington and some the traditional expected goals variables hold the keys to explaining why the Blues were able to turn it all around in 2018-2019.

Code

library(nhlapi)
library(tidyverse)
library(extrafont)
library(rcartocolor)
library(tidymodels)
library(vip)
library(janitor)

# set theme

theme_set(
  theme_minimal(base_family = "Gill Sans MT") +
    theme(axis.line = element_line(color = "black"),
          axis.text = element_text(color = "black"),
          axis.title = element_text(color = "black"),
          legend.position = "top",
          panel.grid.minor = element_blank(),
          plot.caption = element_text(hjust = 0)
    )
)


# get stl ID

stl_id <- nhl_teams() %>% 
  filter(franchise.teamName == "Blues") %>% 
  pull(id)

# step one get Blues game ids for 2018-2019 season

game_ids <- nhl_schedule_seasons(
      2018,
      teamIds = stl_id,
      expand = "schedule.linescore"
    ) %>% 
  .[[1]] %>% 
  .[["dates"]] %>% 
  unnest(games) %>% 
  clean_names()

# function to get blues play by play data

get_play_by_play <- function(game_link) {
  
  gd <- file.path("https://statsapi.web.nhl.com", game_link) %>%
    nhl_get_data()
  
  sog <- cbind(unlist(gd[[1]]$liveData$linescore$teams)) %>% as.data.frame() %>% rownames_to_column("variable") %>% 
    filter(grepl("team.name|goals$|OnGoal$", variable)) %>% 
    mutate(link = game_link) %>% 
    pivot_wider(names_from = variable, values_from = V1) %>% 
    type_convert() %>% 
    mutate(stat_for_save_pct = (away.shotsOnGoal - away.goals) / away.shotsOnGoal) %>% 
    select(link, stat_for_save_pct)
  
  pbp <- gd[[1]] %>% 
    .[["liveData"]] %>% 
    .[["plays"]] %>% 
    .[["allPlays"]] %>% 
    as_tibble() %>% 
    mutate(link = game_link) %>% 
    select(link, everything())
  
  list("sog_data" = sog,
       "pbp_data" = pbp)
  
}

# get the play by play data for each game

pbp_get <- map(game_ids$link, get_play_by_play)

stl_pbp_data <- map_dfr(seq_along(pbp_get), ~ pbp_get[[.x]]$pbp_data) %>% 
  clean_names()

stl_sog_data <- map_dfr(seq_along(pbp_get), ~ pbp_get[[.x]]$sog_data) %>% 
  clean_names()

# using this, https://rpubs.com/evolvingwild/395136/, add in distances (not perfect!)

clean_pbp <- stl_pbp_data %>% 
  left_join(game_ids, by = "link") %>% 
  filter(game_type != "PR") %>% 
  mutate(game_number = match(link, unique(link)),
         event_distance = sqrt((89 - abs(coordinates_x))^2 + coordinates_y^2), 
         event_angle = abs(atan(coordinates_y / (89 - abs(coordinates_x))) * (180 / pi)),
         binnington_played = ifelse(grepl("saved by Jordan Binnington", result_description), 1, 0)) %>% 
  left_join(stl_sog_data, by = "link")

# get cumulative performance

stl_record <- clean_pbp %>% 
  filter(game_type == "R") %>% 
  select(c(game_number, matches("team_name$"), contains("league_record"))) %>% 
  select(-c(team_name, contains("type"), contains("linescore"))) %>% 
  distinct() %>% 
  pivot_longer(c(contains("team_name")), names_to = "home_away", values_to = "team") %>% 
  filter(team == "St. Louis Blues") %>% 
  mutate(wins = ifelse(grepl("away", home_away), teams_away_league_record_wins, teams_home_league_record_wins),
         losses = ifelse(grepl("away", home_away), teams_away_league_record_losses, teams_home_league_record_losses),
         ot = ifelse(grepl("away", home_away), teams_away_league_record_ot, teams_home_league_record_ot),
         points = wins * 2 + ot + losses * 0,
         point_slope = (zoo::rollmean(points - lag(points, 1), k = 5, na.pad = TRUE, align = "right")) * 5 / 2) %>% 
  select(game_number, points, wins, losses, ot, point_slope)

ggplot(stl_record, aes(game_number, point_slope)) +
  geom_hline(yintercept = 2.5, lty = 2, color = "darkgrey") +
  geom_line(color = "dodgerblue4", size = 1) +
  geom_point(aes(x = 40, y = 2), size = 4, color = "#ffc72b") +
  annotate(geom = "text", x = 40, y = 1.8, label = "Turnaround Point", hjust = 0, family = "Gill Sans MT", color = "#ffc72b", fontface = "bold") +
  scale_x_continuous(expand = expansion(add = c(0,0)), breaks = pretty_breaks(10)) +
  scale_y_continuous(limits = c(0, 5), expand = expansion(add = c(0,0.25))) +
  labs(x = "Game Number",
       y = "Win Equivalents",
       title = "Picking the Turnaround Point for the Blues in '18-'19",
       subtitle = "Rolling Win Equivalents* Per 5 Games | - - - Shows a 50% Win %",
       caption = "verbumdata.netlify.com\n*Win Equivalents derived from rolling point totals\nSource: jozefhajnala/nhlapi package")

# look at chgs after game 38 from my post

chg_data <- clean_pbp %>%
  mutate(team_name = ifelse(team_name == "St. Louis Blues", team_name, "Opponent")) %>% 
  group_by(game_number, team_name) %>% 
  count(result_event_type_id, result_secondary_type) %>% 
  filter(!is.na(team_name) & result_event_type_id != "CHALLENGE") %>% 
  mutate(epoch = ifelse(game_number <= 40, "bad", "good"))

# create data for modeling

diff_data <- chg_data %>% 
  group_by(game_number, result_event_type_id, result_secondary_type) %>% 
  summarize(team_name = team_name,
            stat_for = n[team_name == "St. Louis Blues"],
            stat_against = n[team_name == "Opponent"],
            epoch = unique(epoch), 
            .groups = "drop") %>% 
  mutate(result_secondary_type = gsub(" |-", "_", result_secondary_type),
         event = ifelse(
           !is.na(result_secondary_type), 
           tolower(paste(result_event_type_id, result_secondary_type, sep = "_")), 
           tolower(result_event_type_id)
           )
         )

# get binnington saves & corsi shot distances/angle

bsaves <- clean_pbp %>%
  group_by(game_number) %>%
  summarize(binnington_played = ifelse(sum(binnington_played, na.rm = T) > 0, 1, 0),
            stat_for_save_pct = unique(stat_for_save_pct))

corsi_shot_data <- clean_pbp %>% 
  filter(!is.na(team_name)) %>% 
  group_by(game_number, team_name) %>% 
  summarize(med_shot_dist = median(event_distance[result_event_type_id %in% c("BLOCKED_SHOT", "GOAL", "MISSED_SHOT", "SHOT")], na.rm = T),
            med_shot_angle = median(event_angle[result_event_type_id %in% c("BLOCKED_SHOT", "GOAL", "MISSED_SHOT", "SHOT")], na.rm = T),
            .groups = "drop") %>% 
  mutate(team_name = ifelse(team_name == "St. Louis Blues", team_name, "Opponent")) %>% 
  pivot_wider(names_from = team_name, values_from = c(med_shot_dist, med_shot_angle)) %>% 
  clean_names()

# make it wide
  
diff_reg <- diff_data %>% 
  select(game_number, epoch, event, stat_for, stat_against) %>% 
  distinct() %>% 
  pivot_wider(names_from = "event", values_from = c("stat_for", "stat_against"), values_fill = 0) %>% 
  left_join(bsaves, by = "game_number") %>% 
  left_join(corsi_shot_data, by = "game_number") %>% 
  mutate(epoch = factor(epoch, levels = unique(epoch)))

# make recipe

stl_rec <- recipe(epoch ~ ., data = diff_reg) %>%
  update_role(game_number, new_role = "Id") %>%
  step_impute_median(all_predictors()) %>% 
  step_normalize(all_predictors()) %>% 
  step_nzv(all_predictors())

# create model specs w tuning

rf_spec <- rand_forest(
    mtry = tune(),
    trees = 1000,
    min_n = tune()
  ) %>%
  set_mode("classification") %>%
  set_engine("ranger", importance = "permutation")

# tune wf & create cv 

tune_wf <- workflow() %>% 
  add_recipe(stl_rec) %>% 
  add_model(rf_spec)

stl_folds <- vfold_cv(diff_reg, 10)

set.seed(513)

tune_res <- tune_grid(
  tune_wf,
  resamples = stl_folds,
  grid = 20
)

best_auc <- select_best(tune_res, "roc_auc")

# final model

final_rf <- finalize_model(
  rf_spec,
  best_auc
)

final_wf <- workflow() %>% 
  add_recipe(stl_rec) %>% 
  add_model(final_rf) %>% 
  fit(diff_reg)

reg_results <- bind_cols(
  diff_reg,
  predict(final_wf, new_data = diff_reg)
)

# check performance

mean(reg_results$epoch == reg_results$.pred_class)

# pull vars out for VIP plot and clean up labels

t0 <- final_wf %>% 
  pull_workflow_fit() 

vi_names <- names(t0$fit$variable.importance) 
new_vi_names <- gsub("stat_for", "Blues", vi_names)
new_vi_names <- gsub("stat_against", "Opponent", new_vi_names)
new_vi_names <- gsub("med_shot_dist_st_louis_blues", "Blues Corsi Shots Dist.", new_vi_names)
new_vi_names <- gsub("med_shot_angle_st_louis_blues", "Blues Corsi Shots Angle", new_vi_names)
new_vi_names <- gsub("_", " ", new_vi_names)
new_vi_names <- toupper(new_vi_names)
new_vi_names <- case_when(
  new_vi_names == "OPPONENT HIT" ~ "OPPONENT HITS",
  new_vi_names == "BLUES HIT" ~ "BLUES HITS",
  new_vi_names == "BLUES FACEOFF" ~ "BLUES FACEOFF WINS",
  grepl("(GIVE|TAKE)AWAY", new_vi_names) ~ sprintf("%sS", new_vi_names),
  grepl(" SHOT(?!S)", new_vi_names, perl = T) ~ gsub(" SHOT", " SHOTS", new_vi_names),
  TRUE ~ new_vi_names
)

names(t0$fit$variable.importance) <- new_vi_names

t0 %>% 
  vip(num_features = 10) +
  geom_col(fill = "#002e88") +
  scale_x_discrete(expand = expansion(add = 0)) +
  scale_y_continuous(expand = expansion(add = 0)) +
  labs(y = "Variable Importance",
       title = "What Explains The Blues '18-'19 Turnaround?",
       subtitle = "Variable Importance from Random Forest Model",
       caption = "verbumdata.netlify.com\nSource: jozefhajnala/nhlapi package") +
  theme(plot.title.position = "plot",
        plot.caption.position = "plot")

# now create difference plots

orig_vip <- final_wf %>% 
  pull_workflow_fit() 

var_lookup <- tibble(
  dirty_vars = names(sort(orig_vip$fit$variable.importance, decreasing = T)[1:6]),
  clean_vars = names(sort(t0$fit$variable.importance, decreasing = T)[1:6])
)

chg_plot <- diff_reg[, colnames(diff_reg) %in% c("game_number", "epoch", var_lookup$dirty_vars)] %>% 
  pivot_longer(-c(game_number, epoch), names_to = "dirty_vars", values_to = "value") %>% 
  left_join(var_lookup, by = "dirty_vars") %>% 
  mutate(clean_vars = str_wrap(clean_vars, 11),
         clean_vars = factor(clean_vars, levels = str_wrap(var_lookup$clean_vars, 11)),
         epoch = ifelse(epoch == "good", "*Post*-Turn", "*Pre*-Turn"),
         epoch = factor(epoch, levels = unique(epoch)))

calc_boxplot_stat <- function(x) {
  coef <- 1.5
  n <- sum(!is.na(x))
  # calculate quantiles
  stats <- quantile(x, probs = c(0.0, 0.25, 0.5, 0.75, 1.0))
  names(stats) <- c("ymin", "lower", "middle", "upper", "ymax")
  iqr <- diff(stats[c(2, 4)])
  # set whiskers
  outliers <- x < (stats[2] - coef * iqr) | x > (stats[4] + coef * iqr)
  if (any(outliers)) {
    stats[c(1, 5)] <- range(c(stats[2:4], x[!outliers]), na.rm = TRUE)
  }
  return(stats)
}

# interesting to note, we see it here: http://www.hockeyanalytics.com/Research_files/NHL-Expected-Goals-Brian-Macdonald.pdf

ggplot(chg_plot, aes(x = epoch, y = value)) +
  stat_boxplot(geom ='errorbar', width = 0.25) +
  stat_summary(
    fun.data = calc_boxplot_stat, 
    geom = "boxplot", 
    fill = "#ffc72b", 
    width = 0.6, 
    size = 1,
    color = "#002e88"
    ) + 
  facet_wrap(~ clean_vars, scales = "free", nrow = 2) +
  labs(x = "",
       y = "",
       title = 'Pre & Post Turnaround Distributions for the Blues\' "Most Important" Explanatory Variables',
       subtitle = "Top 6 Variables, Outliers Removed!",
       caption = "verbumdata.netlify.com\nSource: jozefhajnala/nhlapi package") +
  theme(axis.text.x = ggtext::element_markdown(),
        strip.text = element_text(face = "bold"),
        plot.title.position = "plot",
        plot.caption.position = "plot")

# t test for means of corsi shots

t.test(
  diff_reg %>% filter(epoch == "bad") %>% pull(med_shot_dist_st_louis_blues),
  diff_reg %>% filter(epoch == "good") %>% pull(med_shot_dist_st_louis_blues)
)