Last week we looked at Zillow’s home sales data to understand the recent slowdown. We concluded that affordability problems have indeed weighed a bit on the housing market. More expensive markets have cooled more than cheaper markets.
Sometimes one is better lucky than good. Last night, I encountered a fantastic introduction to a 2018 R bombshell: the fable package. Here is the video and here are the slides from Mitch and Rob’s presentation. Big thanks are due to them and to Earo Wang, credited for her wonderful work on the tsibble package.
This post is all about test driving fable. Luckily, Zillow publishes unadjusted home sales estimates. We use the raw sales counts data from Zillow to try out the new time series titan!
Start with import & processing
# you might need library(devtools) then install_github("tidyverts/fable") & install_github("tidyverts/tsibbledata")
library(fable)
library(tsibbledata)
library(tidyverse)
library(scales)
library(janitor)
# import zillow data
zillow.import <- read_csv("http://files.zillowstatic.com/research/public/State/Sale_Counts_State.csv")
# process zillow data
zillow.processed <- zillow.import %>%
  janitor::clean_names() %>%                                         # e.g. "2008-03" becomes x2008_03
  select(-region_id, -size_rank) %>%
  gather(date, value, x2008_03:ncol(.)) %>%                          # wide monthly columns -> long format
  mutate(date = str_remove(date, "x"),                               # strip the "x" prefix
         date = lubridate::ymd(paste0(gsub("_", "-", date), "-01")), # "2008_03" -> "2008-03-01"
         region_name = tolower(region_name))
# visual check for seasonality
zillow.processed %>%
  filter(region_name %in% "california") %>%
  ggplot(aes(date, value)) +
  geom_line()
Maiden fable … Mable?
Looks good. Now we can try out fable. First we need to create a tsibble object.
zillow.tsib <- zillow.processed %>%
  mutate(date = yearmonth(date)) %>%   # IMPORTANT mutation, took a few tries to get right
  as_tsibble(key = id(region_name),
             index = date)
zillow.tsib
## # A tsibble: 6,400 x 3 [1M]
## # Key:       region_name [50]
##    region_name date      value
##    <chr>       <mth>     <dbl>
##  1 alabama     2008 Mar   3347
##  2 alabama     2008 Apr   3664
##  3 alabama     2008 May   3667
##  4 alabama     2008 Jun   3863
##  5 alabama     2008 Jul   3879
##  6 alabama     2008 Aug   3614
##  7 alabama     2008 Sep   3448
##  8 alabama     2008 Oct   3233
##  9 alabama     2008 Nov   2253
## 10 alabama     2008 Dec   2763
## # ... with 6,390 more rows
This is super cool! We have a clean, ts()-style time series object in tidy format.
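As a quick aside (not part of the original workflow), the tidy structure means ordinary verbs keep working on the keyed series. Here is a minimal sketch that rolls the monthly index up to annual totals with tsibble’s index_by(), assuming the zillow.tsib object built above:

# a minimal sketch: annual home sales per state, computed directly on the tsibble
zillow.tsib %>%
  group_by(region_name) %>%                               # keep the key while summarising
  index_by(year = lubridate::year(as.Date(date))) %>%     # collapse the monthly index to years
  summarise(annual_sales = sum(value, na.rm = TRUE)) %>%
  filter(region_name == "california")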
Test driving in California
Now we can produce a few plots for the historical fit.
# fit a simple ETS model
cal.fit <- zillow.tsib %>%
  filter(region_name %in% "california") %>%
  fable::ETS(log(value) ~ season("A"))
# now we can plot the actual versus the fitted
cal.fit %>%
  augment() %>%
  select(-.resid) %>%
  rename(home_sales = `log(value)`) %>%
  gather(type, value, home_sales:.fitted) %>%
  mutate(value = exp(value)) %>%
  ggplot(aes(date, value, color = type, alpha = type)) +
  geom_line(size = 1) +
  geom_smooth(aes(group = 1),
              method = "lm",
              se = FALSE,
              color = "black",
              size = 0.3,
              linetype = "dashed") +
  scale_y_continuous(labels = scales::comma) +
  scale_color_manual(values = c("dodgerblue4", "darkolivegreen")) +
  scale_alpha_manual(values = c(0.75, 1.0),
                     guide = FALSE) +
  labs(x = "",
       y = "Home Sales Unadjusted",
       title = "Fitting an historical model to California home sales",
       subtitle = "ETS(M,Ad,A) fit on Zillow's all home sales data since 2008",
       caption = "verbumdata.netlify.com\nSource: Zillow") +
  theme(legend.title = element_blank(),
        legend.position = "top")
# finally we can look at the decomposed time series
cal.fit %>%
  components() %>%
  gather(component, value, level, slope, season) %>%
  ggplot(aes(x = date, y = value)) +
  geom_line() +
  facet_grid(vars(component), scales = "free_y")
No modeling process would be complete without studying accuracy. I had some issues with the fable::accuracy function. I wouldn’t be surprised to learn this was largely user error, but that is why you see the cumbersome calls to the data structures underlying the model object.
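In the meantime, a rough workaround is to compute a few accuracy metrics by hand from the augmented fit. This is only a sketch: it assumes augment() exposes the `log(value)` and .fitted columns exactly as they appeared in the plotting code above, with the metrics reported on the back-transformed sales scale.

# a hand-rolled sketch of in-sample accuracy, back-transformed from the log scale
cal.fit %>%
  augment() %>%
  as_tibble() %>%
  mutate(actual = exp(`log(value)`),
         fitted = exp(.fitted),
         error  = actual - fitted) %>%
  summarise(ME   = mean(error),
            RMSE = sqrt(mean(error^2)),
            MAE  = mean(abs(error)),
            MAPE = 100 * mean(abs(error / actual)))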
# look at residuals
cal.fit$model[[1]]$est$.resid %>% forecast::checkresiduals()
# look at accuracy
cal.fit$model[[1]]$fit
## # A tibble: 1 x 9
##   method      period   sigma logLik   AIC  AICc   BIC     MSE    AMSE
##   <chr>        <dbl>   <dbl>  <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>
## 1 ETS(M,Ad,A)     12 0.00651   42.3 -48.7 -42.4  2.66 0.00404 0.00484
This isn’t the best model ever fit. The residuals are fairly normally distributed, but the ACF plot shows that a real analysis would need more transformations. Note the two most recent (sharply negative) residuals, though: the surprise slowdown in California home sales shows up clearly.
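If you want a single number to go with that ACF plot, base R’s Box.test() gives a standalone Ljung-Box check on the same residual vector (checkresiduals() above reports one too). This is just a sketch; the lag of 24, two seasonal cycles of monthly data, is my assumption.

# a sketch: Ljung-Box test for leftover autocorrelation in the ETS residuals
resid.vec <- cal.fit$model[[1]]$est$.resid
Box.test(resid.vec, lag = 24, type = "Ljung-Box")   # a small p-value means autocorrelation remains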
Let’s now make a forecast!
# add in the forecast piece
cal.fit.fc <- cal.fit %>%
  forecast(h = 12)
# plot
zillow.tsib %>%
  filter(region_name %in% "california") %>%
  ggplot(aes(date, value)) +
  geom_line(color = "darkolivegreen", size = 1) +
  geom_forecast(aes(ymin = lower, ymax = upper, level = level),
                fortify(cal.fit.fc), stat = "identity") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  labs(x = "",
       y = "Home Sales Unadjusted",
       title = "Forecasting California home sales with an ETS model",
       subtitle = "ETS(M,Ad,A) fit on Zillow's all home sales data since 2008 + forecast",
       caption = "verbumdata.netlify.com\nSource: Zillow") +
  theme(legend.position = "top")
Pretty interesting here. The model sees 2019 getting back to only ~39,000 home sales in the peak month of activity (June). That peak would be the lowest since 2011! The top end of the model estimate gets us back only to 2017’s peak. The trend does not appear to be a home seller’s friend in the Golden State.
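If you want the numbers behind that peak-month claim, they can be pulled from the same fortified forecast object that feeds the plot. This is only a sketch: the index and interval column names (date, lower, upper, level) are assumed from the aes() mapping above, and the point-forecast column may be named differently depending on your version of fable.

# a sketch: inspect the June 2019 forecast used in the plot above
fortify(cal.fit.fc) %>%
  as_tibble() %>%
  filter(lubridate::month(as.Date(date)) == 6) %>%
  select(date, lower, upper, level)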
In the future, maybe we can look at modeling many states. Thanks to Mitch and Rob (and others I’m sure!) for this pkg. Lots more to explore and learn.
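As a teaser for that follow-up: because zillow.tsib is keyed by region_name, batch modeling should amount to dropping the California filter. The sketch below reuses the same early-fable calling convention as above, so treat it as a starting point rather than a recipe.

# a sketch: one ETS model per state, fit in a single call on the keyed tsibble
all.fit <- zillow.tsib %>%
  fable::ETS(log(value) ~ season("A"))

# 12-month-ahead forecasts for every state at once
all.fc <- all.fit %>%
  forecast(h = 12)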