You can download this .Rmd file below if you’d like to follow along. I do have a few hidden notes you can disregard. This document is a distill_article, so you may want to change to an html_document to knit. You will also need to delete any image references to properly knit, since you won’t have those images.
There is also a resource that doesn't use the stacks package but gives a good overview of the stacking process. Here, we load the libraries we will use and the King County house price data.
library(tidyverse) # for reading in data, graphing, and cleaning
library(tidymodels) # for modeling ... tidily
library(stacks) # for stacking models
library(glmnet) # for regularized regression, including LASSO
library(ranger) # for random forest model
library(kknn) # for knn model
library(naniar) # for examining missing values (NAs)
library(lubridate) # for date manipulation
library(moderndive) # for King County housing data
library(vip) # for variable importance plots
library(rmarkdown) # for paged tables
theme_set(theme_minimal()) # my favorite ggplot2 theme :)
data("house_prices")
house_prices %>%
slice(1:5)
The general idea of stacking is to combine predictions from many different models (called candidate models, candidate ensemble members, base learners, and probably other names I don’t know yet) into one “super” predictor. For example, we could fit a regression model, a decision tree, and a KNN model to the King County house price data and then average the predicted values from each of those models for a final prediction. That’s a simplified version of what we’ll do, but if you understand that, you’ll understand stacking. We will use the stacks package to help us out. I think it has one of the greatest hex stickers!
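To make that idea concrete, here is a tiny sketch of the averaging version of stacking with made-up numbers. The vectors below are hypothetical predicted log prices for three houses, not output from any model we actually fit.
# Hypothetical predictions for three houses from three candidate models (made-up numbers)
pred_lm   <- c(5.70, 5.85, 6.02)
pred_tree <- c(5.68, 5.90, 5.98)
pred_knn  <- c(5.72, 5.88, 6.05)
# The simplest possible "stack": average the candidates' predictions for each house
pred_stacked <- (pred_lm + pred_tree + pred_knn) / 3
pred_stacked
Stacking with the stacks package is more sophisticated than a plain average - it learns a weight for each candidate model - but the basic idea is the same.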
We have seen something similar to this before: bagging and random forest models. Let’s do a little review and remember what a random forest model is by using our new tidymodels functions to fit one to the King County house price data.
Because I am going to use this later as one of the stacking models and some of those models will predict log(price), I am also going to predict log(price) in this model, even though I wouldn’t have to.
set.seed(327) #for reproducibility
house_prices <- house_prices %>%
mutate(log_price = log(price, base = 10)) %>%
select(-price)
# Randomly assigns 75% of the data to training.
house_split <- initial_split(house_prices,
prop = .75)
house_split
<Analysis/Assess/Total>
<16209/5404/21613>
#<training/testing/total>
house_training <- training(house_split)
house_testing <- testing(house_split)
First, create the recipe. With decision trees (and therefore random forests), we don’t have to do as much prepping of the data.
ranger_recipe <-
recipe(formula = log_price ~ .,
data = house_training) %>%
step_date(date,
features = "month") %>%
# Make these evaluative variables, not included in modeling
update_role(all_of(c("id",
"date")),
new_role = "evaluative")
Now, define the model. We could tune the mtry (this is \(m\), the number of randomly sampled predictors considered at each split), min_n, and trees parameters, but to simplify, we’ll just set them to specific values.
We’ll use mtry = 6, which is about 1/3 of the number of predictor variables and a commonly chosen value of mtry; min_n = 10, the minimum number of data points in a node required for the node to be split further (i.e., anything smaller and it won’t be split any more); and trees = 200, the number of decision trees that will be used in the ensemble.
ranger_spec <-
rand_forest(mtry = 6,
min_n = 10,
trees = 200) %>%
set_mode("regression") %>%
set_engine("ranger")
Next, we create the workflow:
ranger_workflow <-
workflow() %>%
add_recipe(ranger_recipe) %>%
add_model(ranger_spec)
And then we fit the model (be sure to have the ranger library installed first, although it doesn’t need to be loaded):
ranger_fit <- ranger_workflow %>%
fit(house_training)
ranger_fit
══ Workflow [trained] ════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ──────────────────────────────────────────────────────
1 Recipe Step
• step_date()
── Model ─────────────────────────────────────────────────────────────
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~6, x), num.trees = ~200, min.node.size = min_rows(~10, x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 200
Sample size: 16209
Number of independent variables: 19
Mtry: 6
Target node size: 10
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.005843126
R squared (OOB): 0.8877044
With random forests, we can estimate the prediction error using the “out-of-bag” error (OOB error). Remember, with bootstrap samples, we sample with replacement, so some observations are in the sample multiple times while others aren’t in it at all. The OOB observations are those that are not in the bootstrap sample. We find the predicted value for each OOB observation by averaging its predictions over the trees where it was OOB. The OOB MSE is part of the fit() output, and we can take the square root to get the RMSE.
# OOB error (MSE) ... yeah, it took me a while to find that.
ranger_fit$fit$fit$fit$prediction.error
[1] 0.005843126
#OOB RMSE
sqrt(ranger_fit$fit$fit$fit$prediction.error)
[1] 0.07644034
We will also evaluate this using cross-validation because we’ll need that later when we stack models. I’m using the same seed I used in the previous tutorial when we fit a linear regression because I want the same folds.
We also define a couple of things that will be used in stacking. The metric_set() tells it we are using RMSE. We also use the control_stack_resamples() function to assure that the assessment set predictions and the workflow used to fit the resamples are stored.
set.seed(1211) # for reproducibility
house_cv <- vfold_cv(house_training, v = 5)
metric <- metric_set(rmse)
ctrl_res <- control_stack_resamples()
ranger_cv <- ranger_workflow %>%
fit_resamples(house_cv,
metrics = metric,
control = ctrl_res)
# Evaluation metrics averaged over all folds:
collect_metrics(ranger_cv)
Notice that the cross-validated RMSE is about the same as the OOB RMSE, but the cross-validated version takes longer to run (it is fitting 200 x 5 decision trees!). This is also a lot smaller than the RMSE obtained from the regular linear regression model or the lasso model.
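If you want to see the two numbers side by side, here is a quick sketch that pulls the cross-validated RMSE out of collect_metrics() and compares it to the OOB RMSE, reusing the objects created above.
# Cross-validated RMSE (averaged over the 5 folds)
cv_rmse <- collect_metrics(ranger_cv) %>%
  filter(.metric == "rmse") %>%
  pull(mean)
# OOB RMSE from the fitted ranger model
oob_rmse <- sqrt(ranger_fit$fit$fit$fit$prediction.error)
tibble(cv_rmse = cv_rmse, oob_rmse = oob_rmse)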
Just for fun, before stacking, let’s also apply this model to the test data and plot the predicted vs. actual log price. Compared to this same graph from the linear regression model, this looks much better! There are still some issues with really small and really large values, but not nearly as bad.
ranger_workflow %>%
last_fit(house_split) %>%
collect_predictions() %>%
ggplot(aes(x = log_price,
y = .pred)) +
geom_point(alpha = .5,
size = .5) +
geom_smooth(se = FALSE) +
geom_abline(slope = 1,
intercept = 0,
color = "darkred") +
labs(x = "Actual log(price)",
y = "Predicted log(price)")
In order to create a stacked model, we first have to create the individual models. We have the random forest we just created, ranger_cv. Let’s also recreate the lasso model from the previous tutorial, but add a couple of missing pieces. I do that below. We will end up with 10 lasso models that will flow into the stacked model, since we tune 10 penalty parameters.
# lasso recipe and transformation steps
house_recipe <- recipe(log_price ~ .,
data = house_training) %>%
step_rm(sqft_living15, sqft_lot15) %>%
step_log(starts_with("sqft"),
-sqft_basement,
base = 10) %>%
step_mutate(grade = as.character(grade),
grade = fct_relevel(
case_when(
grade %in% "1":"6" ~ "below_average",
grade %in% "10":"13" ~ "high",
TRUE ~ grade
),
"below_average","7","8","9","high"),
basement = as.numeric(sqft_basement == 0),
renovated = as.numeric(yr_renovated == 0),
view = as.numeric(view == 0),
waterfront = as.numeric(waterfront),
age_at_sale = year(date) - yr_built)%>%
step_rm(sqft_basement,
yr_renovated,
yr_built) %>%
step_date(date,
features = "month") %>%
update_role(all_of(c("id",
"date",
"zipcode",
"lat",
"long")),
new_role = "evaluative") %>%
step_dummy(all_nominal(),
-all_outcomes(),
-has_role(match = "evaluative")) %>%
step_normalize(all_predictors(),
-all_nominal())
#define lasso model
house_lasso_mod <-
linear_reg(mixture = 1) %>%
set_engine("glmnet") %>%
set_args(penalty = tune()) %>%
set_mode("regression")
# create workflow
house_lasso_wf <-
workflow() %>%
add_recipe(house_recipe) %>%
add_model(house_lasso_mod)
# penalty grid - changed to 10 levels
penalty_grid <- grid_regular(penalty(),
levels = 10)
# add ctrl_grid - assures predictions and workflows are saved
ctrl_grid <- control_stack_grid()
# tune the model using the same cv samples as random forest
house_lasso_tune <-
house_lasso_wf %>%
tune_grid(
resamples = house_cv,
grid = penalty_grid,
metrics = metric,
control = ctrl_grid
)
We will create one more model type - K-nearest neighbors. We can use the same recipe and steps as in the lasso model. So, we define the model, leaving the number of neighbors (k) to be tuned; create the workflow; and tune the model (install the kknn package first - it doesn’t need to be loaded):
# create a model definition
knn_mod <-
nearest_neighbor(
neighbors = tune("k")
) %>%
set_engine("kknn") %>%
set_mode("regression")
# create the workflow
knn_wf <-
workflow() %>%
add_model(knn_mod) %>%
add_recipe(house_recipe)
# tune it over a grid of 4 candidate values of k
knn_tune <-
knn_wf %>%
tune_grid(
house_cv,
metrics = metric,
grid = 4,
control = ctrl_grid
)
Now that we have our set of candidate models (1 random forest, 10 lasso, and 4 knn), we’re ready to stack! We’re going to follow the process laid out on the stacks webpage.
First, create the stack. Notice in the message that it has removed some of the candidate models, likely because they were too similar to other lasso models.
house_stack <-
stacks() %>%
add_candidates(ranger_cv) %>%
add_candidates(house_lasso_tune) %>%
add_candidates(knn_tune)
We can look at the predictions from the candidate models in a tibble. The first column is the actual outcome, log_price, and the other columns are the predictions from the various models. They are given names that make sense - how nice! Note that with classification models, there will be a column for each model for each level of the outcome variable. So, for a binary response there would be two prediction columns for each model.
as_tibble(house_stack)
Now, as mentioned at the beginning, we want to blend the predictions from each model together to form an even better overall prediction. LASSO will be used to do this via the blend_predictions() function.
From the documentation, blend_predictions() “evaluates a data stack by fitting a regularized model on the assessment predictions from each candidate member to predict the true outcome. This process determines the ‘stacking coefficients’ of the model stack. The stacking coefficients are used to weight the predictions from each candidate (represented by a unique column in the data stack), and are given by the betas of a LASSO model fitting the true outcome with the predictions given in the remaining columns of the data stack.”
Doing this with our models, we see that only three of the candidate models have non-zero stacking coefficients:
house_blend <-
house_stack %>%
blend_predictions()
house_blend
# A tibble: 3 × 3
member type weight
<chr> <chr> <dbl>
1 ranger_cv_1_1 rand_forest 1.02
2 knn_tune_1_4 nearest_neighbor 0.0173
3 knn_tune_1_3 nearest_neighbor 0.0138
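To make the weights concrete, here is a rough sketch of how the blend combines member predictions. The prediction vectors are hypothetical (made-up numbers), and I’m ignoring any intercept the LASSO blend may include, so treat this as an illustration of the weighted sum rather than the exact computation.
# Hypothetical member predictions for three houses (made-up numbers)
pred_rf    <- c(5.71, 5.86, 6.01)
pred_knn_4 <- c(5.69, 5.88, 6.03)
pred_knn_3 <- c(5.73, 5.84, 5.99)
# Blended prediction: weighted sum using the stacking coefficients shown above
blended <- 1.02 * pred_rf + 0.0173 * pred_knn_4 + 0.0138 * pred_knn_3
blended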
We can see the RMSE for the various penalty parameters:
house_blend$metrics %>%
filter(.metric == "rmse")
We can examine some plots to see if we need to adjust the penalty parameter at all. First, a set of three plots with penalty on the x-axis. We seem to have captured the smallest RMSE, so I won’t adjust the penalty at all. You can change the penalty as an argument in the blend_predictions() function (a sketch of that follows the plot below).
autoplot(house_blend)
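If we did want to try different penalty values, blend_predictions() accepts a penalty argument. The values below are just an illustration of how you might pass a custom set, not a recommendation.
# Re-blend using a custom set of penalty values (illustrative values only)
house_blend_custom <-
  house_stack %>%
  blend_predictions(penalty = 10^seq(-3, -0.5, length.out = 10))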
We can also examine the plot with the average number of members on the x-axis. You might wonder where the average comes from - it is because we are using cross-validation, so we get different results for each penalty depending on the fold being left out. (Thanks to Max Kuhn and Simon Couch’s quick response to my request to fix the axes of the plot - yay!).
autoplot(house_blend, type = "members")
The last plot we might examine shows the blending weights for the top ensemble members.
autoplot(house_blend, type = "weights")
The last thing we need to do is fit the candidate models with non-zero stacking coefficients to the full training data. We use fit_members() to do that.
house_final_stack <- house_blend %>%
fit_members()
house_final_stack
# A tibble: 3 × 3
member type weight
<chr> <chr> <dbl>
1 ranger_cv_1_1 rand_forest 1.02
2 knn_tune_1_4 nearest_neighbor 0.0173
3 knn_tune_1_3 nearest_neighbor 0.0138
Now we can use this just as we would use a single model. For example, we can predict with it on new data and create the same predicted vs. actual plot we made above.
house_final_stack %>%
predict(new_data = house_testing) %>%
bind_cols(house_testing) %>%
ggplot(aes(x = log_price,
y = .pred)) +
geom_point(alpha = .5,
size = .5) +
geom_smooth(se = FALSE) +
geom_abline(slope = 1,
intercept = 0,
color = "darkred") +
labs(x = "Actual log(price)",
y = "Predicted log(price)")