Boosting

This tutorial gives an overview of boosted trees and how to implement them in tidymodels.

Follow along

You can download this .Rmd file below if you’d like to follow along. It contains a few hidden notes you can disregard. This document is a distill_article, so you may want to change the output to html_document in order to knit it. You will also need to delete any image references before knitting, since you won’t have those images.

Resources

Set up

First, we load the libraries we will use. There will be some new ones you’ll need to install.

library(tidyverse)         # for reading in data, graphing, and cleaning
library(tidymodels)        # for modeling ... tidily
library(usemodels)         # for tidymodels suggestions
library(xgboost)           # for boosting - needs to be installed; parsnip calls it for us, so loading it is optional
library(doParallel)        # for parallel processing
library(vip)               # for quick vip plot
library(lubridate)         # for dates
library(moderndive)        # for King County housing data
library(patchwork)         # for combining plots nicely
library(rmarkdown)         # for paged tables
theme_set(theme_minimal()) # my favorite ggplot2 theme :)

Then we load the data we will use throughout this tutorial and do some modifications. As I mentioned before, I wouldn’t need to take the log here, but I do so I can compare to other models, if desired.

data("house_prices")

# Create log_price and drop price variable
house_prices <- house_prices %>% 
  mutate(log_price = log(price, base = 10)) %>% 
  # make all integers numeric ... fixes prediction problem
  mutate(across(where(is.integer), as.numeric)) %>% 
  select(-price, -id)

Introduction

Boosting is a machine learning algorithm that is similar to bagging or random forests, except that the trees are NOT independent of one another. Instead, information from the trees already grown is used to build the next tree. The procedure is outlined below (based on Algorithm 8.2 from ISLR).

Initialize: In this step, we set the fitted value of every observation to \(0\), \(\hat{f}(x) = 0\), and set the residuals to the observed values of the response variable, \(r_i = y_i\). We can think of this as step \(0\) and label it \(\hat{f}^0(x) = 0\).

Iteration 1:

  1. Fit a tree, \(\hat{f}^1\), with \(d\) splits (\(d\) will be a parameter we tune in the model). We fit this tree to the residuals, \(r_i\), using all predictor variables (although only a small number will actually appear in the tree, depending in part on \(d\)). Remember, in this first iteration, \(r_i = y_i\).
  2. Update the fitted values using \[ \hat{f}(x) = \hat{f}(x) + \lambda \hat{f}^1(x), \] where \(\lambda\) is a shrinkage parameter, which we will also tune. This is a small positive number, typically something like .001 or .01. In this first iteration, the \(\hat{f}(x)\) on the right-hand side of the equation is \(0\), so the updated fitted values are simply \(\lambda \hat{f}^1(x)\).
  3. Update the residuals, \[ r_i = r_i - \lambda \hat{f}^1(x_i). \]
    Since the residuals were initially set to \(r_i = y_i\), after this first iteration \(r_i = y_i - \lambda \hat{f}^1(x_i)\).

Iteration 2:

  1. Fit the next tree, \(\hat{f}^2\), with \(d\) splits, to the residuals \(r_i\) updated in the previous iteration.
  2. As in iteration 1, update the fitted values using \[ \hat{f}(x) = \hat{f}(x) + \lambda \hat{f}^2(x). \] So, after 2 iterations, \[ \hat{f}(x) = \lambda \hat{f}^1(x) + \lambda \hat{f}^2(x), \] where \(\hat{f}^2(x)\) are the fitted values from the 2nd tree, which was fit to the updated residuals.
  3. Update the residuals, \[ r_i = r_i - \lambda \hat{f}^2(x_i). \]
    So, after the 2nd iteration, \(r_i = y_i - \lambda \hat{f}^1(x_i) - \lambda \hat{f}^2(x_i)\).

Iteration t:

  1. Fit the next tree, \(\hat{f}^t\), with \(d\) splits, to the residuals \(r_i\) updated in the previous iteration.

  2. Update the fitted values to \[ \hat{f}(x) = \hat{f}(x) + \lambda \hat{f}^t(x). \]

  3. Update the residuals, \[ r_i = r_i - \lambda \hat{f}^t(x_i). \]

Final model:

The final model is defined below, where \(T\) is the number of trees. \(T\) is another parameter that is tuned.

\[ \hat{f}(x) = \sum_{j = 1}^{T} \lambda \hat{f}^j(x) \]
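To make the algorithm more concrete, here is a small illustrative sketch of the loop above on a made-up dataset, using rpart to fit each small tree to the current residuals. This is only meant to mirror the steps described (the data, \(\lambda\), tree size, and number of trees are arbitrary choices); it is not how xgboost is actually implemented.

library(rpart)

# toy data: one predictor, a wiggly response
set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

lambda  <- 0.01   # shrinkage parameter
n_trees <- 500    # T, the number of trees
depth   <- 2      # keeps each tree small (stands in for d)

fitted <- rep(0, nrow(toy))  # initialize f-hat(x) = 0
r      <- toy$y              # initialize residuals r_i = y_i

for (t in seq_len(n_trees)) {
  dat <- toy
  dat$r <- r
  # 1. fit a small tree to the current residuals
  tree_t <- rpart(r ~ x, data = dat,
                  control = rpart.control(maxdepth = depth, cp = 0))
  step_t <- predict(tree_t, dat)
  # 2. add a shrunken version of the new tree to the fitted values
  fitted <- fitted + lambda * step_t
  # 3. update the residuals
  r <- r - lambda * step_t
}

# the final fitted values are the sum of the shrunken trees
head(data.frame(y = toy$y, boosted_fit = fitted))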

I have skipped over a lot of details. Please see the resources above for further depth. A similar algorithm is used for classification, although it is slightly more involved because the residuals are not as straightforward to compute.

Implementing in tidymodels

With the basics of how this model works in mind, let’s jump into how to set this up using our tidymodels toolkit.

We’ll once again be using the King County housing data, which was prepped above. The next step is to split the data.

set.seed(327) #for reproducibility

# Randomly assigns 75% of the data to training.
house_split <- initial_split(house_prices, 
                             prop = .75)
house_training <- training(house_split)
house_testing <- testing(house_split)

Next, we do some transformations. Running use_xgboost() prints scaffolding code that we can adapt; following its suggestion, I use one-hot dummy encoding.

use_xgboost(log_price ~ ., 
            data = house_training)
xgboost_recipe <- 
  recipe(formula = log_price ~ ., data = house_training) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) 

xgboost_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("xgboost") 

xgboost_workflow <- 
  workflow() %>% 
  add_recipe(xgboost_recipe) %>% 
  add_model(xgboost_spec) 

set.seed(77987)
xgboost_tune <-
  tune_grid(xgboost_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))

Everything above, through the tune_grid() call with its stop() placeholders, is the template that use_xgboost() prints; it is a starting point rather than code to run as-is. Below, I adapt those suggestions into my own recipe, adding a month feature from the date and changing the role of date so it is not used as a predictor.

# set up recipe and transformation steps and roles
boost_recipe <- 
  recipe(formula = log_price ~ ., 
         data = house_training) %>% 
  step_date(date, 
            features = "month") %>% 
  # Make these evaluative variables, not included in modeling
  update_role(date,
              new_role = "evaluative") %>% 
  step_novel(all_nominal_predictors()) %>% # recently learned about this helper 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)  %>% 
  step_zv(all_predictors()) 
boost_recipe %>% 
  prep() %>% 
  juice() 
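
If you want to double-check that date is no longer treated as a predictor, summary() on the prepped recipe lists each variable's role (just a quick sanity check, not a required step):

boost_recipe %>% 
  prep() %>% 
  summary() %>% 
  filter(role != "predictor")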

Next, we specify the model. This is where we can set the parameters we would like to tune. You can read details about what each of the arguments means here. We have seen some of these before with random forests. I initially tried tuning three of the parameters, but tune_grid() ran for over 90 minutes. So, I’m using the advice in HOML and tuning the learning rate first. (It also seems the definition of mtry might have changed since I last used it. I have a question posted now and hope to have it answered soon.)

boost_spec <- boost_tree(
  trees = 1000,            # number of trees, T in the equations above
  tree_depth = 2,          # maximum depth of each tree (limits the number of splits, d above)
  min_n = 5,               # minimum observations required to split a node further
  loss_reduction = 10^-5,  # minimum loss reduction required to split further; smaller values allow more splits
  sample_size = 1,         # proportion of training data sampled for each tree (1 = all of it)
  learn_rate = tune(),     # lambda from the equations above
  stop_iter = 50           # number of iterations without improvement before stopping
) %>% 
  set_engine("xgboost", colsample_bytree = 1) %>% # colsample_bytree = proportion of predictors sampled per tree (1 = all); use this rather than mtry in boost_tree()
  set_mode("regression")
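
If you are curious how these parsnip arguments map onto xgboost's own parameter names (for example, learn_rate becomes eta), translate() shows the underlying call that will be generated. This is purely for inspection:

boost_spec %>% 
  translate()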

Now, we create a grid of values to try in tuning. We’ll use this later.

boost_grid <- grid_regular(learn_rate(),
                           levels = 10)
boost_grid
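
By default, learn_rate() covers a very wide range of values on the log10 scale. If we wanted to concentrate the grid on larger learning rates, we could pass our own range; the endpoints below are just an example (they are log10 values, so this spans roughly 0.001 to 0.3):

# a narrower grid of learning rates, still spaced on the log10 scale
boost_grid_narrow <- grid_regular(learn_rate(range = c(-3, -0.5)),
                                  levels = 10)
boost_grid_narrow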

And, let’s put the recipe and model specification into a workflow.

boost_wf <- workflow() %>% 
  add_recipe(boost_recipe) %>%
  add_model(boost_spec)  

Because boosting is time-consuming, even with just one tuning parameter, I am going to use a single validation sample rather than our usual (and better) cross-validation resampling. I use 40% of the training data for the validation set.

set.seed(494)
val_split <- validation_split(house_training, 
                              prop = .6)
val_split
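
If you have more time (or more patience), you could swap the validation split for our usual cross-validation resamples; a minimal sketch, where 5 folds is just an example:

# alternative: 5-fold cross-validation instead of a single validation set
set.seed(494)
house_cv <- vfold_cv(house_training, v = 5)
# house_cv could then be passed to tune_grid() below in place of val_split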

Now we can train these models. I also saved the predictions. This takes less than a minute to run on my computer (it used to take more than 2 minutes before I added registerDoParallel(), so it helps!).

set.seed(494)
registerDoParallel() # Meant to help with parallelizing (is that a word?) ... not sure how much it helps on my computer

boost_tune <- tune_grid(
  boost_wf, 
  val_split,
  grid = boost_grid,
  control = control_grid(save_pred = TRUE)
)
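
If you want more control over the parallel backend than a bare registerDoParallel() gives you, you can register an explicit cluster and shut it down when you are finished; the number of workers below is just an example:

# register an explicit cluster with a chosen number of workers
cl <- parallel::makePSOCKcluster(4)
doParallel::registerDoParallel(cl)

# ... run tune_grid() as above ...

# release the workers when finished
parallel::stopCluster(cl)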

Let’s look at the results. We can see that larger learning rates actually seem to do better.

collect_metrics(boost_tune)
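
If you only want the top few candidates by rmse, show_best() sorts and truncates the metrics for you:

boost_tune %>% 
  show_best(metric = "rmse")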

We could also graph these results.

collect_metrics(boost_tune) %>% 
  filter(.metric == "rmse") %>% 
  ggplot(aes(x = learn_rate, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  labs(y = "rmse") +
  theme_minimal()
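
The tune package also has an autoplot() method that produces a similar plot with less code, if you prefer:

autoplot(boost_tune)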

We could use this learning rate and go back and try to optimize other parameters. I am going to skip that. Let’s select the best learning rate parameter and finalize the model.

# best learning rate, based on rmse
best_lr <- select_best(boost_tune, "rmse")
best_lr
# finalize workflow
final_boost_wf <- finalize_workflow(
  boost_wf,
  best_lr
)

# fit final
final_boost <- final_boost_wf %>% 
  fit(data = house_training)

And let’s take a quick look at important predictors.

final_boost %>% 
  pull_workflow_fit() %>%
  vip(geom = "col")

At this point we could save the final_boost model and use it on new data. For now I will apply it to the test data and visualize the results. (On a side note, when I fit the model on the training data and applied to test data using last_fit(), I got slightly different results. It is discussed here and here, and I need to further investigate.)

# Use model on test data
test_preds <- house_testing %>% 
  bind_cols(predict(final_boost, new_data = house_testing)) 

# Compute test rmse
test_preds %>% 
  summarize(rmse = sqrt(mean((log_price - .pred)^2))) %>% 
  pull(rmse)
[1] 0.08228594
# Graph results
test_preds %>% 
  ggplot(aes(x = log_price,
             y = .pred)) +
  geom_point(alpha = .5, 
             size = .5) +
  geom_smooth(se = FALSE) +
  geom_abline(slope = 1, 
              intercept = 0, 
              color = "darkred") +
  labs(x = "Actual log(price)", 
       y = "Predicted log(price)")

The test rmse is similar to the validation rmse, indicating that we are not overfitting. These results are similar to, and maybe slightly worse than, the random forest model. It’s possible we could improve this even further by tuning some of the other parameters.
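
As mentioned above, we could also save the fitted final_boost workflow and use it to score new data later without refitting. A minimal sketch (the file name is arbitrary, and new_houses is a stand-in for whatever new data you have):

# save the fitted workflow for later use
write_rds(final_boost, "final_boost_model.rds")

# ... later, possibly in a new session ...
final_boost_loaded <- read_rds("final_boost_model.rds")

# predict on new data; back-transform since we modeled log10(price)
new_houses <- house_testing %>% slice(1:5)   # stand-in for new data
predict(final_boost_loaded, new_data = new_houses) %>% 
  mutate(price_pred = 10^.pred)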

For an example that uses a binary response variable, see Julia Silge’s tutorial involving beach volleyball data.