This tutorial gives an overview of boosted trees and how to implement them in tidymodels.
You can download this .Rmd file below if you’d like to follow along. I do have a few hidden notes you can disregard. This document is a distill_article, so you may want to change to an html_document to knit. You will also need to delete any image references to properly knit, since you won’t have those images.
First, we load the libraries we will use. There will be some new ones you’ll need to install.
library(tidyverse) # for reading in data, graphing, and cleaning
library(tidymodels) # for modeling ... tidily
library(usemodels) # for tidymodels suggestions
library(xgboost) # for boosting - needs to be installed; loading it explicitly is optional
library(doParallel) # for parallel processing
library(vip) # for quick vip plot
library(lubridate) # for dates
library(moderndive) # for King County housing data
library(patchwork) # for combining plots nicely
library(rmarkdown) # for paged tables
theme_set(theme_minimal()) # my favorite ggplot2 theme :)
Then we load the data we will use throughout this tutorial and do some modifications. As I mentioned before, I wouldn’t need to take the log here, but I do so I can compare to other models, if desired.
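The code below is a minimal sketch of that step, assuming the house_prices data from the moderndive package; taking a base-10 log of price and dropping the original price column are choices made here for illustration and may differ slightly from the earlier prep.
house_prices <- house_prices %>%
  # create the log (base 10) price used as the response later
  mutate(log_price = log(price, base = 10)) %>%
  # drop the original price so it isn't available as a predictor (assumption)
  select(-price)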
Boosting is a machine learning algorithm that is similar to bagging or random forests, but the trees are NOT independent of one another. Instead, we use information from prior trees to inform how we build the next trees. The procedure is outlined below (based on Algorithm 8.2 from ISLR).
Initialize: In this step, we set the fitted values of all observations to \(0\), \(\hat{f}(x) = 0\), and the residuals to the actual values of the response variable, \(r_i = y_i\). We could think of this as step \(0\) and label it \(\hat{f}^0(x) = 0\).
Iteration 1:
1. Fit a tree with \(d\) splits, \(\hat{f}^1\), using the residuals \(r_i\) as the response (at this point they are just the \(y_i\)).
2. Update the fitted values to \[ \hat{f}(x) = \hat{f}(x) + \lambda \hat{f}^1(x) = \lambda \hat{f}^1(x), \] where \(\lambda\) is a small, positive shrinkage parameter that slows the learning.
3. Update the residuals, \[ r_i = r_i - \lambda \hat{f}^1(x_i) = y_i - \lambda \hat{f}^1(x_i). \]
Iteration 2:
1. Fit the next tree, \(\hat{f}^2\), with \(d\) splits on the updated residuals.
2. Update the fitted values to
\[
\hat{f}(x) = \hat{f}(x) + \lambda \hat{f}^2(x),
\] So, after 2 iterations \[
\hat{f}(x) = \lambda \hat{f}^1(x) + \lambda \hat{f}^2(x),
\] where \(\hat{f}^2(x)\) are the fitted values using the 2nd tree fit on the residuals.
3. And update the residuals, \[
r_i = r_i - \lambda \hat{f}^2(x_i).
\]
So, after the 2nd iteration, \(r_i = y_i - \lambda \hat{f}^1(x_i) - \lambda \hat{f}^2(x_i)\).
Iteration t:
1. Fit the next tree, \(\hat{f}^t\), with \(d\) splits on the residuals updated from the previous iteration, \(r_i\).
2. Update the fitted values to \[ \hat{f}(x) = \hat{f}(x) + \lambda \hat{f}^t(x). \]
3. Update the residuals, \[ r_i = r_i - \lambda \hat{f}^t(x_i). \]
Final model:
The final model is defined below, where \(T\) is the number of trees. \(T\) is another parameter that is tuned.
\[ \hat{f}(x) = \sum_{j = 1}^{T} \lambda \hat{f}^j(x) \]
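To make these steps concrete, the code below is a small hand-rolled sketch of the regression algorithm, assuming the rpart package is installed and using the built-in mtcars data purely for illustration; it is not how we will fit boosted trees in tidymodels below.
library(rpart) # small regression trees, for illustration only

lambda <- 0.1                        # shrinkage parameter
n_trees <- 50                        # T, the number of trees
fitted_vals <- rep(0, nrow(mtcars))  # fitted values initialized to 0
resid <- mtcars$mpg                  # residuals initialized to y_i
preds <- mtcars %>% select(-mpg)     # predictors only

for (t in seq_len(n_trees)) {
  dat <- preds %>% mutate(r = resid)
  # 1. fit a shallow tree to the current residuals
  tree_t <- rpart(r ~ ., data = dat, control = rpart.control(maxdepth = 2))
  pred_t <- predict(tree_t, dat)
  # 2. update the fitted values
  fitted_vals <- fitted_vals + lambda * pred_t
  # 3. update the residuals using only the newest tree
  resid <- resid - lambda * pred_t
}
After the loop, fitted_vals is the sum of the shrunken predictions from all of the trees, which matches the final model written above.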
I have skipped over a lot of details. Please see the resources above for further depth. A similar algorithm is used for classification, although it is slightly more complex since the residuals are not as straightforward to compute.
tidymodels
With the basics of how this model works in mind, let’s jump into how to set this up using our tidymodels toolkit.
We’ll once again be using the King County housing data, which was prepped above. The next step is to split the data.
set.seed(327) # for reproducibility
# Randomly assigns 75% of the data to training.
house_split <- initial_split(house_prices,
                             prop = .75)
house_training <- training(house_split)
house_testing <- testing(house_split)
Next, we do some transformations. I used one-hot dummy encoding, as suggested by the use_xgboost() function, whose template code is shown below.
use_xgboost(log_price ~ .,
            data = house_training)
xgboost_recipe <-
  recipe(formula = log_price ~ ., data = house_training) %>%
  step_novel(all_nominal(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  step_zv(all_predictors())

xgboost_spec <-
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(),
             loss_reduction = tune(), sample_size = tune()) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgboost_workflow <-
  workflow() %>%
  add_recipe(xgboost_recipe) %>%
  add_model(xgboost_spec)

set.seed(77987)
xgboost_tune <-
  tune_grid(xgboost_workflow,
            resamples = stop("add your rsample object"),
            grid = stop("add number of candidate points"))
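The code above is only a template. In particular, the tune_grid() call has stop() placeholders where we would supply our own resampling object and tuning grid. Below, I adapt the suggested recipe: I add a month variable created from date and change the role of date so it is no longer used as a predictor.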
# set up recipe and transformation steps and roles
boost_recipe <-
  recipe(formula = log_price ~ .,
         data = house_training) %>%
  step_date(date,
            features = "month") %>%
  # Make these evaluative variables, not included in modeling
  update_role(date,
              new_role = "evaluative") %>%
  step_novel(all_nominal_predictors()) %>% # recently learned about this helper
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors())
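We can apply the recipe to the training data to check that these pre-processing steps do what we expect: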
boost_recipe %>%
  prep() %>%
  juice()