Gradient Boosted Decision Trees (GBDT)

A Basic Example Using the Iris Dataset

Overview

In this example, we fit a Gradient Boosted Decision Tree (GBDT) model to classify plant species based on characteristic measurements of petals and sepals. We will not go into the mathematical details of the model; a few resources are listed below if you are interested in a deeper dive.

Briefly, GBDT is a meta-classifier, an example of an ensemble method. Rather than averaging independently trained trees, it builds decision trees sequentially: each new tree is fit to the errors (more precisely, the negative gradient of the loss function) of the ensemble built so far, and the final model is the weighted sum of all the trees. The benefit of this method is that it can be more accurate while at the same time being less prone to overfitting than a single decision tree classifier, and it allows for optimization of an arbitrary differentiable loss function.
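To make the sequential idea concrete, here is a toy sketch of the boosting loop for squared-error regression (illustrative only; the function name and parameters are made up for this example, and the real classifier handles multi-class losses and much more):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_stages=50, learning_rate=0.1):
    """Illustrative boosting loop: each tree fits the current residuals."""
    pred = np.full(len(y), y.mean())    # stage 0: a constant prediction
    trees = []
    for _ in range(n_stages):
        residuals = y - pred            # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # shrink each tree's contribution
        trees.append(tree)
    return trees
```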

Prerequisites

This script assumes that you have reviewed the following (or already have this know-how):

Data Exploration

First we import the iris dataset and print a description of it so we can examine what is in the data. Remember, to execute a 'cell' like the one below, you can 1) click on it and run it using the run button above, or 2) click in the cell and hit Shift+Enter.
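A minimal sketch of that loading step, assuming scikit-learn's bundled copy of the iris dataset:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.DESCR)           # full dataset description and summary statistics
print(iris.feature_names)   # sepal/petal length and width, in cm
print(iris.target_names)    # the three species: setosa, versicolor, virginica
```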

We randomly select a quarter of our data to be the 'test' dataset. This way we can train our model on the remaining data and test it on data not used in training. Once we are confident that our model generalizes well (i.e. there is not a huge difference between training and testing performance, or in other words, it is not obviously overfitting), we can use all of our data to train the final model.
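One way to make that split, using scikit-learn's train_test_split (the random seed here is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target,
    test_size=0.25,    # hold out a quarter of the rows for testing
    random_state=42,   # fix the seed so the split is reproducible
)
```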

Gradient Boosted Decision Tree (GBDT)

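We fit the model on the training split and score it on both splits. The sketch below assumes scikit-learn's GradientBoostingClassifier with default hyperparameters; the original post's exact settings are not shown here:

```python
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(random_state=42)
gbdt.fit(X_train, y_train)

print("train accuracy:", gbdt.score(X_train, y_train))  # 1.0 indicates overfitting
print("test accuracy: ", gbdt.score(X_test, y_test))
```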

The GBDT model appears to have overfit the training data (a perfect training score of 1.0); however, it still achieves better performance than the decision tree* model on the testing data.

*Decision tree classifier: a single decision tree fit on the same training split, used as a baseline for comparison.
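A sketch of that baseline, again assuming scikit-learn defaults:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```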

Feedback

If you have ideas on how to improve this post, please let me know: https://predictivemodeler.com/feedback/

Reference: py.iris_randomforest