Decision Tree Classifier (DTC)

A Basic Example Using the Iris Dataset

Overview

In this example, we fit a Decision Tree Classifier (DTC) model to classify iris species based on measurements of their petals and sepals. We will not go into the mathematical details of the model. A few resources are listed below if you are interested in a deeper dive.

Prerequisites

This script assumes that you have reviewed the following (or already have this know-how):

Data Exploration

First, we import the iris dataset and print its description so we can examine what is in the data. Remember that to execute a 'cell' like the one below, you can either 1) click on it and use the run button above, or 2) click in the cell and press Shift+Enter.
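As a minimal sketch of this step, assuming the copy of the iris dataset that ships with scikit-learn (the variable name `iris` is just an illustrative choice), loading and describing the data might look like this:

```python
from sklearn.datasets import load_iris

# Load the iris dataset bundled with scikit-learn (assumed data source).
iris = load_iris()

# Print the built-in description of the dataset.
print(iris.DESCR)

# Look at the feature names, the class names, and the size of the data.
print(iris.feature_names)
print(iris.target_names)
print(iris.data.shape)
```

The description lists the four measurements (sepal length, sepal width, petal length, petal width) and the three species we want to predict.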

We randomly select a quarter of our data to be the 'test' dataset. This way we can train our model on the remaining data and test it on data not used in training. Once we are confident that our model is generalizing well (i.e. there is not a huge difference between the training and testing performance, or in other words, it is not obviously overfitting), we can use all of our data to train the model.
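One way to hold out a quarter of the data is scikit-learn's `train_test_split` with `test_size=0.25`. This is a sketch; the `random_state=0` seed is an assumption added only so the split is repeatable, and a different seed gives a different split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Features (X) and species labels (y) as arrays.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing.
# random_state=0 is an assumed seed, used only for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print("training rows:", X_train.shape[0])
print("testing rows:", X_test.shape[0])
```

With 150 observations, this leaves 112 rows for training and 38 for testing.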

Basic Decision Tree

The goal of the basic decision tree is to find decision boundaries, expressed as a series of if-then-else statements, that classify observations into different categories.
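To make the if-then-else idea concrete, here is a sketch that fits scikit-learn's `DecisionTreeClassifier` with default settings and prints the first rule the tree learned. The seeds and variable names are assumptions for illustration, not necessarily what was used to produce the results in this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0  # assumed seed
)

# Fit a decision tree with default settings (no depth limit).
clf = DecisionTreeClassifier(random_state=0)  # assumed seed
clf.fit(X_train, y_train)

# Node 0 is the root: the first if-then-else test applied to every observation.
root_feature = iris.feature_names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
print(f"if {root_feature} <= {root_threshold:.2f}: go left, else: go right")

# Overall size of the fitted tree.
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

Each internal node of the tree is one such test on a single measurement, and each leaf assigns a species.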

For more information:

Using the above chart, you can see the classification as a series of if-true, if-false branches. You can also output the tree more compactly, as below:
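A compact text rendering is available through `sklearn.tree.export_text`. The sketch below refits the tree so that it runs on its own; the seeds are assumptions, so the exact thresholds printed may differ from the tree discussed in this post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0  # assumed seed
)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Render the fitted tree as indented if/else rules, one line per node.
rules = export_text(clf, feature_names=iris.feature_names)
print(rules)
```

Each indented line is one branch condition, and lines starting with `class:` are the leaves where a species is assigned.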

Note that the decision tree has essentially overfit the training data, giving a perfect training score of 1.0 (100%). The score on the testing data is lower, at 0.89 (89%).
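The comparison above can be sketched with the classifier's `score` method, which returns accuracy as a fraction between 0 and 1. The seeds here are assumptions, so the testing score you get depends on the split and need not match the figure quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed seed
)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Accuracy on the data the tree was trained on.
train_acc = clf.score(X_train, y_train)

# Accuracy on the held-out data the tree has never seen.
test_acc = clf.score(X_test, y_test)

print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

A gap between the two numbers is the signal to watch: an unrestricted tree keeps splitting until it fits the training rows exactly, which does not guarantee the same accuracy on new data.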

Feedback

If you have ideas on how to improve this post, please let me know: https://predictivemodeler.com/feedback/

Reference: py.iris_dtc