Random Forest Classifier (RFC)

A Basic Example Using the Iris Dataset

Overview

In this example, we build a Random Forest Classifier (RFC) model to classify plant species based on characteristic measurements of petals and sepals. We will not go into the mathematical details of the model; a few resources are listed below if you are interested in a deeper dive.

Briefly, a Random Forest is a meta classifier, an example of an ensemble method. It takes decision trees built on various samples of the training dataset, possibly with different parameters, and averages their predictions to yield the final model. The benefit of this method is that it can be more accurate while at the same time being less prone to overfitting than a single decision tree classifier.

Prerequisites

This script assumes that you have reviewed the following (or already have this know-how):

Data Exploration

First we import the iris dataset and print a description of it so we can examine what is in the data. Remember, to execute a 'cell' like the one below, you can 1) click on it and run it using the run button above, or 2) click in the cell and hit Shift+Enter.
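The loading step can be sketched as follows with scikit-learn's bundled copy of the dataset (the exact print statements are an assumption about what the original cell contained):

```python
from sklearn.datasets import load_iris

# Load the iris dataset bundled with scikit-learn:
# 150 samples, 4 features, 3 species labels
iris = load_iris()

print(iris.DESCR[:300])    # the first part of the dataset description
print(iris.feature_names)  # sepal/petal length and width (cm)
print(iris.target_names)   # setosa, versicolor, virginica
```

The `DESCR` attribute holds the full text description, which is worth reading in full before modeling.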

We randomly select a quarter of our data to be the 'test' dataset. This way we can train our model on the remaining data and test it on data not used in training. Once we are confident that our model is generalizing well (i.e. there is not a huge difference between the training and testing performance, or in other words, it is not obviously overfitting), we can use all of our data to train the model.
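A minimal sketch of that split using `train_test_split` (the `random_state` value is an arbitrary choice for reproducibility, not from the original):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Hold out a quarter of the data for testing; fixing random_state
# makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape)
```

Stratifying the split (`stratify=iris.target`) is another reasonable choice here, since it keeps the three species balanced across the two sets.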

Random Forest Classifier

As described in the overview, the random forest averages many decision trees built on different samples of the training data. Here we fit one to the training set and compare its scores against the decision tree classifier.

Note that while the decision tree classifier had a training score of 1.0 (i.e. it overfit), the random forest's performance is more consistent across the training and testing datasets, and its testing performance is a little better than that of the decision tree classifier.

Feedback

If you have ideas on how to improve this post, please let me know: https://predictivemodeler.com/feedback/

Reference: py.iris_randomforest