Entry 47: Pruning Decision Trees

8 minute read

Left unchecked, a Decision Tree will keep splitting the data until each leaf is pure. This causes two problems:

  • Overfitting
  • High complexity

I already covered overfitting in Entry 46, so in this entry I’ll go over how to control complexity.

The code for this entry can be found in the Entry 47 notebook on my GitHub page.

The Problem

As a model gets deeper it becomes harder to interpret, negating one of the major benefits of using a Decision Tree. This becomes readily apparent when trees grow arbitrarily large, as can be seen in this model trained on the Titanic dataset:

Titanic tree

The Options

The solution to this problem is to limit tree growth through a process called pruning, which may also be referred to as setting a cut-off. There are several ways to prune a Decision Tree.

  • Pre-pruning: Limits on tree growth are set before training the model; i.e. stop splitting before all leaves are pure
    • There are several ways to limit splitting, all of which can be set easily using parameters within sklearn.tree.DecisionTreeClassifier and sklearn.tree.DecisionTreeRegressor
      • max_depth: The maximum depth of the tree. Limits the depth of all branches to the same number.
      • min_samples_split: The minimum number of samples required to split an internal node.
      • min_samples_leaf: The minimum number of samples required to be at a leaf node.
      • max_leaf_nodes: Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.
      • min_impurity_decrease: A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
      • min_impurity_split: Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.
  • Post-pruning: Build the tree then cut back leaf nodes; i.e. remove nodes after training
    • Applied Predictive Modeling discusses two ways to do this on page 178
      • cost-complexity tuning: penalizes the impurity measure using the size of the tree
      • one-standard-error rule: find the smallest tree that is within one standard error of the tree with the smallest absolute error
    • The sklearn package doesn’t have a built-in way to remove or consolidate nodes after training (newer releases have since added cost-complexity post-pruning via the ccp_alpha parameter)

Unlimited Tree

The unpruned tree I created from the Iris dataset is below. It has the following characteristics:

  • Depth = 5
  • Leaves = 8
  • Leaf node sample minimum = 1
  • Node split minimum = 3
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.051

Full Iris tree
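
A minimal sketch of how the unpruned tree and the characteristics above can be produced and inspected (the variable names, random_state, and train/test split here are my own assumptions, so the exact numbers in the notebook may differ):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)

# No pruning parameters: the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=12).fit(X_train, y_train)

leaves = full_tree.tree_.children_left == -1  # boolean mask of leaf nodes
print("Depth:", full_tree.get_depth())
print("Leaves:", full_tree.get_n_leaves())
print("Leaf sample minimum:", full_tree.tree_.n_node_samples[leaves].min())
print("Split sample minimum:", full_tree.tree_.n_node_samples[~leaves].min())
print("Min Gini at leaf:", full_tree.tree_.impurity[leaves].min())
print("Min Gini at split:", full_tree.tree_.impurity[~leaves].min())
```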

Limit Depth

The most straightforward method of limiting a tree is to grow it only to a specific depth.

Looking at the full Iris tree, we can see that the majority of observations are correctly classified by a depth of three. When we limit the tree to this depth it has the following characteristics:

  • Depth = 3
  • Leaves = 5
  • Leaf node sample minimum = 3
  • Node split minimum = 36
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.054

Depth 3 Iris tree
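
Reusing the setup from the unpruned tree above, this limit is a single parameter (the depth of 3 matches the example; the variable name is mine):

```python
# Stop splitting once the tree reaches a depth of 3
depth_3_tree = DecisionTreeClassifier(max_depth=3, random_state=12).fit(X_train, y_train)
```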

Require a Minimum Number of Samples to Split

Another way to limit tree growth is to require a minimum number of samples in a node for that node to be split. In practice, when the Iris example is required to have five samples in a node for it to be eligible for splitting, it looks like the tree below and has the following characteristics:

  • Depth = 4
  • Leaves = 7
  • Leaf node sample minimum = 1
  • Node split minimum = 6
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.051

5 samples per split Iris tree
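
Again reusing the setup above, a sketch with the five-sample requirement from this example:

```python
# A node must contain at least 5 samples to be eligible for splitting
split_5_tree = DecisionTreeClassifier(min_samples_split=5, random_state=12).fit(X_train, y_train)
```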

Require a Minimum Number of Samples in a Leaf

Tree growth can also be limited by requiring a minimum number of samples in a leaf node. For this example, I required at least five samples in any leaf node. The Iris example looks like the tree below and has the following characteristics:

  • Depth = 4
  • Leaves = 6
  • Leaf node sample minimum = 5
  • Node split minimum = 36
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.051

5 samples per leaf Iris tree
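
The corresponding sketch for this example, reusing the same setup:

```python
# Every leaf must end up with at least 5 samples
leaf_5_tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=12).fit(X_train, y_train)
```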

Limit the Number of Leaves

Another option to limit tree growth is to specify the maximum number of leaf nodes. In determining which leaf nodes to use, scikit-learn’s DecisionTreeClassifier uses the following criteria: “Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity.” Using the Iris example, I limited the number of leaf nodes to five. This gives a tree with the following characteristics:

  • Depth = 3
  • Leaves = 5
  • Leaf node sample minimum = 1
  • Node split minimum = 38
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.051

5 leaf Iris tree
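
A sketch of this limit with the five-leaf cap from the example, again reusing the setup above:

```python
# Grow at most 5 leaves, adding splits in best-first order by impurity reduction
leaf_cap_tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=12).fit(X_train, y_train)
```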

Minimum Impurity Decrease

The next pruning method is to set a required minimum on the decrease in the impurity measure. Remember that a decrease in the impurity measure means the purity of the node increases, so by setting a minimum for the decrease you’re requiring a minimum improvement in purity before a split is made.
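
For reference, the quantity scikit-learn compares against min_impurity_decrease is a weighted version of that improvement. A small sketch of the calculation described in the documentation (the function and argument names are my own):

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """The value scikit-learn compares against min_impurity_decrease.

    N      -- total number of training samples
    N_t    -- samples at the node being considered for a split
    N_t_L  -- samples that would go to the left child
    N_t_R  -- samples that would go to the right child
    """
    return (N_t / N) * (impurity
                        - (N_t_R / N_t) * right_impurity
                        - (N_t_L / N_t) * left_impurity)
```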

When a value of 0.04 is set for the minimum impurity decrease on the Iris data, we get a tree with the following characteristics:

  • Depth = 2
  • Leaves = 3
  • Leaf node sample minimum = 40
  • Node split minimum = 380
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.5

Min impurity decrease Iris tree

Minimum Impurity at Split

The last pruning method is similar to the previous one. We’re still working with the impurity measure, but this time we set a minimum impurity a node must have in order to be split. The relationship between these two impurity-based methods mirrors the sample-based methods above, where the requirement was placed either on the parent node (in order to split it) or on the resulting leaf.

When a value of 0.02 is set for the minimum impurity at a split on the Iris data, we get a tree with the following characteristics:

  • Depth = 5
  • Leaves = 6
  • Leaf node sample minimum = 1
  • Node split minimum = 3
  • Minimum Gini impurity at leaf = 0.0
  • Minimum Gini impurity at split = 0.201

Min split impurity Iris tree
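
Reusing the setup above, a sketch with the 0.02 threshold from this example (note that min_impurity_split was later deprecated in scikit-learn in favor of min_impurity_decrease, so this only runs on the version that was current when I wrote this entry):

```python
# Only split nodes whose impurity is above 0.02; purer nodes become leaves
# (min_impurity_split is deprecated/removed in newer scikit-learn releases)
impurity_split_tree = DecisionTreeClassifier(min_impurity_split=0.02,
                                             random_state=12).fit(X_train, y_train)
```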

The Proposed Solution

After examining the results of specific changes to three datasets (Iris, Breast Cancer, and Titanic), several things popped out at me.

Limit Depth

Limiting depth appears rather arbitrary and manual. I wasn’t comfortable limiting depth until I had a fully trained tree: I’d look at the tree to determine where nodes started appearing with insignificantly small numbers of samples or diminishing purity improvements, then decide where to set the depth.

Also, some branch splits continued to appear useful after others had isolated extremely small sample groups (1-5 samples). So I’d want to limit one branch at depth X but wouldn’t want to limit a different branch until a deeper level Y.

Here are some specific observations:

  • Iris dataset: the splits produce samples that don’t appear useful after a depth of 2; there are only 6 misclassifications at that level, but the tree continues down another 3 levels to a depth of 5
  • Breast Cancer dataset: the splits start producing nodes that don’t appear useful after a depth of 2; at a depth of 3, 6 of the 8 nodes have between 2 and 14 samples in a population of 455
  • Titanic dataset: has a depth of 22 for a population of 712. Nodes that don’t appear useful start appearing at a depth of 3

While all of the datasets started producing useless nodes pretty high up in the tree (depth 2 and 3), other nodes continued to have useful splits at deeper levels.

We can see this phenomenon in the Iris dataset. At depth 1 all of the virginica samples are split out, as seen in the orange node. The setosa (green node) and versicolor (purple node) are then mostly separated at depth 2.

Min impurity decrease Iris tree

Minimum Samples per Split or Leaf

Setting a minimum requirement on the number of samples needed either to split a node or to create a leaf helps address the problem of arbitrarily small samples in a leaf. A good example is the Titanic dataset with 15 samples required in a leaf node.

15 samples per leaf Titanic tree

We can also see how setting the limitation on the split is less restrictive than setting it on the leaf node. This tree was created with a 15 sample minimum on the split instead of the leaf:

15 samples per split Titanic tree

This feels like a good option to include in a solution, but perhaps not a standalone solution.

Maximum Number of Leaves

This one is fun to play with, but seems impractical at scale. Depending on the size, complexity, and quirks of the data, the optimal number of leaves could vary drastically. As such, it’d be extremely hard to come up with a maximum number of leaves prior to training the model, which means you’d have to train the model first and then figure out how many leaves you wanted.

Minimum Impurity

Setting a minimum for the impurity to stop tree growth doesn’t address the problem of an arbitrarily small number of samples in a leaf. This problem can be seen in a section of the tree trained on the Titanic data with a min_impurity_split parameter of 0.02 where there are still leaves with sample sizes of 2 and 3:

Minimum impurity sample Titanic tree

Setting limitations around a required decrease in impurity appears to be the most scalable. This prevents continued splits with little to no improvement. A tree trained on the Titanic data with a min_impurity_decrease parameter of 0.02 results in this little depth 3 tree with 4 leaf nodes:

Min impurity decrease 0.02 Titanic tree

Even halving the parameter to 0.01 (now only requiring a 1% improvement in purity) results in this reasonably sized tree with the same depth of 3 and only 1 additional leaf node for a total of 5 leaves.

Min impurity decrease 0.01 Titanic tree

I’m excited to start playing with these types of limitations once I move to real-world datasets. One of the things I’d like to try is combining a minimum impurity to split with a minimum number of samples per leaf. These two parameters together should help limit overfitting while still being scalable across different datasets. It will be interesting to see whether this combination or the minimum impurity decrease will provide more consistent and accurate results.
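
As a rough sketch of the comparison I have in mind (the parameter values are placeholders pulled from the examples above, not tuned choices):

```python
# Candidate 1: a minimum impurity required to split, plus a minimum leaf size
combo_tree = DecisionTreeClassifier(min_impurity_split=0.02, min_samples_leaf=15)

# Candidate 2: a minimum improvement in weighted impurity from each split
decrease_tree = DecisionTreeClassifier(min_impurity_decrease=0.01)
```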

Up Next

Impurity Measures

Resources