[scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder

hanzi mao hzmao at hotmail.com
Fri Sep 22 15:23:03 EDT 2017


Thanks Jacob!


You explained the ideas behind the two builders very well!


Best,

Hanna

________________________________
From: scikit-learn <scikit-learn-bounces+hzmao=hotmail.com at python.org> on behalf of Jacob Schreiber <jmschreiber91 at gmail.com>
Sent: Friday, September 22, 2017 1:02:54 PM
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder

Hi Hanna

Thanks for the questions!

1) Best first tends to produce unbalanced but sparser trees, and it frequently yields more generalizable models because it captures only the most important interactions. Unbalanced isn't necessarily bad, either: some parts of the tree may have complex split rules that are important to learn, while in other parts additional splits improve purity only a tiny bit and mainly risk overfitting (and thus a less generalizable model). A rough illustration is sketched below.
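To make this concrete, here is a small sketch (my own illustration, not part of the original exchange; the dataset and the max_leaf_nodes value are arbitrary) comparing the default depth-first build with a best-first build capped by max_leaf_nodes:

    # Not from the original thread -- just a sketch to illustrate the point.
    # Fitting with and without max_leaf_nodes switches between the two builders.
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    # Depth-first build (default): grows every branch until the leaves are pure.
    depth_first = DecisionTreeRegressor(random_state=0).fit(X, y)

    # Best-first build: triggered by max_leaf_nodes, keeps only the splits with
    # the largest impurity decrease, so the tree is sparser and may be unbalanced.
    best_first = DecisionTreeRegressor(max_leaf_nodes=32, random_state=0).fit(X, y)

    for name, model in [("depth-first", depth_first), ("best-first", best_first)]:
        n_leaves = (model.tree_.children_left == -1).sum()  # -1 marks a leaf node
        print(name, "leaves:", n_leaves, "depth:", model.tree_.max_depth)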

2) If you let best first and depth first run until purity is reached, they will produce identical trees; the only difference is the order in which the nodes get added. Best first adds nodes ordered by their increase in purity, while depth first adds them essentially in the order of a depth-first search. If you stop a best-first build early, you get a tree in which the most important interactions are captured first; if you stop a depth-first build early, you get a really good split of one or maybe a few areas of the dataset (generally speaking). The reason max_leaf_nodes decides whether BestFirstTreeBuilder is used is that it doesn't make sense to limit a depth-first build by the number of nodes, and it doesn't make sense to run a best-first build without limiting the number of nodes in the tree. A quick sanity check is sketched below.
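And a quick sanity check of the "identical when run to purity" point (again my own sketch, not from the thread; it assumes the default splitter and simply gives the best-first build a leaf budget it can never exhaust, i.e. one leaf per training sample):

    # Sketch: give best-first enough room to run to purity and check that the
    # two builds end up equivalent on the training data.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=1)

    # Depth-first (no max_leaf_nodes) vs best-first with an unreachable leaf cap.
    depth_first = DecisionTreeRegressor(random_state=1).fit(X, y)
    best_first = DecisionTreeRegressor(max_leaf_nodes=len(y), random_state=1).fit(X, y)

    leaves_df = (depth_first.tree_.children_left == -1).sum()
    leaves_bf = (best_first.tree_.children_left == -1).sum()
    print("same number of leaves: ", leaves_df == leaves_bf)
    print("identical predictions: ",
          np.allclose(depth_first.predict(X), best_first.predict(X)))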

Let me know if you have any further questions!

Jacob

On Thu, Sep 21, 2017 at 1:38 PM, hanzi mao <hzmao at hotmail.com> wrote:

Hi,


I am reading the source code of the Decision Tree Regressor in sklearn. A tree can be built in two fashions: depth first and best first. The best-first fashion is adopted only when the user sets max_leaf_nodes; otherwise, the tree is built using the DepthFirstTreeBuilder. My questions are:


  1.  Are there any practical considerations for when to use depth-first versus best-first? Does the depth-first fashion have an overwhelming advantage or popularity compared with the best-first one that makes it the default choice?
  2.  I am somewhat confused about why an optional parameter, max_leaf_nodes, decides whether the BestFirstTreeBuilder is used. I am wondering what considerations led to this design.

Thanks!

Best,
Hanna

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn

