[scikit-learn] Decision Tree Regressor - DepthFirstTreeBuilder vs BestFirstTreeBuilder
jmschreiber91 at gmail.com
Fri Sep 22 13:02:54 EDT 2017
Thanks for the questions!
1) Best first tends to produce unbalanced but sparser trees, and frequently
produces more generalizable models by only capturing the most important
interactions. Unbalanced isn't necessarily bad either. You can imagine that
in some parts of the tree there are complex split rules that are important
to learn, while in other parts of the tree the additional splits only
improve purity a tiny bit and risk overfitting (and thus being less
generalizable).
2) If you let best first and depth first run until purity is reached, they
will produce identical trees. The only difference is the ordering of the
nodes as they get added to the tree. Best first will add nodes to the tree
ordered by their increase in purity, and depth first adds nodes essentially
in the order one would do a depth-first search. If one were to stop best
first building early, they would get a tree where the important
interactions are captured first, whereas if one were to stop a depth-first
build early, they would get a really good split of one or maybe a few areas
of the dataset (generally speaking). The reason max_leaf_nodes decides whether
BestFirstTreeBuilder is used is that it doesn't make sense to limit a
depth-first build by the number of nodes, and it doesn't make sense to run
BestFirstTreeBuilder without limiting the number of nodes in the tree.
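To make the ordering argument concrete, here is a toy sketch in plain Python. It is not scikit-learn's implementation: the helper names (sse, best_split, depth_first, best_first) are made up for illustration, the data is a single feature already sorted by value, and impurity is the sum of squared errors. Each split is recorded as (lo, cut, hi), meaning the node covering ys[lo:hi] is split at index cut.

```python
import heapq

def sse(ys):
    """Sum of squared errors around the mean (the node's impurity)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(ys, lo, hi):
    """Return (gain, cut) for the best split of ys[lo:hi], or None if pure."""
    parent = sse(ys[lo:hi])
    if hi - lo < 2 or parent == 0.0:
        return None
    # Try every cut point and keep the one with the largest impurity decrease.
    return max(
        (parent - sse(ys[lo:c]) - sse(ys[c:hi]), c) for c in range(lo + 1, hi)
    )

def depth_first(ys):
    """Split nodes in depth-first order until every leaf is pure."""
    order, stack = [], [(0, len(ys))]
    while stack:
        lo, hi = stack.pop()
        found = best_split(ys, lo, hi)
        if found is None:
            continue
        _, cut = found
        order.append((lo, cut, hi))
        stack.append((cut, hi))  # right child waits on the stack
        stack.append((lo, cut))  # left child is expanded next
    return order

def best_first(ys, max_leaf_nodes=None):
    """Always split the frontier node with the largest impurity decrease."""
    order, heap, leaves = [], [], 1

    def push(lo, hi):
        found = best_split(ys, lo, hi)
        if found is not None:
            gain, cut = found
            heapq.heappush(heap, (-gain, lo, cut, hi))  # max-heap on gain

    push(0, len(ys))
    while heap and (max_leaf_nodes is None or leaves < max_leaf_nodes):
        _, lo, cut, hi = heapq.heappop(heap)
        order.append((lo, cut, hi))
        leaves += 1  # one leaf becomes two
        push(lo, cut)
        push(cut, hi)
    return order

# Targets for 8 samples: small variation on the left, a big jump on the right.
ys = [1, 2, 3, 4, 50, 50, 90, 90]

df_order = depth_first(ys)
bf_order = best_first(ys)
bf_limited = best_first(ys, max_leaf_nodes=3)

print("depth first:", df_order)
print("best first: ", bf_order)
print("best first, max_leaf_nodes=3:", bf_limited)
```

Run to purity, both builders make the same set of splits and only the order differs. On this data, best first handles the high-variance right half immediately after the root, while depth first descends into the low-variance left half first. With max_leaf_nodes=3 the best-first build stops after its two highest-gain splits.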
Let me know if you have any further questions!
On Thu, Sep 21, 2017 at 1:38 PM, hanzi mao <hzmao at hotmail.com> wrote:
> I am reading the source code of the Decision Tree Regressor in sklearn. To
> build a tree, there are two fashions: depth first and best first. Best
> first fashion is adopted only when the user sets max_leaf_nodes. Otherwise,
> the tree will be built using the DepthFirstTreeBuilder. My questions are:
> 1. Are there any practical considerations when to use depth-first or
> best-first? Does the depth-first fashion have an overwhelming advantage /
> popularity compared with the best-first one which makes it a default
> 2. I am kind of confused about why an optional parameter, max_leaf_nodes,
> is used to decide whether to use BestFirstTreeBuilder or not. I am wondering if
> there are some considerations when you decide to develop like this.