[scikit-learn] How does the random state influence the decision tree splits?

Sat Oct 27 18:39:42 EDT 2018

Hi all,

when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DecisionTreeClassifier (which is set to None by default).
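
To make the observation concrete, here is a minimal sketch of how one could check it. The iris data is only a stand-in and `fit_structure` is a hypothetical helper introduced for illustration. With a fixed integer seed the fitted tree should be the same on every run; across different seeds the trees may or may not differ, depending on the data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (an assumption for illustration only).
X, y = load_iris(return_X_y=True)


def fit_structure(seed):
    """Hypothetical helper: fit a tree, return split features and thresholds."""
    tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
    return tree.tree_.feature.copy(), tree.tree_.threshold.copy()


# Same integer seed twice: both fits should produce the same tree.
feat_a, thr_a = fit_structure(0)
feat_b, thr_b = fit_structure(0)
same_seed_identical = (
    np.array_equal(feat_a, feat_b) and np.array_equal(thr_a, thr_b)
)

# Different seeds: count how many distinct split-feature sequences show up.
# Whether this is larger than 1 depends on the data.
n_distinct = len({tuple(fit_structure(seed)[0]) for seed in range(20)})

print("same seed, identical tree:", same_seed_identical)
print("distinct structures over 20 seeds:", n_distinct)
```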

I am wondering what exactly this random state is used for. I can imagine it being used for resolving ties if the information gain for multiple features is the same, or it could be that the split points chosen for continuous features differ? (I thought the heuristic is to sort the feature values and to consider thresholds between adjacent values that are associated with examples that have different class labels -- but is there maybe some random subselection involved?)
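
One way to probe the tie-breaking guess is to build a dataset where two features are exact copies of each other, so that any split on one has the same gain as the same split on the other, and then look at which column ends up at the root for different seeds. This is only a sketch on synthetic data (the data and the variable names are made up for illustration), and it does not say anything about how thresholds for continuous features are chosen.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): one informative feature,
# duplicated so that columns 0 and 1 always tie on the split criterion.
rng = np.random.RandomState(0)
x = rng.rand(100)
y = (x > 0.5).astype(int)
X = np.column_stack([x, x])

# Record the feature index used at the root node for a range of seeds.
root_features = [
    DecisionTreeClassifier(random_state=seed).fit(X, y).tree_.feature[0]
    for seed in range(20)
]

# counts[i] is the number of seeds for which column i was used at the root.
counts = np.bincount(root_features, minlength=2)
print("root split on column 0 / column 1:", counts)
```

If both counts are nonzero, the seed influences which of the tied features is picked.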

If someone knows more about this, where the random_state is used, I'd be happy to hear it :)

Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py

    random_state : int, RandomState instance or None, optional (default=None)
        If int, random_state is the seed used by the random number generator;
        If RandomState instance, random_state is the random number generator;
        If None, the random number generator is the RandomState instance used
        by `np.random`.
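
For reference, the three accepted forms from the docstring above can be illustrated with `sklearn.utils.check_random_state`, the public utility that turns such an argument into a RandomState instance. This is just a sketch of the three input types; I am assuming, not claiming, that the tree code resolves its random_state argument in the same way.

```python
import numpy as np
from sklearn.utils import check_random_state

# int: used as the seed of a new RandomState instance.
rs_from_int = check_random_state(42)

# RandomState instance: returned unchanged.
my_rs = np.random.RandomState(42)
rs_from_instance = check_random_state(my_rs)

# None: falls back to the global RandomState used by np.random.
rs_from_none = check_random_state(None)

# The same integer seed gives the same stream of random numbers.
same_stream = np.array_equal(
    check_random_state(42).rand(5), check_random_state(42).rand(5)
)
print(type(rs_from_int).__name__, rs_from_instance is my_rs, same_stream)
```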

Best,
Sebastian