[scikit-learn] How does the random state influence the decision tree splits?

Guillaume Lemaître g.lemaitre58 at gmail.com
Sun Oct 28 04:34:41 EDT 2018


FYI: https://github.com/scikit-learn/scikit-learn/pull/12364

On Sun, 28 Oct 2018 at 09:32, Guillaume Lemaître <g.lemaitre58 at gmail.com>
wrote:

> There is always a shuffling when iterating over the features (even when
> going through all of them).
> So in the case of a tie, the split will be made on the first feature
> encountered, which can differ from run to run due to the shuffling.
>
> There is a PR intended to make the algorithm deterministic, so that it
> always selects the same feature in the case of a tie.
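>
> To make the tie-breaking concrete, here is a rough sketch (plain numpy,
> not the actual Cython splitter) of how a tie plus a shuffled visit order
> changes the selected feature; the gain values are made up for illustration:
>
> import numpy as np
>
> # hypothetical impurity improvements per feature; features 0 and 3 are tied
> gains = [0.33, 0.10, 0.21, 0.33]
>
> def pick_feature(gains, seed):
>     rng = np.random.RandomState(seed)
>     order = rng.permutation(len(gains))   # shuffled visit order over features
>     best_feature, best_gain = None, -np.inf
>     for f in order:
>         if gains[f] > best_gain:          # strictly better only, so the first
>             best_feature, best_gain = f, gains[f]  # tied feature visited wins
>     return best_feature
>
> print([pick_feature(gains, seed) for seed in range(10)])
> # a mix of 0s and 3s: whichever tied feature comes first in the shuffle wins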
>
> On Sun, 28 Oct 2018 at 09:22, Fernando Marcos Wittmann <
> fernando.wittmann at gmail.com> wrote:
>
>> The random_state is used in the splitters:
>>
>>         SPLITTERS = SPARSE_SPLITTERS if issparse(X) else DENSE_SPLITTERS
>>
>>         splitter = self.splitter
>>         if not isinstance(self.splitter, Splitter):
>>             splitter = SPLITTERS[self.splitter](criterion,
>>                                                 self.max_features_,
>>                                                 min_samples_leaf,
>>                                                 min_weight_leaf,
>>                                                 random_state,
>>                                                 self.presort)
>>
>> Which is defined as:
>>
>> DENSE_SPLITTERS = {"best": _splitter.BestSplitter,
>>                    "random": _splitter.RandomSplitter}
>>
>> SPARSE_SPLITTERS = {"best": _splitter.BestSparseSplitter,
>>                     "random": _splitter.RandomSparseSplitter}
>>
>>
>> Both 'best' and 'random' use the random state. The DecisionTreeClassifier
>> uses 'best' as the default `splitter` parameter. I am not sure how this
>> 'best' strategy is defined. The docs only say: "Supported strategies are
>> "best" to choose the best split and "random" to choose the best random
>> split."
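>>
>> For what it's worth, a quick sketch using only the public API to see that
>> both strategies take the seed into account ("random" draws random
>> thresholds, so it depends on random_state even without ties):
>>
>> from sklearn.datasets import load_iris
>> from sklearn.tree import DecisionTreeClassifier
>>
>> X, y = load_iris(return_X_y=True)
>> best_tree = DecisionTreeClassifier(splitter="best", random_state=0).fit(X, y)
>> rand_tree = DecisionTreeClassifier(splitter="random", random_state=0).fit(X, y)
>> # feature index and threshold of the root split for each strategy
>> print(best_tree.tree_.feature[0], best_tree.tree_.threshold[0])
>> print(rand_tree.tree_.feature[0], rand_tree.tree_.threshold[0])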
>>
>>
>>
>>
>> On Sun, Oct 28, 2018 at 9:32 AM Piotr Szymański <niedakh at gmail.com>
>> wrote:
>>
>>> Just a small side note on something I've come across with Random Forests,
>>> which in the end are an ensemble of Decision Trees. I ran a thousand
>>> iterations of RFs on multi-label data and got a 4-10 percentage point
>>> difference in subset accuracy, depending on the data set, purely as a
>>> random effect, while I've seen papers report differences of just a couple
>>> of percentage points as statistically significant after a non-parametric
>>> rank test.
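>>>
>>> A rough sketch of how one might measure that seed-to-seed spread (here
>>> with plain accuracy on a synthetic dataset rather than the multi-label
>>> subset accuracy I used):
>>>
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>>
>>> X, y = make_classification(n_samples=500, n_features=20, random_state=0)
>>> X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
>>> scores = [RandomForestClassifier(n_estimators=50, random_state=seed)
>>>           .fit(X_tr, y_tr).score(X_te, y_te) for seed in range(30)]
>>> # the min-max spread below is purely a random-seed effect
>>> print(min(scores), max(scores))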
>>>
>>> On Sun, Oct 28, 2018 at 7:44 AM Sebastian Raschka <
>>> mail at sebastianraschka.com> wrote:
>>>
>>>> Good suggestion. The trees look different. I.e., there seems to be a
>>>> tie at some point between choosing X[:, 0] <= 4.95 and X[:, 3] <= 1.65
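>>>>
>>>> For reference, one quick way to inspect the fitted splits (a sketch
>>>> reusing the same iris split as in the snippet quoted further down;
>>>> tree_.feature and tree_.threshold hold each node's split, root at index 0):
>>>>
>>>> from sklearn.datasets import load_iris
>>>> from sklearn.model_selection import train_test_split
>>>> from sklearn.tree import DecisionTreeClassifier
>>>>
>>>> X, y = load_iris(return_X_y=True)
>>>> X_train, X_test, y_train, y_test = train_test_split(
>>>>     X, y, test_size=0.3, random_state=123, stratify=y)
>>>>
>>>> for i in range(5):
>>>>     tree = DecisionTreeClassifier().fit(X_train, y_train)  # random_state=None
>>>>     # split feature/threshold of the first few nodes can differ across runs
>>>>     print(tree.tree_.feature[:3], tree.tree_.threshold[:3])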
>>>>
>>>> So, I suspect that the features are shuffled -- let's call the result
>>>> X_shuffled. Then at some point max_features columns are selected, which by
>>>> default is X_shuffled[:, :n_features]. Based on that, if there's a tie
>>>> between the impurities of different features, it probably selects the
>>>> first feature in the array among those ties.
>>>>
>>>> If this is true (I'd have to look into the code more deeply), I wonder
>>>> if it would be worthwhile to change the implementation so that the
>>>> shuffling only occurs if max_features < n_features, because this way we
>>>> could have deterministic behavior for the trees by default, which I'd
>>>> find more intuitive for plain decision trees, to be honest.
>>>>
>>>> Let me know what you all think.
>>>>
>>>> Best,
>>>> Sebastian
>>>>
>>>> > On Oct 27, 2018, at 11:07 PM, Julio Antonio Soto de Vicente <
>>>> julio at esbet.es> wrote:
>>>> >
>>>> > Hmmm that’s weird...
>>>> >
>>>> > Have you tried to plot the trees (the decision rules) for different
>>>> > seeds, and to check whether the gain for the first split is the same
>>>> > even if the split itself is different?
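>>>> >
>>>> > Something along these lines, perhaps (a sketch; export_graphviz prints
>>>> > the impurity of every node, so you can read off whether the first
>>>> > split's gain matches across seeds):
>>>> >
>>>> > from sklearn.datasets import load_iris
>>>> > from sklearn.tree import DecisionTreeClassifier, export_graphviz
>>>> >
>>>> > X, y = load_iris(return_X_y=True)
>>>> > for seed in (0, 1, 2):
>>>> >     tree = DecisionTreeClassifier(random_state=seed).fit(X, y)
>>>> >     # dot output for the root and its children only
>>>> >     print(export_graphviz(tree, out_file=None, max_depth=1))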
>>>> >
>>>> > I’d at least try that before diving into the source code...
>>>> >
>>>> > Cheers,
>>>> >
>>>> > --
>>>> > Julio
>>>> >
>>>> >> El 28 oct 2018, a las 2:24, Sebastian Raschka <
>>>> mail at sebastianraschka.com> escribió:
>>>> >>
>>>> >> Thanks, Javier,
>>>> >>
>>>> >> however, max_features is n_features by default. But if you
>>>> >> execute something like
>>>> >>
>>>> >> import numpy as np
>>>> >> from sklearn.datasets import load_iris
>>>> >> from sklearn.model_selection import train_test_split
>>>> >> from sklearn.tree import DecisionTreeClassifier
>>>> >>
>>>> >> iris = load_iris()
>>>> >> X, y = iris.data, iris.target
>>>> >> X_train, X_test, y_train, y_test = train_test_split(X, y,
>>>> >>                                                   test_size=0.3,
>>>> >>                                                   random_state=123,
>>>> >>                                                   shuffle=True,
>>>> >>                                                   stratify=y)
>>>> >>
>>>> >> for i in range(20):
>>>> >>   tree = DecisionTreeClassifier()
>>>> >>   tree.fit(X_train, y_train)
>>>> >>   print(tree.score(X_test, y_test))
>>>> >>
>>>> >>
>>>> >>
>>>> >> You will find that the tree produces different results if you don't
>>>> >> fix the random seed. Related to what you said about the random feature
>>>> >> selection when max_features is not n_features, I suspect that there is
>>>> >> generally some sorting of the features going on, and that the different
>>>> >> trees are then due to tie-breaking when two features have the same
>>>> >> information gain?
>>>> >>
>>>> >> Best,
>>>> >> Sebastian
>>>> >>
>>>> >>
>>>> >>
>>>> >>> On Oct 27, 2018, at 6:16 PM, Javier López <jlopez at ende.cc> wrote:
>>>> >>>
>>>> >>> Hi Sebastian,
>>>> >>>
>>>> >>> I think the random state is used to select the features that go
>>>> into each split (look at the `max_features` parameter)
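>>>> >>>
>>>> >>> E.g. something like this (a sketch) makes that source of randomness
>>>> >>> visible, since only 2 of the 4 iris features are considered per split:
>>>> >>>
>>>> >>> from sklearn.datasets import load_iris
>>>> >>> from sklearn.tree import DecisionTreeClassifier
>>>> >>>
>>>> >>> X, y = load_iris(return_X_y=True)
>>>> >>> for seed in range(3):
>>>> >>>     tree = DecisionTreeClassifier(max_features=2, random_state=seed).fit(X, y)
>>>> >>>     print(tree.tree_.feature[0])   # root feature may differ from seed to seed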
>>>> >>>
>>>> >>> Cheers,
>>>> >>> Javier
>>>> >>>
>>>> >>> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <
>>>> mail at sebastianraschka.com> wrote:
>>>> >>> Hi all,
>>>> >>>
>>>> >>> when I was implementing a bagging classifier based on
>>>> >>> scikit-learn's DecisionTreeClassifier, I noticed that the results were
>>>> >>> not deterministic and found that this was due to the random_state in
>>>> >>> the DecisionTreeClassifier (which is set to None by default).
>>>> >>>
>>>> >>> I am wondering what exactly this random state is used for? I can
>>>> >>> imagine it being used for resolving ties if the information gain for
>>>> >>> multiple features is the same, or it could be that the splits of
>>>> >>> continuous features come out differently? (I thought the heuristic is
>>>> >>> to sort the feature values and to consider those values next to each
>>>> >>> other that are associated with examples of different class labels --
>>>> >>> but is there maybe some random subselection involved?)
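>>>> >>>
>>>> >>> For concreteness, the kind of heuristic I have in mind (a sketch, not
>>>> >>> scikit-learn's actual code; the values are made up):
>>>> >>>
>>>> >>> x = [1.0, 1.5, 4.5, 5.0, 5.5]   # one continuous feature, sorted
>>>> >>> y = [0,   0,   1,   1,   2  ]   # class labels
>>>> >>>
>>>> >>> # candidate thresholds: midpoints between adjacent values whose labels differ
>>>> >>> candidates = [(a + b) / 2
>>>> >>>               for a, b, ya, yb in zip(x[:-1], x[1:], y[:-1], y[1:])
>>>> >>>               if ya != yb]
>>>> >>> print(candidates)   # [3.0, 5.25]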
>>>> >>>
>>>> >>> If someone knows more about this, where the random_state is used,
>>>> I'd be happy to hear it :)
>>>> >>>
>>>> >>> Also, we could then maybe add the info to the
>>>> DecisionTreeClassifier's docstring, which is currently a bit too generic to
>>>> be useful, I think:
>>>> >>>
>>>> >>>
>>>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>>>> >>>
>>>> >>>
>>>> >>>   random_state : int, RandomState instance or None, optional
>>>> (default=None)
>>>> >>>       If int, random_state is the seed used by the random number
>>>> generator;
>>>> >>>       If RandomState instance, random_state is the random number
>>>> generator;
>>>> >>>       If None, the random number generator is the RandomState
>>>> instance used
>>>> >>>       by `np.random`.
>>>> >>>
>>>> >>>
>>>> >>> Best,
>>>> >>> Sebastian
>>>>
>>>
>>>
>>> --
>>> Piotr Szymański
>>> niedakh at gmail.com
>>>
>>
>>
>> --
>>
>> Fernando Marcos Wittmann
>> MS Student - Energy Systems Dept.
>> School of Electrical and Computer Engineering, FEEC
>> University of Campinas, UNICAMP, Brazil
>> +55 (19) 987-211302
>>
>>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>


-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/