[scikit-learn] How does the random state influence the decision tree splits?
Sebastian Raschka
mail at sebastianraschka.com
Sat Oct 27 20:24:50 EDT 2018
Thanks, Javier,
however, `max_features` is `n_features` by default. But if you execute something like
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    shuffle=True,
                                                    stratify=y)

for i in range(20):
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))
you will find that the tree produces different results if you don't fix the random seed. My suspicion, related to what you said about the random feature selection when `max_features` is smaller than `n_features`, is that there is generally some sorting (or permutation) of the features going on, and that the different trees are then due to tie-breaking when two features have the same information gain?
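As a quick sanity check on that hypothesis, here is a sketch of the converse experiment: fixing `random_state` on the classifier itself should make every run build an identical tree, so all 20 test scores collapse to a single value (this only demonstrates determinism, not the tie-breaking mechanism itself):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=123,
    shuffle=True, stratify=iris.target)

scores = set()
for i in range(20):
    # fixing the classifier's random_state; max_features is still the
    # default, so any remaining randomness would have to come from
    # within the split search itself
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train, y_train)
    scores.add(tree.score(X_test, y_test))

# with a fixed seed, all 20 fits yield the same score
print(len(scores))
```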
Best,
Sebastian
> On Oct 27, 2018, at 6:16 PM, Javier López <jlopez at ende.cc> wrote:
>
> Hi Sebastian,
>
> I think the random state is used to select the features that go into each split (look at the `max_features` parameter)
>
> Cheers,
> Javier
>
> On Sun, Oct 28, 2018 at 12:07 AM Sebastian Raschka <mail at sebastianraschka.com> wrote:
> Hi all,
>
> when I was implementing a bagging classifier based on scikit-learn's DecisionTreeClassifier, I noticed that the results were not deterministic and found that this was due to the random_state in the DecisionTreeClassifier (which is set to None by default).
>
> I am wondering what exactly this random state is used for? I can imagine it being used to resolve ties when the information gain for multiple features is the same, or it could be that the split points chosen for continuous features differ? (I thought the heuristic is to sort the feature values and to consider as candidate thresholds those values adjacent to examples with different class labels -- but is there maybe some random subselection involved?)
>
> If someone knows more about this, where the random_state is used, I'd be happy to hear it :)
>
> Also, we could then maybe add the info to the DecisionTreeClassifier's docstring, which is currently a bit too generic to be useful, I think:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py
>
>
> random_state : int, RandomState instance or None, optional (default=None)
> If int, random_state is the seed used by the random number generator;
> If RandomState instance, random_state is the random number generator;
> If None, the random number generator is the RandomState instance used
> by `np.random`.
>
>
> Best,
> Sebastian
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn