[scikit-learn] Understanding sklearn.tree._tree.value object

Pranav Ashok pranavashok at gmail.com
Mon Oct 8 14:51:21 EDT 2018

I have a multi-class multi-label decision tree trained with the
DecisionTreeClassifier class. The input looks as follows:

X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
Y = [[1,2,3],[1,2,3],[1,2,3],[1,2],[1,2],[1],[1]]

I have used MultiLabelBinarizer to convert Y into

[[1 1 1]
 [1 1 1]
 [1 1 1]
 [1 1 0]
 [1 1 0]
 [1 0 0]
 [1 0 0]]
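
For anyone who wants to reproduce the conversion, a minimal sketch follows. It assumes MultiLabelBinarizer is used with its default settings; the variable names mlb and Y_bin are only illustrative and do not come from the original code.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Label sets from the post: each sample carries one or more of the labels 1, 2, 3.
Y = [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2], [1], [1]]

# Default settings assumed; the column order follows mlb.classes_.
mlb = MultiLabelBinarizer()
Y_bin = mlb.fit_transform(Y)

print(mlb.classes_)  # which label each column stands for
print(Y_bin)         # one row per sample, one 0/1 column per label
```

Note that the first column is 1 for every sample, so that label only ever takes a single value in the training data.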

After training, the tree_.value array of the fitted classifier looks as follows:

array([[[7., 0.],
        [2., 5.],
        [4., 3.]],

       [[3., 0.],
        [0., 3.],
        [0., 3.]],

       [[4., 0.],
        [2., 2.],
        [4., 0.]],

       [[2., 0.],
        [0., 2.],
        [2., 0.]],

       [[2., 0.],
        [2., 0.],
        [2., 0.]]])
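
A minimal sketch of how such an array can be obtained is given here. It assumes default hyperparameters apart from random_state (the settings actually used are not stated in this post), and the names clf and value are illustrative. As far as I know, the array has shape (node_count, n_outputs, max_n_classes), with the last axis indexed by position in the per-output class list clf.classes_[k], which is why that list is printed alongside. Depending on the scikit-learn version, the entries may be sample counts as shown above or per-node class fractions, so check the release you have installed.

```python
from sklearn.tree import DecisionTreeClassifier

X = [[2, 51], [3, 20], [5, 30], [7, 1], [20, 46], [25, 25], [45, 70]]
# Binarized labels from the post (columns are labels 1, 2, 3).
Y_bin = [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0]]

# Hyperparameters are an assumption; random_state only fixes tie-breaking.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, Y_bin)

value = clf.tree_.value
print(value.shape)  # (node_count, n_outputs, max_n_classes)

# For multi-output problems, classes_ holds one array of observed classes per output.
for k, classes_k in enumerate(clf.classes_):
    print("output", k, "classes", classes_k)

# Per-node values for the root (node 0): one row per output.
print(value[0])
```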

I had the impression that the value array contains, for each node, a
list of lists [[n_1, y_1], [n_2, y_2], [n_3, y_3]],
where n_i is the number of samples disagreeing with class i and
y_i is the number of samples agreeing with class i.
But the output above does not fit that interpretation.

For example, the root node has the value [[7,0],[2,5],[4,3]].
According to my interpretation, this would mean that
7 samples disagree with class 1 and 0 agree with it; 2 disagree with
class 2 and 5 agree with class 2; 4 disagree with class 3 and 3 agree
with class 3.

According to the input dataset, the entry for class 1 is wrong: all 7
samples carry class 1. The entries for classes 2 and 3 do match.
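
The comparison against the dataset can be sketched in plain Python. The helper disagree_agree_counts is hypothetical, written only for this check, and it encodes the interpretation described above ([n_i, y_i] per class), not scikit-learn's actual semantics.

```python
# Binarized label matrix from the post (columns are classes 1, 2, 3).
Y_bin = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]

def disagree_agree_counts(rows):
    """Return [n_i, y_i] per class column: number of 0 entries, number of 1 entries."""
    counts = []
    for k in range(len(rows[0])):
        column = [row[k] for row in rows]
        counts.append([column.count(0), column.count(1)])
    return counts

# The root node sees all 7 samples.
root_counts = disagree_agree_counts(Y_bin)
print(root_counts)  # [[0, 7], [2, 5], [4, 3]]
```

Under this interpretation the rows for classes 2 and 3 agree with the reported root value, while the row for class 1 comes out as [0, 7] rather than [7, 0]. One thing worth checking, stated here as an assumption rather than a confirmed answer, is that the columns are indexed by the distinct values observed for each label, and the first label only ever takes the value 1 in this dataset.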

Could someone please help me understand the semantics of tree_.value
for multi-label decision trees?