[scikit-learn] Decision stubs?

Stuart Reynolds stuart at stuartreynolds.net
Sun Aug 27 18:23:45 EDT 2017

Is it possible to efficiently get at the branch statistics that
decision tree algorithms iterate over in scikit-learn?

For example, if the root population has these class counts in the output vector:
   c0: 5000
   c1: 500

Then I'd like to iterate over:
# For a boolean (two-valued category)
   f1=True:     c0=3000,  c1=450
   f1=False:    c0=300,   c1=30
   f1=Null:     c0=1700,  c1=20   # Is Null considered as a branch?

# For a continuous value
   f2<10:       c0= ...  c1= ...
   f2>=10:      c0= ...  c1= ...

   f2<22:       c0= ...  c1= ...
   f2>=22:      c0= ...  c1= ...
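
To make the tables above concrete, here is a minimal sketch of the statistics meant, written with plain NumPy masks. It is not scikit-learn's API. The helper name branch_counts, the toy arrays, and the encoding of f1 (1.0 = True, 0.0 = False, NaN = Null) are all assumptions made for illustration.

```python
import numpy as np

def branch_counts(mask, y, n_classes):
    """Class counts among the rows selected by a boolean mask (hypothetical helper)."""
    return np.bincount(y[mask], minlength=n_classes)

# Toy data, invented for illustration only.
y = np.array([0, 0, 1, 0, 1, 0, 0, 1])                           # class labels c0/c1
f1 = np.array([1.0, 0.0, 1.0, np.nan, 1.0, 0.0, np.nan, 0.0])    # boolean feature with nulls
f2 = np.array([3.0, 12.0, 25.0, 8.0, 15.0, 30.0, 9.0, 21.0])     # continuous feature

# One row of counts per branch of the boolean feature, nulls kept as their own branch.
boolean_stats = {
    "f1=True": branch_counts(f1 == 1.0, y, 2),
    "f1=False": branch_counts(f1 == 0.0, y, 2),
    "f1=Null": branch_counts(np.isnan(f1), y, 2),
}

# For each candidate threshold, counts on the "<" side and on the ">=" side.
threshold_stats = {
    t: (branch_counts(f2 < t, y, 2), branch_counts(f2 >= t, y, 2))
    for t in (10, 22)
}
```

Each threshold here rescans the whole column, which is exactly the inefficiency in question.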

I'd like to experiment with building models on demand for each input
row at predict time.
To do this efficiently, I'd like to reduce the training set to the 'most
significant' sub-space(s) using the population statistics.

I can do it in pandas, although it's fairly inefficient to iterate over
each feature column many times.
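
One possible way to avoid rescanning a column for every threshold is to sort it once and take cumulative class counts. The sketch below is only an illustration of that idea and is not scikit-learn's implementation. The function name threshold_class_counts is hypothetical, and it assumes a non-empty feature column with no missing values and integer class labels in the range 0..n_classes-1.

```python
import numpy as np

def threshold_class_counts(x, y, n_classes):
    """Class counts for x < t and x >= t at every distinct value t of x.

    Hypothetical helper: assumes x is non-empty, has no NaNs, and y holds
    integer labels in 0..n_classes-1.
    """
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    # One-hot encode the sorted labels, then accumulate down the rows.
    onehot = np.eye(n_classes, dtype=np.int64)[ys]
    cum = np.cumsum(onehot, axis=0)
    total = cum[-1]
    # first_idx[i] is the number of rows strictly below thresholds[i].
    thresholds, first_idx = np.unique(xs, return_index=True)
    padded = np.vstack([np.zeros((1, n_classes), dtype=np.int64), cum])
    left = padded[first_idx]    # counts for x < t
    right = total - left        # counts for x >= t
    return thresholds, left, right

# Toy data, invented for illustration only.
y = np.array([0, 0, 1, 0, 1, 0, 0, 1])
f2 = np.array([3.0, 12.0, 25.0, 8.0, 15.0, 30.0, 9.0, 21.0])
thresholds, left, right = threshold_class_counts(f2, y, 2)
```

This costs one sort per feature rather than one pass per threshold; left[i] and right[i] give the c0/c1 counts on either side of thresholds[i].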

- Stu
