[scikit-learn] Missing data and decision trees

Raghav R V ragvrv at gmail.com
Thu Oct 13 16:17:25 EDT 2016


Hi Stuart Reynolds,

Like Jacob said we have an active PR at
https://github.com/scikit-learn/scikit-learn/pull/5974

You could do

git fetch https://github.com/raghavrv/scikit-learn.git missing_values_rf:missing_values_rf
git checkout missing_values_rf
python setup.py install

And try it out. I warn you, though: there are some memory leaks I'm still
trying to debug. But for the most part it works well and outperforms basic
imputation techniques.
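For comparison, a basic imputation baseline of the kind mentioned above can
be sketched with stock scikit-learn. (SimpleImputer is the current name of
the imputation transformer; the toy data and settings here are illustrative
only, not from the PR's benchmarks.)

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy data with missing entries in the first feature.
X = np.array([[np.nan, 1.0], [2.0, 0.0], [np.nan, 0.0],
              [3.0, 1.0], [4.0, 0.0], [1.0, 1.0]])
y = np.array([1, 0, 0, 1, 0, 1])

# Baseline: mean-impute the NaNs, then fit a plain decision tree.
clf = make_pipeline(SimpleImputer(strategy="mean"),
                    DecisionTreeClassifier(random_state=0))
clf.fit(X, y)
print(clf.predict(X))
```

The PR's approach is meant to beat this kind of pipeline because the tree
sees the missingness itself rather than an imputed stand-in value.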

Please let us know if it breaks or does not solve your use case. Your input
as a user of that feature would be invaluable!

> I ran into this several times as well with scikit-learn's implementation of
GBM. Look at xgboost if you have not already (is there someone out there
that hasn't? :) - it deals with missing values in the predictor space in a
very elegant manner.
http://xgboost.readthedocs.io/en/latest/python/python_intro.html

The PR takes a conceptually similar approach. It is currently implemented
for DecisionTreeClassifier. After review and integration,
DecisionTreeRegressor will also support missing values. Once that happens,
enabling it in gradient boosting will be possible.
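For readers curious what "conceptually similar" means: xgboost learns, at
each split, a default direction in which to send missing values. Below is a
minimal, self-contained sketch of that idea for a single split on one
feature - my own illustration of the concept, not code from the PR or from
xgboost.

```python
import numpy as np

def gini(y):
    """Gini impurity of an integer label array."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split_with_default(x, y):
    """Pick (threshold, default side, score) minimizing weighted Gini.

    Missing values (NaN) in x are routed to whichever side of the split
    yields the lower impurity -- the "default direction" learned per node.
    """
    miss = np.isnan(x)
    best = (None, None, np.inf)
    for t in np.unique(x[~miss]):
        for default_left in (True, False):
            left = (x <= t) & ~miss
            right = (x > t) & ~miss
            if default_left:
                left = left | miss       # send NaNs down the left branch
            else:
                right = right | miss     # send NaNs down the right branch
            score = (left.sum() * gini(y[left]) +
                     right.sum() * gini(y[right])) / len(y)
            if score < best[2]:
                best = (t, "left" if default_left else "right", score)
    return best

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1])
print(best_split_with_default(x, y))
```

Here the missing sample is labeled 0, so routing it left (with the other
zeros) gives a pure split, and the search discovers exactly that.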

Thanks for the interest!!

On Thu, Oct 13, 2016 at 8:33 PM, Raphael C <drraph at gmail.com> wrote:

> You can simply make a new binary feature (per feature that might have a
> missing value) that is 1 if the value is missing and 0 otherwise.  The RF
> can then work out what to do with this information.
>
> I don't know how this compares in practice to more sophisticated
> approaches.
>
> Raphael
>
>
> On Thursday, October 13, 2016, Stuart Reynolds <stuart at stuartreynolds.net>
> wrote:
>
>> I'm looking for a decision tree and RF implementation that supports
>> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>>
>> It seems that scikit's decision tree algorithm doesn't allow this --
>> which is disappointing because it's one of the few methods that should be
>> able to sensibly handle problems with high amounts of missingness.
>>
>> Are there plans to allow missing data in scikit's decision trees?
>>
>> Also, is there any particular reason why missing values weren't supported
>> originally (e.g. integrates poorly with other features)?
>>
>> Regards
>> - Stuart
>>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
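Raphael's indicator-feature suggestion above can be sketched in a few lines
of plain scikit-learn. The constant fill value and forest settings below are
arbitrary illustrative choices, and the data is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
# Knock out ~20% of the entries at random.
X[rng.uniform(size=X.shape) < 0.2] = np.nan

miss = np.isnan(X)                   # one binary indicator column per feature
X_filled = np.where(miss, 0.0, X)    # constant fill so the forest can fit
X_aug = np.hstack([X_filled, miss.astype(float)])  # carry the missingness

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_aug, y)
print(clf.score(X_aug, y))
```

The forest can then learn to split on the indicator columns wherever
missingness itself is informative, at the cost of doubling the feature count.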

