[scikit-learn] classification model that can handle missing values w/o learning from missing values
Martin Gütlein
guetlein at posteo.de
Fri Mar 10 08:19:09 EST 2023
Hi Gaël,
> [...] the other one that
> amounts to imputation using a forest, and can be done in scikit-learn
> by setting up the IterativeImputer to use forests as a base learner
> (this will however be slow).
The main difference is that when I use the IterativeImputer in
scikit-learn, I still have to apply this imputation to the test set
before I can predict with the RF. Other implementations do not impute
missing values; instead they split up the test instance.
In my experience this makes a big difference: it lets you use features
where the majority of values are missing, even when the class ratio of
the examples with missing values differs largely from that of the
examples without missing values.
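To make the first approach concrete, here is a minimal sketch (not from the original mails) of the imputation-based workflow being discussed: the IterativeImputer is fitted on the training data and the same fitted imputer must transform the test set before the RF can predict. Wrapping both in a pipeline keeps that bookkeeping implicit; the data below is synthetic.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.rand(200, 4) < 0.3] = np.nan  # inject ~30% missing values

X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

# Imputer and classifier are fitted on the training data only; at
# predict time the pipeline applies the *fitted* imputer to the test
# set before the forest sees it -- the extra step Martin refers to.
clf = make_pipeline(
    IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=10, random_state=0),
        max_iter=5, random_state=0),
    RandomForestClassifier(random_state=0),
)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)  # imputation happens inside the pipeline
```

As Gaël notes below, using forests as the imputer's base learner is slow; the pipeline merely hides, rather than removes, the test-set imputation step.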
Kind regards,
Martin
On 03.03.2023 15:41, Gael Varoquaux wrote:
> On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
>> > 2. Ignores whether a value is missing or not for the inference
>> What I meant is rather that the missing value should NOT be treated
>> as another possible value of the variable (this is, e.g., what the
>> HistGradientBoostingClassifier implementation in scikit-learn does).
>> Instead, multiple predictions could be made when a split attribute
>> is missing, and those can be averaged.
>
>> This is how it is implemented e.g. in WEKA (we cannot switch to
>> Java, though ;-):
>> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
>> and described by the inventors of the RF:
>> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
>
> The text that you link to describes two types of strategies, one that
> is similar to that done in HistGradientBoosting, the other one that
> amounts to imputation using a forest, and can be done in scikit-learn
> by setting up the IterativeImputer to use forests as a base learner
> (this will however be slow).
>
> Cheers,
>
> Gaël
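The instance-splitting behaviour Martin describes (and that WEKA and Breiman's write-up use) can be sketched with a toy tree node; this is a hypothetical illustration, not scikit-learn or WEKA API. When the split feature is missing, the prediction descends both branches and averages the results, weighted by the fraction of training rows that went each way:

```python
# Hypothetical toy decision-tree node, for illustration only.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 left_frac=0.5, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.left_frac = left_frac  # fraction of training rows sent left
        self.value = value          # class-1 probability at a leaf

def predict_proba(node, x):
    if node.value is not None:           # leaf: return stored probability
        return node.value
    v = x[node.feature]
    if v is None:                        # missing: follow BOTH branches
        return (node.left_frac * predict_proba(node.left, x)
                + (1 - node.left_frac) * predict_proba(node.right, x))
    branch = node.left if v <= node.threshold else node.right
    return predict_proba(branch, x)

tree = Node(feature=0, threshold=0.0, left_frac=0.4,
            left=Node(value=0.9), right=Node(value=0.1))
p = predict_proba(tree, [None])  # 0.4 * 0.9 + 0.6 * 0.1 = 0.42
```

The key difference from imputation is that nothing is filled in: the missing value never acts as a feature value, so the tree cannot learn from missingness itself.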