[scikit-learn] classification model that can handle missing values w/o learning from missing values

Martin Gütlein guetlein at posteo.de
Fri Mar 10 08:19:09 EST 2023


Hi Gaël,

> [...] the other one that
> amounts to imputation using a forest, and can be done in scikit-learn
> by setting up the IterativeImputer to use forests as a base learner
> (this will however be slow).

The main difference is that when I use the IterativeImputer in 
scikit-learn, I still have to apply the imputation to the test set 
before I can predict with the RF. Other implementations, however, do not 
impute missing values at all, but instead split up the test instance.
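A minimal sketch of the scikit-learn workflow described above (with made-up random data; the parameters are illustrative, not a recommendation): the imputer is fit on the training set, and the same fitted imputer has to be applied to the test set before the RF can predict.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(100, 4)
X_train[rng.rand(100, 4) < 0.3] = np.nan   # ~30% missing at random
y_train = rng.randint(0, 2, 100)
X_test = rng.rand(20, 4)
X_test[rng.rand(20, 4) < 0.3] = np.nan

# imputation by a forest, as Gaël suggests (slow on real data)
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    random_state=0,
)
clf = RandomForestClassifier(random_state=0)
clf.fit(imputer.fit_transform(X_train), y_train)

# the test set must go through the imputer as well
pred = clf.predict(imputer.transform(X_test))
```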

In my experience this makes a big difference: it lets you use features 
where the majority of values are missing, and where, at the same time, 
the class ratio of the examples with missing values differs largely 
from that of the examples without.
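The split-instance handling referred to above (as described on the linked Breiman page) can be sketched as follows. This is a hypothetical minimal tree, not scikit-learn API: when the split attribute is missing, the node recurses into both children and averages their outputs, weighted by the fraction of training samples that went each way.

```python
import math

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 left_frac=0.5, proba=None):
        self.feature = feature      # index of the split feature (None at a leaf)
        self.threshold = threshold
        self.left, self.right = left, right
        self.left_frac = left_frac  # fraction of training samples sent left
        self.proba = proba          # class probabilities at a leaf

    def predict_proba(self, x):
        if self.feature is None:    # leaf node
            return self.proba
        v = x[self.feature]
        if v is None or (isinstance(v, float) and math.isnan(v)):
            # split attribute missing: follow both branches and average
            l = self.left.predict_proba(x)
            r = self.right.predict_proba(x)
            w = self.left_frac
            return [w * a + (1 - w) * b for a, b in zip(l, r)]
        child = self.left if v <= self.threshold else self.right
        return child.predict_proba(x)

# toy tree: split on feature 0 at 0.5; 70% of training rows went left
tree = Node(feature=0, threshold=0.5, left_frac=0.7,
            left=Node(proba=[0.9, 0.1]), right=Node(proba=[0.2, 0.8]))

p_known = tree.predict_proba([0.3])    # feature present: left leaf
p_missing = tree.predict_proba([None]) # feature missing: weighted average
```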

Kind regards,
Martin





Am 03.03.2023 15:41 schrieb Gael Varoquaux:
> On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
>> > 2. Ignores whether a value is missing or not for the inference
>> What I meant is rather, that the missing value should NOT be treated 
>> as
>> another possible value of the variable (this is e.g., what the
>> HistGradientBoostingClassifier implementation in sk-learn does). 
>> Instead,
>> multiple predictions could be done when a split-attribute is missing, 
>> and
>> those can be averaged.
> 
>> This is how it is e.g. implemented in WEKA (we cannot switch to Java, 
>> though
>> ;-):
>> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
>> and described by the inventors of the RF:
>> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
> 
> The text that you link to describes two types of strategies, one that
> is similar to that done in HistGradientBoosting, the other one that
> amounts to imputation using a forest, and can be done in scikit-learn
> by setting up the IterativeImputer to use forests as a base learner
> (this will however be slow).
> 
> Cheers,
> 
> Gaël
