[scikit-learn] classification model that can handle missing values w/o learning from missing values

Fri Mar 3 09:41:09 EST 2023

On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
> > 2. Ignores whether a value is missing or not for the inference
> What I meant is rather, that the missing value should NOT be treated as
> another possible value of the variable (this is e.g., what the
> HistGradientBoostingClassifier implementation in sk-learn does). Instead,
> multiple predictions could be done when a split-attribute is missing, and
> those can be averaged.

> This is how it is e.g. implemented in WEKA (we cannot switch to Java, though
> ;-):
> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
> and described by the inventors of the RF:
> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1

The text that you link to describes two strategies. One is similar to what HistGradientBoosting does. The other amounts to imputation using a forest, and can be done in scikit-learn by setting up the IterativeImputer to use forests as the base estimator (this will, however, be slow).
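
For the second strategy, here is a minimal sketch of what I have in mind. The synthetic data, the 10% missingness rate, and the hyperparameters (n_estimators, max_iter) are assumptions picked only to keep the example quick; they are not recommendations. Note that this approximates forest-based imputation followed by classification; it is not the prediction-averaging scheme from WEKA.

```python
import numpy as np

# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline

# Synthetic data (an assumption for illustration): mask ~10% of entries.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.RandomState(0)
X_missing = X.copy()
X_missing[rng.rand(*X.shape) < 0.1] = np.nan

# Each feature with missing values is regressed on the others using a forest.
# Small n_estimators and max_iter keep the runtime down; tune them for real data.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=0),
    max_iter=3,
    random_state=0,
)

# Impute first, then classify; the classifier never sees NaN.
model = make_pipeline(
    imputer,
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_missing, y)

pred = model.predict(X_missing)
X_imputed = model[0].transform(X_missing)
```

With so few iterations the imputer may emit a ConvergenceWarning; raising max_iter addresses that at the cost of more runtime, which is where the slowness comes from.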

Cheers,

Gaël