[scikit-learn] classification model that can handle missing values w/o learning from missing values

Martin Gütlein guetlein at posteo.de
Thu Mar 2 04:01:45 EST 2023

It would already help us if someone could confirm that this is not 
possible in scikit-learn, because we are still not entirely sure that 
we have not missed something.


On 21.02.2023 15:48, Martin Gütlein wrote:
> Hi,
> I am looking for a classification model in python that can handle
> missing values, without imputation and "without learning from missing
> values", i.e. without using the fact that the information is missing
> for the inference.
> Explained with the help of decision trees:
> * The algorithm should NOT learn whether missing values should go to
> the left or right child (like the HistGradientBoostingClassifier).
> * Instead it could build the prediction for each child node and
> aggregate these (like some Random Forest implementations do).
> If that is not possible in scikit-learn, maybe you have already
> discussed this? Or do you know of a fork of scikit-learn that is able
> to do this, or some other Python library?
> Any help would be really appreciated, kind regards,
> Martin
> P.S. Here is my use-case, in case you are interested: I have a binary
> classification problem with a positive and a negative class, and two
> types of features A and B. In my training data, I have a lot more data
> (90%) where B is missing. In my test data, I always have B, which is
> good because the B features are better than the A features. In the
> cases where B is present in the training data, the ratio of positive
> examples is much higher than when it is missing. So
> HistGradientBoostingClassifier uses the fact that B is not
> missing in the test data, and predicts way too many positives.
> (Additionally, some feature values of type A are also often missing.)
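
The aggregation described in the second bullet above can be sketched as
follows. This is only an illustration under stated assumptions, not an
existing scikit-learn feature: the helper name predict_proba_marginalized
is hypothetical, the tree is trained on fully observed rows only, and a
missing value at a split is handled by descending into both children and
weighting their predictions by the number of training samples in each
child. The data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def predict_proba_marginalized(clf, x):
    """Class probabilities for one sample x that may contain NaN.

    Hypothetical helper (not part of scikit-learn). When the feature
    tested at a node is missing, both children are evaluated and their
    predictions are averaged, weighted by the training samples that
    reached each child. Missingness itself is never used as a signal.
    """
    tree = clf.tree_

    def recurse(node):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == -1:  # leaf node
            dist = tree.value[node][0]
            return dist / dist.sum()
        v = x[tree.feature[node]]
        if np.isnan(v):
            w_left = tree.weighted_n_node_samples[left]
            w_right = tree.weighted_n_node_samples[right]
            total = w_left + w_right
            return (w_left * recurse(left) + w_right * recurse(right)) / total
        # Same rule scikit-learn uses for observed values.
        return recurse(left if v <= tree.threshold[node] else right)

    return recurse(0)


# Synthetic, fully observed training data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)).astype(np.float32)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

x_missing = X[0].copy()
x_missing[1] = np.nan  # feature 1 is unknown at prediction time
print(predict_proba_marginalized(clf, x_missing))
```

For a sample without missing values the helper follows the same path as
clf.predict_proba. Note that the sketch does not address training on rows
with missing values, which the use case above would also need.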
