[scikit-learn] classification model that can handle missing values w/o learning from missing values

Gael Varoquaux gael.varoquaux at normalesup.org
Fri Mar 3 02:33:31 EST 2023


Dear Martin,

From what I understand, you want a classifier that:
1. Is not based on imputation
2. Ignores whether a value is missing or not for the inference
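
Concretely, just to make sure we mean the same thing (a sketch using
standard scikit-learn idioms), requirement 1 rules out imputation
pipelines like the first estimator below, and requirement 2 rules out
estimators like the second:

    from sklearn.pipeline import make_pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import HistGradientBoostingClassifier

    # excluded by requirement 1: fill in the missing values, then classify
    imputing_clf = make_pipeline(SimpleImputer(strategy="mean"),
                                 LogisticRegression())

    # excluded by requirement 2: native NaN support that learns a
    # per-split "default direction" for missing values, i.e. that
    # learns from the missingness itself
    native_clf = HistGradientBoostingClassifier()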

It seems to me that those two requirements are in contradiction, and it is not clear to me how such a classifier would be theoretically grounded.

Best,

Gaël

On Thu, Mar 02, 2023 at 09:01:45AM +0000, Martin Gütlein wrote:
> It would already help us if someone could confirm that this is not
> possible in scikit-learn, because we are still not entirely sure that
> we have not missed something.

> Regards,
> Martin

> On 21.02.2023 15:48, Martin Gütlein wrote:
> > Hi,

> > I am looking for a classification model in Python that can handle
> > missing values, without imputation and "without learning from missing
> > values", i.e. without using the fact that the information is missing
> > for the inference.

> > Explained with the help of decision trees:
> > * The algorithm should NOT learn whether missing values should go to
> > the left or right child (like the HistGradientBoostingClassifier).
> > * Instead, it could build the prediction for each child node and
> > aggregate these predictions (like some Random Forest implementations
> > do); see the sketch below.
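
> > For illustration, here is a rough sketch of the aggregation I have
> > in mind (a hypothetical node structure, not an existing scikit-learn
> > API; the weights are the fractions of training samples that reached
> > each child):

> > import numpy as np

> > class Node:
> >     def __init__(self, feature=None, threshold=None, left=None,
> >                  right=None, proba=None, left_fraction=0.5):
> >         self.feature = feature        # index of the split feature
> >         self.threshold = threshold    # split threshold
> >         self.left = left              # child for x[feature] <= threshold
> >         self.right = right            # child for x[feature] > threshold
> >         self.proba = proba            # class probabilities, leaves only
> >         self.left_fraction = left_fraction  # training fraction sent left

> > def predict_proba(node, x):
> >     # Leaf: return its class distribution.
> >     if node.proba is not None:
> >         return np.asarray(node.proba)
> >     # Missing value: descend BOTH children and average their
> >     # predictions, weighted by how much training data went each way,
> >     # instead of following a learned default direction.
> >     if np.isnan(x[node.feature]):
> >         return (node.left_fraction * predict_proba(node.left, x)
> >                 + (1 - node.left_fraction) * predict_proba(node.right, x))
> >     # Observed value: normal routing.
> >     child = node.left if x[node.feature] <= node.threshold else node.right
> >     return predict_proba(child, x)

> > tree = Node(feature=0, threshold=0.0, left_fraction=0.4,
> >             left=Node(proba=[0.9, 0.1]), right=Node(proba=[0.2, 0.8]))
> > print(predict_proba(tree, np.array([np.nan])))  # weighted mix of both leaves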

> > If that is not possible in scikit-learn, maybe you have already
> > discussed this? Or do you know of a fork of scikit-learn that can do
> > this, or some other Python library?

> > Any help would be really appreciated, kind regards,
> > Martin


> > P.S. Here is my use case, in case you are interested: I have a binary
> > classification problem with a positive and a negative class, and two
> > types of features, A and B. In my training data, B is missing in most
> > rows (90%). In my test data, I always have B, which is good because
> > the B features are better than the A features. In the cases where B
> > is present in the training data, the ratio of positive examples is
> > much higher than when it is missing. So HistGradientBoostingClassifier
> > uses the fact that B is not missing in the test data, and predicts way
> > too many positives. (Additionally, some feature values of type A are
> > also often missing.)
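
> > To make that concrete, here is a toy reproduction (synthetic data;
> > the 10% / 0.7 / 0.2 numbers are invented for illustration):

> > import numpy as np
> > from sklearn.ensemble import HistGradientBoostingClassifier

> > rng = np.random.default_rng(0)
> > n = 20000
> > A = rng.normal(size=(n, 1))   # always present, uninformative here
> > B = rng.normal(size=(n, 1))   # the "better" feature type

> > # Training data: B is observed in only 10% of the rows, and those
> > # rows have a much higher positive rate than the rest.
> > observed = rng.random(n) < 0.1
> > y = (rng.random(n) < np.where(observed, 0.7, 0.2)).astype(int)
> > X_train = np.hstack([A, np.where(observed[:, None], B, np.nan)])

> > clf = HistGradientBoostingClassifier(random_state=0).fit(X_train, y)

> > # Test data: B is always present. The model has learned that
> > # "B not missing" goes with the positive class, so the predicted
> > # positive rate ends up far above the ~25% training base rate.
> > X_test = np.hstack([rng.normal(size=(n, 1)), rng.normal(size=(n, 1))])
> > print("training base rate:", y.mean())
> > print("predicted positive rate:", clf.predict(X_test).mean())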
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

-- 
    Gael Varoquaux
    Research Director, INRIA
    http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux

