[scikit-learn] classification model that can handle missing values w/o learning from missing values

Martin Gütlein guetlein at posteo.de
Fri Mar 3 05:22:04 EST 2023

Dear Gaël,

Thanks for your response.

> 2. Ignores whether a value is missing or not for the inference
What I meant is rather that the missing value should NOT be treated as 
another possible value of the variable (this is, e.g., what the 
HistGradientBoostingClassifier implementation in scikit-learn does). 
Instead, multiple predictions could be made when a split attribute is 
missing, and those can be averaged.
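
To illustrate the averaging idea, here is a minimal sketch (pure 
Python/NumPy, not the scikit-learn API; the Node class and 
predict_proba function are made up for illustration): when the split 
feature of a sample is missing, the tree descends into both children 
and averages their class-probability estimates, instead of routing the 
sample to a learned "missing" side.

import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, proba=None):
        self.feature = feature      # index of the split feature (None at leaves)
        self.threshold = threshold  # split threshold
        self.left = left            # child for x[feature] <= threshold
        self.right = right          # child for x[feature] > threshold
        self.proba = proba          # class probabilities stored at a leaf

def predict_proba(node, x):
    """Class probabilities for one sample x; np.nan marks a missing value."""
    if node.proba is not None:                 # reached a leaf
        return node.proba
    v = x[node.feature]
    if np.isnan(v):                            # split feature is missing:
        return 0.5 * (predict_proba(node.left, x)      # follow BOTH branches
                      + predict_proba(node.right, x))  # and average
    child = node.left if v <= node.threshold else node.right
    return predict_proba(child, x)

(WEKA's C4.5-style trees actually weight the two branches by the 
fraction of training instances that reached each child, rather than 
using a fixed 0.5.)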

This is, e.g., how it is implemented in WEKA (we cannot switch to Java, 
though ;-): 
and it is described by the inventors of the RF: 

I am pretty sure something similar is done in other classification 
algorithms, like naive Bayes, where each feature is handled separately 
anyway and missing ones could just be omitted.
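
For naive Bayes this is even more natural; a rough sketch, assuming 
Gaussian likelihoods whose log priors, means and variances have been 
fitted elsewhere (all names hypothetical): a missing feature simply 
contributes no factor to the product, rather than being treated as an 
extra value.

import numpy as np

def nb_log_posterior(x, log_priors, means, variances):
    """Unnormalized log posterior per class; np.nan marks a missing value."""
    log_post = log_priors.copy()               # shape (n_classes,)
    for j, v in enumerate(x):
        if np.isnan(v):                        # missing feature: skip its factor
            continue
        log_post += (-0.5 * np.log(2 * np.pi * variances[:, j])
                     - (v - means[:, j]) ** 2 / (2 * variances[:, j]))
    return log_post                            # argmax gives the predicted class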


On 03.03.2023 08:33, Gael Varoquaux wrote:
> Dear Martin,
> From what I understand, you want a classifier that:
> 1. Is not based on imputation
> 2. Ignores whether a value is missing or not for the inference
> It seems to me that those two requirements are in contradiction, and
> it is not clear to me how such a classifier would be theoretically
> grounded.
> Best,
> Gaël
> On Thu, Mar 02, 2023 at 09:01:45AM +0000, Martin Gütlein wrote:
>> It would already help us if someone could confirm that this is not
>> possible in scikit-learn, because we are still not entirely sure that
>> we have not missed something.
>> Regards,
>> Martin
>> On 21.02.2023 15:48, Martin Gütlein wrote:
>> > Hi,
>> > I am looking for a classification model in python that can handle
>> > missing values, without imputation and "without learning from missing
>> > values", i.e. without using the fact that the information is missing
>> > for the inference.
>> > Explained with the help of decision trees:
>> > * The algorithm should NOT learn whether missing values should go to
>> > the left or right child (like the HistGradientBoostingClassifier).
>> > Instead it could build the prediction for each child node and
>> > aggregate these (like some Random Forest implementations do).
>> > If that is not possible in scikit-learn, maybe you have already
>> > discussed this? Or you know of a fork of scikit-learn that is able to
>> > do this, or some other python library?
>> > Any help would be really appreciated, kind regards,
>> > Martin
>> > P.S. Here is my use-case, in case you are interested: I have a binary
>> > classification problem with a positive and a negative class, and two
>> > types of features A and B. In my training data, I have a lot more data
>> > (90%) where B is missing. In my test data, I always have B, which is
>> > good because the B features are better than the A features. In the
>> > cases where B is present in the training data, the ratio of positive
>> > examples is much higher than when it is missing. So
>> > HistGradientBoostingClassifier uses the fact that B is not missing
>> > in the test data, and predicts way too many positives.
>> > (Additionally, some feature values of type A are also often missing)
