[scikit-learn] classification model that can handle missing values w/o learning from missing values

Martin Gütlein guetlein at posteo.de
Fri Mar 3 05:22:04 EST 2023


Dear Gaël,

Thanks for your response.

> 2. Ignores whether a value is missing or not for the inference
What I meant is rather that the missing value should NOT be treated as
another possible value of the variable (which is, e.g., what the
HistGradientBoostingClassifier implementation in scikit-learn does).
Instead, when a split attribute is missing, multiple predictions could
be made and then averaged (see the sketch below).

This is how it is implemented in WEKA, for example (we cannot switch to
Java, though ;-):
http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
and it is described by the inventors of Random Forests:
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
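
In rough Python, the idea would be something like the sketch below. The
node structure and names are hypothetical (this is not scikit-learn's
internals); the point is only that a missing split feature sends the
sample down both children and the two predictions are averaged, weighted
by the fraction of training samples that went each way:

    import numpy as np

    class Node:
        def __init__(self, feature=None, threshold=None, left=None, right=None,
                     left_weight=0.5, class_probs=None):
            self.feature = feature          # index of the split feature (None at a leaf)
            self.threshold = threshold      # split threshold
            self.left = left                # child for feature value <= threshold
            self.right = right              # child for feature value > threshold
            self.left_weight = left_weight  # fraction of training samples that went left
            self.class_probs = class_probs  # class distribution (np.array) at a leaf

    def predict_proba(node, x):
        # Leaf: return its class distribution.
        if node.class_probs is not None:
            return node.class_probs
        value = x[node.feature]
        if np.isnan(value):
            # Missing split feature: combine both children's predictions,
            # instead of routing missing values to a dedicated side.
            return (node.left_weight * predict_proba(node.left, x)
                    + (1.0 - node.left_weight) * predict_proba(node.right, x))
        if value <= node.threshold:
            return predict_proba(node.left, x)
        return predict_proba(node.right, x)

This way "missingness" itself is never used as a signal; the tree just
falls back to the evidence it does have.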

I am pretty sure something similar is done in other classification
algorithms, like naive Bayes, where each feature is handled separately
anyway and missing ones can simply be omitted.
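
For Gaussian naive Bayes, a rough sketch of what I mean (written from
scratch for illustration, not the scikit-learn estimator) would be to
skip the likelihood term of any missing feature entirely:

    import numpy as np

    def gaussian_log_pdf(x, mean, var):
        return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def predict_log_scores(x, class_log_priors, means, variances):
        # means / variances: arrays of shape (n_classes, n_features)
        scores = class_log_priors.copy()
        for c in range(len(scores)):
            for j, value in enumerate(x):
                if np.isnan(value):
                    continue  # omit missing features; they contribute no evidence
                scores[c] += gaussian_log_pdf(value, means[c, j], variances[c, j])
        return scores  # unnormalised log-posteriors; argmax gives the predicted class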

Regards,
Martin

Am 03.03.2023 08:33 schrieb Gael Varoquaux:
> Dear Martin,
> 
> From what I understand, you want a classifier that:
> 1. Is not based on imputation
> 2. Ignores whether a value is missing or not for the inference
> 
> It seems to me that those two requirements are in contradiction, and
> it is not clear to me how such a classifier would be theoretically
> grounded.
> 
> Best,
> 
> Gaël
> 
> On Thu, Mar 02, 2023 at 09:01:45AM +0000, Martin Gütlein wrote:
>> It would already help us if someone could confirm that this is not
>> possible in scikit-learn, because we are still not entirely sure that
>> we have not missed something.
> 
>> Regards,
>> Martin
> 
>> Am 21.02.2023 15:48 schrieb Martin Gütlein:
>> > Hi,
> 
>> > I am looking for a classification model in python that can handle
>> > missing values, without imputation and "without learning from missing
>> > values", i.e. without using the fact that the information is missing
>> > for the inference.
> 
>> > Explained with the help of decision trees:
>> > * The algorithm should NOT learn whether missing values should go to
>> > the left or right child (like the HistGradientBoostingClassifier).
>> > * Instead, it could build the prediction for each child node and
>> > aggregate these (like some Random Forest implementations do).
> 
>> > If that is not possible in scikit-learn, maybe you have already
>> > discussed this? Or do you know of a fork of scikit-learn that is able to
>> > do this, or some other python library?
> 
>> > Any help would be really appreciated, kind regards,
>> > Martin
> 
> 
>> > P.S. Here is my use case, in case you are interested: I have a binary
>> > classification problem with a positive and a negative class, and two
>> > types of features, A and B. In my training data, B is missing for most
>> > samples (90%). In my test data, B is always present, which is good
>> > because the B features are better than the A features. In the cases
>> > where B is present in the training data, the ratio of positive
>> > examples is much higher than when it is missing. So
>> > HistGradientBoostingClassifier uses the fact that B is not missing in
>> > the test data and predicts way too many positives. (Additionally, some
>> > feature values of type A are also often missing.)
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
