[scikit-learn] classification model that can handle missing values w/o learning from missing values
Guillaume Lemaître
g.lemaitre58 at gmail.com
Fri Mar 10 08:38:31 EST 2023
Hi Martin,
I think that you could use `imbalanced-learn` and a bit of Pandas/NumPy to
get the behaviour that you want.
You can use a `FunctionSampler` (
https://imbalanced-learn.org/stable/references/generated/imblearn.FunctionSampler.html)
in which you remove the samples containing missing values.
This resampling is only applied when calling `fit`. You will need to use
the `Pipeline` from `imbalanced-learn` as well.
In a way, it seems that you want to resample the training set, which is
what the `Sampler` objects in `imbalanced-learn` are intended for. I put
a small sketch below.
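
Something along these lines should work (a rough sketch, not tested; using
`HistGradientBoostingClassifier` as the final estimator is just an example
of a model that can cope with NaN at predict time):

import numpy as np
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingClassifier

def drop_rows_with_nan(X, y):
    # keep only the training rows that contain no missing value
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mask = ~np.isnan(X).any(axis=1)
    return X[mask], y[mask]

model = Pipeline(steps=[
    # validate=False so that the sampler accepts input containing NaN
    ("drop_nan", FunctionSampler(func=drop_rows_with_nan, validate=False)),
    ("clf", HistGradientBoostingClassifier()),
])

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 1, 0, 1])
model.fit(X, y)     # the rows containing NaN are dropped here only
model.predict(X)    # the sampler is bypassed; NaN reaches the classifier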
Cheers,
On Fri, 10 Mar 2023 at 14:21, Martin Gütlein <guetlein at posteo.de> wrote:
> Hi Gaël,
>
> > [...] the other one that
> > amounts to imputation using a forest, and can be done in scikit-learn
> > by setting up the IterativeImputer to use forests as a base learner
> > (this will however be slow).
>
> The main difference is that when I use the IterativeImputer in
> scikit-learn, I still have to apply this imputation on the test set
> before being able to predict with the RF. However, other implementations
> do not impute missing values, but instead split up the test instance
> (propagating it down both branches of the split and averaging the
> resulting predictions).
>
> In my experience this makes a big difference: you can still use features
> where the majority of values are missing, even when the class ratio of
> the examples with missing values differs greatly from that of the
> examples without missing values.
>
> Kind regards,
> Martin
>
>
>
>
>
> On 03.03.2023 at 15:41, Gael Varoquaux wrote:
> > On Fri, Mar 03, 2023 at 10:22:04AM +0000, Martin Gütlein wrote:
> >> > 2. Ignores whether a value is missing or not for the inference
> >> What I meant, rather, is that the missing value should NOT be treated
> >> as another possible value of the variable (this is, e.g., what the
> >> HistGradientBoostingClassifier implementation in scikit-learn does).
> >> Instead, multiple predictions could be made when a split attribute is
> >> missing, and those can be averaged.
> >
> >> This is how it is, e.g., implemented in WEKA (we cannot switch to
> >> Java, though ;-):
> >> http://web.archive.org/web/20080601175721/http://wekadocs.com/node/2/#_edn4
> >> and described by the inventors of the RF:
> >> https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1
> >
> > The text that you link to describes two types of strategies, one that
> > is similar to that done in HistGradientBoosting, the other one that
> > amounts to imputation using a forest, and can be done in scikit-learn
> > by setting up the IterativeImputer to use forests as a base learner
> > (this will however be slow).
> >
> > Cheers,
> >
> > Gaël
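
For completeness, the imputation route that Gaël describes could be set up
along these lines (again a rough sketch, not tested; the choice of
RandomForestRegressor and its settings is only an illustration):

from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    # forest-based imputation, fitted on the training set
    IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                     max_iter=5),
    RandomForestClassifier(),
)
# Here the fitted imputer is also applied to the test set at predict time,
# which is exactly the difference Martin points out above.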
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/