[scikit-learn] Outliers removal

Wed Apr 4 07:33:29 EDT 2018

Hello,

First, thanks for the fantastic scikit-learn library.

I have the following use case: for a classification problem, I have a
list of sentences and use word2vec plus a pooling method (e.g. mean,
weighted mean, or attention and mean) to transform sentences to vectors.
Because my dataset is very noisy, I may end up with sentences made up
entirely of words that are not in the word2vec vocabulary, so I can't
vectorize them.
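
To make the failure case concrete, here is a minimal sketch of mean pooling
over word vectors. The EMBEDDINGS table and the mean_vector helper are
hypothetical stand-ins for a trained word2vec model, and returning a NaN
vector for an unvectorizable sentence is just one possible convention,
assumed here because it keeps the output a plain float array.

```python
import numpy as np

# Toy lookup table standing in for a trained word2vec model (made-up values).
EMBEDDINGS = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
}


def mean_vector(sentence, embeddings=EMBEDDINGS, dim=2):
    """Average the vectors of the known words in a sentence.

    Returns a vector of NaN when no word is in the vocabulary, so the
    sample can be recognized as invalid further down the line.
    """
    known = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not known:
        return np.full(dim, np.nan)
    return np.mean(known, axis=0)


X = np.vstack([mean_vector(s) for s in ["good bad", "xyzzy qwerty"]])
invalid = np.isnan(X).any(axis=1)  # boolean mask of unvectorizable sentences
```

The second sentence contains no known word, so its row is all NaN and is
flagged in the mask.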

I would like to remove those sentences from my dataset X, but this would
also mean removing the corresponding target classes in y. AFAIK,
scikit-learn pipelines do not support this. I've seen a couple of
issues about it, but they all seem stalled:
https://github.com/scikit-learn/scikit-learn/issues/9630,
https://github.com/scikit-learn/scikit-learn/issues/3855,
https://github.com/scikit-learn/scikit-learn/pull/4552,
https://github.com/scikit-learn/scikit-learn/issues/4143

I would like to be able to search for hyper-parameters in a simple way,
so I really would like to be able to use a single pipeline taking text
as input.

My current conclusion is this:

  * vectorizer should return None for bad samples (or a specific vector,
    like numpy.zeros, or add an extra column marking valid/invalid samples)
  * make all my transformers down the pipeline accept those entries
    and leave them untouched (can be done with a generic wrapper class)
  * have a wrapper around my classifier, to avoid fitting on those, like
    jnothman suggested here
    https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441
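
The last step of that plan could look roughly like the sketch below. It is
not the code from the linked comment: SkipInvalidClassifier and its fallback
parameter are hypothetical names, and it assumes invalid samples are marked
as rows containing NaN. Invalid rows are skipped in fit, and in predict they
receive a fallback label instead of being passed to the wrapped estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class SkipInvalidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper that hides NaN-marked rows from an estimator."""

    def __init__(self, estimator=None, fallback=None):
        self.estimator = estimator
        self.fallback = fallback  # label for invalid rows; should be a known class

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        valid = ~np.isnan(X).any(axis=1)
        base = LogisticRegression() if self.estimator is None else self.estimator
        # Fit a fresh copy on the valid rows only; X and y stay aligned.
        self.estimator_ = clone(base).fit(X[valid], y[valid])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        valid = ~np.isnan(X).any(axis=1)
        fallback = self.classes_[0] if self.fallback is None else self.fallback
        out = np.full(X.shape[0], fallback, dtype=self.classes_.dtype)
        if valid.any():
            out[valid] = self.estimator_.predict(X[valid])
        return out


nan = np.nan
X_train = [[0, 0], [0, 1], [nan, nan], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 1, 1, 1, 1]
clf = SkipInvalidClassifier(LogisticRegression(), fallback=1).fit(X_train, y_train)
pred = clf.predict([[nan, nan], [0, 0], [6, 6]])
```

Because the wrapped estimator is a regular constructor parameter, its
hyper-parameters should stay reachable in a grid search through names such
as estimator__C. Which fallback label makes sense for invalid samples
depends on the application.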

It's a bit tedious, but I can see it working.

Is there a better approach?
