[scikit-learn] Outliers removal

Alex Garel alex at garel.org
Wed Apr 4 07:33:29 EDT 2018


First, thanks for the fantastic scikit-learn library.

I have the following use case: for a classification problem, I have a
list of sentences and use word2vec plus a pooling method (e.g. mean, weighted
mean, or attention followed by mean) to transform sentences into vectors.
Because my dataset is very noisy, I may end up with sentences full of words
that are not in the word2vec vocabulary, so I can't vectorize them.
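
To make the problem concrete, here is a minimal sketch of mean pooling. The
EMBEDDINGS dict is a hypothetical stand-in for a trained word2vec model, and
mean_vector is a name made up for this example, not part of any library.

```python
import numpy as np

# Hypothetical toy embedding table standing in for a trained word2vec model.
EMBEDDINGS = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
    "movie": np.array([0.5, 0.5]),
}


def mean_vector(sentence, embeddings=EMBEDDINGS):
    """Return the mean of the known word vectors, or None if no word is known."""
    known = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not known:
        # Nothing to average: this sentence cannot be vectorized.
        return None
    return np.mean(known, axis=0)


sentences = ["good movie", "xyzzy qwerty"]
vectors = [mean_vector(s) for s in sentences]
```

The second sentence has no known word, so there is no vector to put in X for
it, while its label is still present in y.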

I would like to remove those sentences from my dataset X, but this would
also mean removing the corresponding target classes in y. AFAIK,
scikit-learn does not support this. I've seen a couple of
issues about it, but they all seem stalled:

I would like to be able to search for hyper-parameters in a simple way,
so I really would like to be able to use a single pipeline taking text
as input.

My current conclusion is this:

  * vectorizer should return None for bad samples (or a specific vector,
    like numpy.zeros, or add an extra column marking valid/invalid samples)
  * make all my transformers down the pipeline accept those entries
    and leave them untouched (can be done with a generic wrapper class)
  * have a wrapper around my classifier, to avoid fitting on those, like
    jnothman suggested here
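
The three steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions: bad samples are marked as all-NaN rows
(one possible "specific vector"), and SkipInvalidTransformer,
SkipInvalidClassifier, valid_rows and the fallback parameter are hypothetical
names written for this example, not scikit-learn APIs. It also assumes every
batch contains at least one valid row.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def valid_rows(X):
    """Boolean mask of rows without NaN (assumption: NaN rows mark bad samples)."""
    return ~np.isnan(X).any(axis=1)


class SkipInvalidTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: fit/transform valid rows, leave NaN rows as NaN."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mask = valid_rows(X)
        y_valid = None if y is None else np.asarray(y)[mask]
        self.transformer_ = clone(self.transformer).fit(X[mask], y_valid)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        mask = valid_rows(X)
        Xt = self.transformer_.transform(X[mask])
        # Same number of rows as the input, so X and y stay aligned.
        out = np.full((X.shape[0], Xt.shape[1]), np.nan)
        out[mask] = Xt
        return out


class SkipInvalidClassifier(BaseEstimator):
    """Hypothetical wrapper: fit on valid rows only, predict `fallback` for the rest."""

    def __init__(self, classifier, fallback=None):
        self.classifier = classifier
        self.fallback = fallback

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        mask = valid_rows(X)
        self.classifier_ = clone(self.classifier).fit(X[mask], np.asarray(y)[mask])
        self.classes_ = self.classifier_.classes_
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        mask = valid_rows(X)
        out = np.full(X.shape[0], self.fallback, dtype=object)
        if mask.any():
            out[mask] = self.classifier_.predict(X[mask])
        return out


# Toy data: the third sample could not be vectorized and is marked with NaN.
X = np.array([[0.0, 0.1], [0.2, 0.0], [np.nan, np.nan], [1.0, 0.9], [0.9, 1.1]])
y = np.array([0, 0, 1, 1, 1])

pipe = Pipeline([
    ("scale", SkipInvalidTransformer(StandardScaler())),
    ("clf", SkipInvalidClassifier(LogisticRegression())),
])
pipe.fit(X, y)
pred = pipe.predict(X)  # the entry for the NaN row is the fallback (None here)
```

Because both wrappers derive from BaseEstimator, the wrapped estimators'
parameters should be reachable for a hyper-parameter search under nested names
such as clf__classifier__C. A scorer used for that search would still have to
decide how to treat the fallback predictions.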

It's a bit tedious, but I can see it working.

Does anyone have a better suggestion?
