[scikit-learn] Outliers removal
Alex Garel
alex at garel.org
Wed Apr 4 07:33:29 EDT 2018
Hello,
First, thanks for the fantastic scikit-learn library.
I have the following use case: For a classification problem, I have a
list of sentences and use word2vec and a method (e.g. mean, or weighted
mean, or attention and mean) to transform sentences to vectors. Because
my dataset is very noisy, I may end up with sentences made up entirely
of words that are not part of word2vec, hence I can't vectorize them.
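
To make the failure case concrete, here is a minimal sketch of the plain
"mean" variant. The TOY_VECTORS table and the mean_vector helper are
hypothetical stand-ins for a trained word2vec model, not part of any
library:

```python
import numpy as np

# Hypothetical toy embedding table standing in for a trained word2vec model.
TOY_VECTORS = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
}

def mean_vector(sentence, vectors=TOY_VECTORS):
    """Average the vectors of the known words; return None if none is known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return None  # no word is in the vocabulary: cannot vectorize
    return np.mean(known, axis=0)

sentences = ["good good bad", "xyzzy qwerty"]
vectorized = [mean_vector(s) for s in sentences]
```

The second sentence contains no known word, so it comes back as None
and has no usable row in X.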
I would like to remove those sentences from my dataset X, but this would
also mean removing the corresponding target classes in y. AFAIK,
scikit-learn does not implement this possibility. I've seen a couple of
issues about that, but they all seem stalled:
https://github.com/scikit-learn/scikit-learn/issues/9630,
https://github.com/scikit-learn/scikit-learn/issues/3855,
https://github.com/scikit-learn/scikit-learn/pull/4552,
https://github.com/scikit-learn/scikit-learn/issues/4143
I would like to be able to search for hyper-parameters in a simple way,
so I really would like to be able to use a single pipeline taking text
as input.
My current conclusion is this:
* vectorizer should return None for bad samples (or a specific vector,
like numpy.zeros, or add an extra column marking valid/invalid samples)
* make all my transformers down the pipeline accept those entries
and leave them untouched (can be done with a generic wrapper class)
* have a wrapper around my classifier, to avoid fitting on those, like
jnothman suggested here
https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441
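
The last point could be sketched roughly as follows. This is only an
illustration under my own assumptions, not the code from the linked
comment: invalid samples are assumed to be marked as all-NaN rows, and
the SkipInvalidClassifier name and its fallback parameter are
hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression

class SkipInvalidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: skips all-NaN rows when fitting and
    returns a fallback label for them when predicting."""

    def __init__(self, estimator=None, fallback=-1):
        self.estimator = estimator
        self.fallback = fallback  # must be compatible with the label dtype

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        valid = ~np.isnan(X).all(axis=1)  # rows the vectorizer could fill
        base = self.estimator if self.estimator is not None else LogisticRegression()
        # X and y are filtered with the same mask, so they stay aligned.
        self.estimator_ = clone(base).fit(X[valid], y[valid])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        valid = ~np.isnan(X).all(axis=1)
        out = np.full(X.shape[0], self.fallback, dtype=self.classes_.dtype)
        if valid.any():
            out[valid] = self.estimator_.predict(X[valid])
        return out

# Toy data: the third row stands for a sentence that could not be vectorized.
X = np.array([[0.0, 0.1], [0.2, 0.0], [np.nan, np.nan], [1.0, 0.9], [0.9, 1.0]])
y = np.array([0, 0, 1, 1, 1])
clf = SkipInvalidClassifier().fit(X, y)
pred = clf.predict(X)
```

Because the wrapper exposes the wrapped classifier as a regular
parameter, it should still be searchable through names such as
estimator__C when used as the final pipeline step.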
It's a bit tedious, but I can see it working.
Are there any better suggestions?