<html><head></head><body text="#000000" bgcolor="#FFFFFF" lang="en-GB" style="background-color: rgb(255, 255, 255); line-height: initial;">                                                                                      <div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);">You might use the new FunctionSampler from imblearn, which takes your heuristic as input and does the sampling for you. </div><div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);"><br></div><div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);">http://contrib.scikit-learn.org/imbalanced-learn/dev/auto_examples/plot_outlier_rejections.html#sphx-glr-auto-examples-plot-outlier-rejections-py</div><div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);"><br></div><div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);">It is compatible with the imblearn pipeline (basically, the pipeline handles samplers: it applies them at fit time and does nothing at predict time). 
</div><div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);"><br></div><div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);">Would it help?</div>                                                                                                                                     <div style="width: 100%; font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);"><br></div>                                                                                                                                                                                                   <div style="font-size: initial; font-family: Calibri, 'Slate Pro', sans-serif, sans-serif; color: rgb(31, 73, 125); text-align: initial; background-color: rgb(255, 255, 255);">Guillaume Lemaitre <br>INRIA Saclay Ile-de-France / Equipe PARIETAL<br>guillaume.lemaitre@inria.fr - https://glemaitre.github.io/</div>                                                                                                                                                                                  <table width="100%" style="background-color:white;border-spacing:0px;"> <tbody><tr><td colspan="2" style="font-size: initial; text-align: initial; background-color: rgb(255, 255, 255);">                           <div style="border-style: solid none none; border-top-color: rgb(181, 196, 223); border-top-width: 1pt; padding: 3pt 0in 0in; font-family: Tahoma, 'BB Alpha Sans', 'Slate Pro'; font-size: 10pt;">  <div><b>From: </b>Alex Garel</div><div><b>Sent: </b>Wednesday, 4 April 2018 13:35</div><div><b>To: </b>scikit-learn@python.org</div><div><b>Reply To: 
</b>Scikit-learn mailing list</div><div><b>Subject: </b>[scikit-learn] Outliers removal</div></div></td></tr></tbody></table><div style="border-style: solid none none; border-top-color: rgb(186, 188, 209); border-top-width: 1pt; font-size: initial; text-align: initial; background-color: rgb(255, 255, 255);"></div><br><div id="_originalContent" style="background-color: rgb(255, 255, 255);">

  

  
    <p>Hello,</p>

    <p>First, thanks for the fantastic scikit-learn library.<br>

    </p>

    <p>I have the following use case: for a classification problem, I

      have a list of sentences and use word2vec and a method (e.g. mean,

      weighted mean, or attention and mean) to transform sentences to

      vectors. Because my dataset is very noisy, I may end up with

      sentences full of words that are not part of word2vec, hence I

      can't vectorize them.</p>

    <p>I would like to remove those sentences from my dataset X, but

      this would also mean removing the corresponding target classes in

      y. AFAIK, scikit-learn does not implement this possibility. I've

      seen a couple of issues about that, but they all seem stalled:

      <a class="moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/issues/9630">https://github.com/scikit-learn/scikit-learn/issues/9630</a>,

      <a class="moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/issues/3855">https://github.com/scikit-learn/scikit-learn/issues/3855</a>,

      <a class="moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/pull/4552">https://github.com/scikit-learn/scikit-learn/pull/4552</a>,

      <a class="moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/issues/4143">https://github.com/scikit-learn/scikit-learn/issues/4143</a></p>

    <p>I would like to be able to search for hyper-parameters in a

      simple way, so I really would like to be able to use a single

      pipeline taking text as input.</p>

    <p>My current conclusion is this:</p>

    <ul>

      <li>vectorizer should return None for bad samples (or a specific

        vector, like numpy.zeros, or add an extra column marking

        valid/invalid samples)<br>

      </li>

      <li>make all my transformers down the pipeline accept those

        entries and leave them untouched (can be done with a generic

        wrapper class)<br>

      </li>

      <li>have a wrapper around my classifier, to avoid fitting on

        those, like jnothman suggested here

<a class="moz-txt-link-freetext" href="https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441">https://github.com/scikit-learn/scikit-learn/issues/9630#issuecomment-325202441</a></li>

    </ul>
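    <p>The third point above could be sketched roughly as follows. This is only an illustration under stated assumptions: invalid samples are marked as all-zero rows (one of the options in the first point), and the class name SkipInvalidClassifier and its default_label parameter are hypothetical, not part of scikit-learn.</p>

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.linear_model import LogisticRegression


class SkipInvalidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: fit on valid rows only, use a fallback label otherwise."""

    def __init__(self, estimator, default_label=None):
        self.estimator = estimator
        self.default_label = default_label

    @staticmethod
    def _valid_mask(X):
        # Assumption: an all-zero row marks a sentence that could not be vectorized.
        return np.any(X != 0, axis=1)

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        keep = self._valid_mask(X)
        # Fit a fresh copy of the wrapped estimator on the valid rows only.
        self.estimator_ = clone(self.estimator).fit(X[keep], y[keep])
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        X = np.asarray(X)
        keep = self._valid_mask(X)
        fallback = self.classes_[0] if self.default_label is None else self.default_label
        pred = np.full(X.shape[0], fallback, dtype=self.classes_.dtype)
        if keep.any():
            pred[keep] = self.estimator_.predict(X[keep])
        return pred


# Toy data: rows 1 and 4 are "invalid" (all zeros).
X = np.array([[1.0, 0.2], [0.0, 0.0], [0.9, 0.1],
              [-1.0, -0.3], [0.0, 0.0], [-0.8, -0.2]])
y = np.array([1, 0, 1, 0, 1, 0])

clf = SkipInvalidClassifier(LogisticRegression(), default_label=-1).fit(X, y)
pred = clf.predict(X)
```

    <p>Because the wrapper follows the usual estimator conventions, it can serve as the last step of a regular scikit-learn pipeline, and the wrapped estimator's hyper-parameters stay reachable through the estimator__ prefix in a parameter search.</p>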

    <p>It's a bit tedious, but I can see it working.</p>

    <p>Is there a better suggestion?</p>

  
<br><!--end of _originalContent --></div></body></html>