<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 02/10/2018 at 16:46, Andreas Mueller
      wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com">


      Thank you for your feedback Alex!<br>

    </blockquote>

    Thanks for answering!<br>

    <br>

    <blockquote type="cite"

      cite="mid:42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com"> <br>

      <div class="moz-cite-prefix">On 10/02/2018 09:28 AM, Alex Garel

        wrote:<br>

      </div>

      <blockquote type="cite"

        cite="mid:cf272c39-96d3-9b11-fe8c-d931e0bd411a@garel.org">


        <br>

        <ul>

          <li>chunk processing (a way of handling streaming data): when
            dealing with a lot of data, the ability to call partial_fit and
            then use transform on chunks of data is a great help. But it's
            not well exposed in the current docs and API,</li>

        </ul>

      </blockquote>

      This has been discussed in the past, but it looks like no-one was

      excited enough about it to add it to the roadmap.<br>

      This would require quite some additions to the API. Olivier, who
      has been quite interested in this before, now seems<br>
      to be more interested in integration with dask, which might
      achieve the same thing.<br>

    </blockquote>

    <br>

    I've tried to use Dask on my side. I got quite far, but so far I
    haven't fully succeeded because, in my specific case, of memory
    issues: Dask's default schedulers do not dedicate processes to
    particular tasks, and I had some memory-hungry tasks, but I didn't
    get far enough to write my own scheduler. I might deal with that
    later, though (not by writing a scheduler but by sharing memory with
    mmap, in this case).<br>
    But yes, Dask is about the "chunks instead of real streaming"
    approach (which was my point).<br>
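    <br>
    To illustrate what I mean by chunk processing, here is a minimal
    sketch in plain Python. RunningScaler and chunks are made-up names for
    illustration, not scikit-learn code; only the partial_fit / transform
    naming follows the scikit-learn convention. It assumes the statistics
    can be updated one chunk at a time (here with Welford's online
    update):<br>
    <pre>
```python
import math


class RunningScaler:
    """Toy stand-in for an estimator exposing partial_fit / transform.

    Hypothetical class for illustration only; it is not scikit-learn code.
    """

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def partial_fit(self, chunk):
        # Welford's online update, one value at a time.
        for x in chunk:
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return self

    def transform(self, chunk):
        # Population standard deviation; fall back to 1.0 for constant data.
        std = math.sqrt(self.m2 / self.n) or 1.0
        return [(x - self.mean) / std for x in chunk]


def chunks(values, size):
    """Yield successive slices of `values` with at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

scaler = RunningScaler()
for chunk in chunks(data, 3):      # first pass: fit incrementally
    scaler.partial_fit(chunk)

scaled = []
for chunk in chunks(data, 3):      # second pass: transform chunk by chunk
    scaled.extend(scaler.transform(chunk))
```
</pre>
    With this toy data the running mean ends up at about 5.0 and the
    scaled values run from about -1.5 to 2.0, without ever holding more
    than one chunk in the fitting step.<br>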

    <br>

    <blockquote type="cite"

      cite="mid:42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com">

      <blockquote type="cite"

        cite="mid:cf272c39-96d3-9b11-fe8c-d931e0bd411a@garel.org">

        <ul>

          <li> and a lot of models do not support it, while they could.</li>

        </ul>

      </blockquote>

      Can you give examples of that? </blockquote>

    Hmm, maybe I spoke too fast! Grepping the code gives me at least some
    examples, and it's true that a DecisionTree does not lend itself to
    it naturally!<br>

    <br>

    <blockquote type="cite"

      cite="mid:42b1846e-0552-4d39-51dc-498cc24fd126@gmail.com">

      <blockquote type="cite"

        cite="mid:cf272c39-96d3-9b11-fe8c-d931e0bd411a@garel.org">

        <ul>

          <li>Also, pipeline does not support partial_fit and there is
            no fit_transform_partial.</li>

        </ul>

      </blockquote>

      What would you expect those to do? Each step in the pipeline might

      require passing over the whole dataset multiple times<br>

      before being able to transform anything. That basically makes it
      impossible for the current interface to work with the pipeline.<br>
      Even if only a single pass over the dataset were required, that
      wouldn't work with the current interface.<br>
      If we were handing around generators that allow looping over
      the whole data, that would work. But it would be unclear<br>
      how to support a streaming setting.<br>

    </blockquote>

    You're right, I didn't think hard enough about it!<br>
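    <br>
    For the record, here is a rough sketch of how I understand the
    "generators that loop over the whole data" idea. Everything here is
    hypothetical (ToyMeanCenterer, ToyMaxAbsScaler, fit_pipeline and
    make_chunks are made-up names, not scikit-learn API), and it assumes
    the data source can be restarted as many times as needed:<br>
    <pre>
```python
class ToyMeanCenterer:
    """Toy step that needs one full pass over the data before transforming."""

    def fit(self, make_chunks):
        total, count = 0.0, 0
        for chunk in make_chunks():
            total += sum(chunk)
            count += len(chunk)
        self.mean_ = total / count
        return self

    def transform(self, chunk):
        return [x - self.mean_ for x in chunk]


class ToyMaxAbsScaler:
    """Toy step that also needs a full pass (to find the largest magnitude)."""

    def fit(self, make_chunks):
        largest = max(abs(x) for chunk in make_chunks() for x in chunk)
        self.scale_ = largest or 1.0
        return self

    def transform(self, chunk):
        return [x / self.scale_ for x in chunk]


def fit_pipeline(steps, make_chunks):
    """Fit each step on the data as transformed by the steps before it.

    `make_chunks` is a zero-argument callable returning a fresh iterator of
    chunks, so every step can loop over the whole dataset again.
    """
    source = make_chunks
    for step in steps:
        step.fit(source)

        # Bind the current source and step so the next step sees transformed data.
        def transformed(prev=source, step=step):
            return (step.transform(chunk) for chunk in prev())

        source = transformed
    return source


def make_chunks():
    # Hypothetical restartable data source: two small chunks.
    yield [1.0, 2.0]
    yield [3.0, 6.0]


steps = [ToyMeanCenterer(), ToyMaxAbsScaler()]
transformed = fit_pipeline(steps, make_chunks)
result = [x for chunk in transformed() for x in chunk]
```
</pre>
    Here the first step learns a mean of 3.0, the second a scale of 3.0,
    and result is roughly [-0.667, -0.333, 0.0, 1.0]. Each additional
    step re-reads (and re-transforms) the whole source, and nothing in
    this shape says what to do with a stream that can only be read
    once.<br>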

    <br>

    BTW, I ran some tests using generators, making fit / transform
    build pipelines that I consumed later on (I tried both plain
    iterators and streamz). <br>
    It sort of worked, with a lot of hacks, but in my specific case
    performance was not good enough. (The real problem was not the
    framework's performance but my architecture: I realized that
    constantly re-generating the data instead of generating it once was
    not fast enough.)<br>
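    <br>
    A small sketch of that last point, under the assumption that the
    generated chunks fit in memory (otherwise they would have to be cached
    on disk). The names generate_chunks, mean_then_spread and stats are
    made up for the example:<br>
    <pre>
```python
stats = {"generations": 0}


def generate_chunks():
    """Hypothetical expensive source; counts how often it is restarted."""
    stats["generations"] += 1
    for start in range(0, 6, 2):
        yield [float(start), float(start + 1)]


def mean_then_spread(make_chunks):
    """Two full passes: the mean first, then the largest deviation from it."""
    total, count = 0.0, 0
    for chunk in make_chunks():
        total += sum(chunk)
        count += len(chunk)
    mean = total / count
    spread = max(abs(x - mean) for chunk in make_chunks() for x in chunk)
    return mean, spread


# Regenerating: every pass restarts the expensive source.
regenerated = mean_then_spread(generate_chunks)
passes_when_regenerating = stats["generations"]

# Caching: generate once, then replay the stored chunks for every pass.
stats["generations"] = 0
cached = list(generate_chunks())
replayed = mean_then_spread(lambda: iter(cached))
passes_when_cached = stats["generations"]
```
</pre>
    Both variants compute the same result, but the first one runs the
    source twice and the second only once; with more passes or a slower
    source the gap grows accordingly.<br>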

    <br>

    So in the end my points were not so good, but at least I learned
    something ;-)<br>

    <br>

    Thanks for your time.<br>

    <br>

    <br>

    <pre class="moz-signature" cols="72">-- 

Alexandre Garel

tel : +33 7 68 52 69 07 / +213 656 11 85 10

skype: alexgarel / ring: ba0435e11af36e32e9b4eb13c19c52fd75c7b4b0

</pre>

  </body>

</html>