[scikit-learn] [ANN] Scikit-learn 0.20.0

Alex Garel alex at garel.org
Wed Oct 3 04:18:47 EDT 2018


On 02/10/2018 at 16:46, Andreas Mueller wrote:
> Thank you for your feedback Alex!
Thanks for answering!

>
> On 10/02/2018 09:28 AM, Alex Garel wrote:
>>
>>   * chunk processing (kind of handling streaming data): when
>>     dealing with a lot of data, the ability to partial_fit, then
>>     use transform on chunks of data, is a great help. But it's not
>>     well exposed in the current docs and API,
>>
> This has been discussed in the past, but it looks like no one was
> excited enough about it to add it to the roadmap.
> This would require quite a few additions to the API. Olivier, who
> has been quite interested in this before, now seems to be more
> interested in integration with Dask, which might achieve the same
> thing.

I've tried to use Dask on my side, and although I got quite far, I
didn't completely succeed, because of memory issues (in my specific
case): Dask's default schedulers do not specialize processes on
tasks, and I had some memory-consuming tasks, but I didn't get far
enough to write my own scheduler. I might deal with that later,
though (not by writing a scheduler, but by sharing memory with mmap
in this case).
But yes, Dask is about the "chunk instead of really streaming"
approach (which was my point).
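
To make my point concrete, here is a minimal sketch of the kind of
chunked workflow I mean, with plain scikit-learn and no Dask
(iter_chunks is just a toy stand-in for a real out-of-core source):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def iter_chunks(n_chunks=10, chunk_size=1000, n_features=20):
        # toy stand-in for a real streaming / out-of-core source
        rng = np.random.RandomState(0)
        for _ in range(n_chunks):
            X = rng.randn(chunk_size, n_features)
            y = (X[:, 0] > 0).astype(int)
            yield X, y

    clf = SGDClassifier()  # one of the estimators with partial_fit
    for X_chunk, y_chunk in iter_chunks():
        # classes must be passed, since no single chunk is
        # guaranteed to contain every class
        clf.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))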

>>   * and a lot of models do not support it, even though they could.
>>
> Can you give examples of that? 
Hmm, maybe I spoke too fast! Grepping the code gives me some examples
at least, and it's true that a DecisionTree does not support it
naturally!
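
For the record, a quicker check than grepping: estimators that
support incremental learning simply expose a partial_fit method, so
you can just test for it:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.naive_bayes import MultinomialNB

    # partial_fit is the marker of incremental-learning support
    for est in (DecisionTreeClassifier(), SGDClassifier(),
                MultinomialNB()):
        print(type(est).__name__, hasattr(est, "partial_fit"))
    # -> DecisionTreeClassifier False
    # -> SGDClassifier True
    # -> MultinomialNB True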

>>   * Also, Pipeline does not support partial_fit, and there is no
>>     partial_fit_transform.
>>
> What would you expect those to do? Each step in the pipeline might
> require passing over the whole dataset multiple times
> before being able to transform anything. That basically makes it
> impossible for the current interface to work with the pipeline.
> Even if only a single pass over the dataset was required, that
> wouldn't work with the current interface.
> If we were handing around generators that allow looping over the
> whole data, that would work. But it would be unclear
> how to support a streaming setting.
You're right, I didn't think hard enough about it!
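
For what it's worth, I guess the usual workaround is to keep the
transformers stateless, so that only the final estimator needs
fitting; something like this (iter_text_chunks is a made-up toy
source here):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    def iter_text_chunks():
        # made-up stand-in for a real chunked text source
        yield ["good stuff", "more good stuff"], [1, 1]
        yield ["bad stuff", "more bad stuff"], [0, 0]

    vectorizer = HashingVectorizer()  # stateless: no fitting pass
    clf = SGDClassifier()

    for texts_chunk, y_chunk in iter_text_chunks():
        X_chunk = vectorizer.transform(texts_chunk)
        clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])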

BTW, I made some tests using generators, having fit / transform build
pipelines that I consumed later on (I tried with plain iterators and
with streamz).
It did work, somehow, with many hacks, but in my specific case
performance was not good enough. (The real problem was not framework
performance, but my architecture: I realized that constantly
re-generating the data instead of doing it once was not fast enough.)
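
Roughly, the generator chaining looked like this (much simplified);
each stage consumes chunks lazily, so nothing runs until you iterate,
and rebuilding the chain re-generates the data from scratch each
time, which is what killed performance for me:

    def transform_stage(chunks, transformer):
        # lazily apply an already-fitted transformer chunk by chunk
        for X in chunks:
            yield transformer.transform(X)

    def predict_stage(chunks, estimator):
        # lazily predict chunk by chunk
        for X in chunks:
            yield estimator.predict(X)

    # e.g. with the vectorizer / clf from the sketch above:
    # preds = predict_stage(
    #     transform_stage(raw_text_chunks, vectorizer), clf)
    # for y_pred in preds:
    #     ...  # rebuilding + re-consuming the chain re-generates
    #          # the data upstream every time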

So in the end my points were not so good, but at least I learned
something ;-)

Thanks for your time.


-- 
Alexandre Garel
tel : +33 7 68 52 69 07 / +213 656 11 85 10
skype: alexgarel / ring: ba0435e11af36e32e9b4eb13c19c52fd75c7b4b0
