[scikit-learn] [ANN] Scikit-learn 0.20.0

Andreas Mueller t3kcit at gmail.com
Tue Oct 2 11:46:01 EDT 2018

Thank you for your feedback, Alex!

On 10/02/2018 09:28 AM, Alex Garel wrote:
>   * chunk processing (kind of handling streaming data) :  when dealing
>     with lot of data, the ability to fit_partial, then use transform
>     on chunks of data is of good help. But it's not well exposed in
>     current doc and API,
This has been discussed in the past, but it looks like no one was
excited enough about it to add it to the roadmap.
This would require quite a few additions to the API. Olivier, who has
been quite interested in this before, now seems to be more interested
in integration with dask, which might achieve the same thing.
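
For reference, here is a minimal sketch of the chunked workflow described
above, using what already exists today. Note that scikit-learn spells the
method `partial_fit` (not `fit_partial`). The synthetic data, the chunk
size, and the `iter_chunks` helper are assumptions made for illustration
only; in practice the chunks would come from disk or a database.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a dataset too large to hold in memory.
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))


def iter_chunks(data, chunk_size=100):
    """Hypothetical helper: yield consecutive row blocks of `data`."""
    for start in range(0, data.shape[0], chunk_size):
        yield data[start:start + chunk_size]


# First pass: accumulate the running mean and variance chunk by chunk.
scaler = StandardScaler()
for X_chunk in iter_chunks(X):
    scaler.partial_fit(X_chunk)

# Second pass: transform each chunk with the statistics learned above.
X_scaled = np.vstack([scaler.transform(X_chunk) for X_chunk in iter_chunks(X)])
```

Even this simple case needs one full pass before any chunk can be
transformed, because the scaling statistics depend on all of the data.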
>   * and a lot of models do not support it, while they could.
Can you give examples of that?
>   * Also pipeline does not support fit_partial and there is not
>     fit_transform_partial.
What would you expect those to do? Each step in the pipeline might
require several passes over the whole dataset before it can transform
anything, which basically makes it impossible for the current interface
to work with the pipeline. Even if only a single pass over the dataset
were required, that wouldn't work with the current interface.
If we were handing around generators that allow looping over the whole
data, that would work. But it would be unclear how to support a
streaming setting.
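
To make the "handing around generators" idea concrete, here is a rough
sketch under stated assumptions. `fit_steps_on_chunks` and `make_chunks`
are hypothetical names and not part of scikit-learn; only `partial_fit`,
`transform` and `predict` are existing methods. The key point is that
`make_chunks` must be callable repeatedly, since every step needs its own
pass over the data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler


def fit_steps_on_chunks(transformers, estimator, make_chunks, classes):
    """Hypothetical helper: fit each step with one full pass over the data.

    `make_chunks` is a callable returning a fresh iterator of
    (X_chunk, y_chunk) pairs every time it is called.
    """
    fitted = []
    for transformer in transformers:
        # One pass per transformer; earlier steps are already fitted.
        for X_chunk, _ in make_chunks():
            for previous in fitted:
                X_chunk = previous.transform(X_chunk)
            transformer.partial_fit(X_chunk)
        fitted.append(transformer)
    # Final pass: fit the estimator on fully transformed chunks.
    for X_chunk, y_chunk in make_chunks():
        for previous in fitted:
            X_chunk = previous.transform(X_chunk)
        estimator.partial_fit(X_chunk, y_chunk, classes=classes)
    return fitted, estimator


# Synthetic data for illustration only.
rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(600, 3))
y = (X[:, 0] > 5.0).astype(int)


def make_chunks(chunk_size=100):
    """Generator function: each call restarts the loop over the data."""
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        yield X[start:stop], y[start:stop]


fitted, clf = fit_steps_on_chunks(
    [StandardScaler()], SGDClassifier(random_state=0), make_chunks,
    classes=np.unique(y),
)
predictions = clf.predict(fitted[0].transform(X[:4]))
```

This only works because the data can be replayed. In a true streaming
setting each chunk is seen once, so the second and later passes are not
available.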

>   * while handling "Passing around information that is not (X, y)", is
>     there any plan to have transform being able to transform X and y ?
>     This would ease lots of problems like subsampling, resampling or
>     masking data when too incomplete.
An API for subsampling is on the roadmap :)
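
Until then, subsampling has to happen outside the pipeline, before
`fit`, because `transform` only returns X. A small sketch using the
existing `sklearn.utils.resample` helper (the toy arrays and the sample
size of 50 are assumptions for illustration):

```python
import numpy as np
from sklearn.utils import resample

# Toy data: y mirrors the first column of X so the row alignment
# between X and y can be checked after subsampling.
X = np.arange(200).reshape(100, 2)
y = X[:, 0].copy()

# Draw 50 rows without replacement; X and y are indexed consistently.
X_sub, y_sub = resample(X, y, replace=False, n_samples=50, random_state=0)
```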
