[scikit-learn] [Scikit-learn-general] Estimator serialisability

Thu Jul 14 04:17:52 EDT 2016

This has been discussed numerous times. I suppose no one thinks supporting
only pickle is great, but a custom dict is unmaintainable. The best we've
got AFAIK (and, judging by its contributor activity at
<https://github.com/jpmml/jpmml-sklearn/graphs/contributors>, it looks like
it's getting better all the time) is a tool to convert one-way to PMML, which
is portable to production environments. See
https://github.com/jpmml/sklearn2pmml (Python interface) and
https://github.com/jpmml/jpmml-sklearn (command-line interface and guts of
the thing).

I hope that helps; and thanks to Villu Ruusmann: that list of supported
estimators is awesome!

PS: please write to the new list at scikit-learn at python.org

On 14 July 2016 at 17:24, Miroslav Zoričák <miroslav.zoricak at gmail.com>
wrote:

> Hi everybody,
>
> I have been using scikit-learn for a while, but I have run into a problem
> that does not seem to have any good solutions.
>
> Basically I would like to:
> - build my pipeline in a Jupyter Notebook
> - persist it (to json or hdf5)
> - load it in production and execute the prediction there
>
> The problem is that the recommended way to persist estimators such as the
> RobustScaler is to pickle them. I don't want to do this, for three
> reasons:
>
> - Security: pickle is potentially dangerous
> - Portability: I can't unpickle it in Scala, for example
> - Pickle stores a lot of details and information which is not strictly
> necessary to reconstruct the RobustScaler and therefore might prevent it
> from being reconstructed correctly if a different version is used.
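
To make the security point concrete: unpickling can call any callable the
payload names. Below is a minimal sketch using only the standard library. The
Payload class is hypothetical and deliberately harmless (it evaluates an
arithmetic expression); a hostile payload could name something like os.system
instead.

```python
import pickle


class Payload:
    """Hypothetical class that tells pickle to call a function on load."""

    def __reduce__(self):
        # pickle.loads() will call eval("6 * 7") and return the result
        # in place of a Payload instance.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# The callable named in the payload runs here, during loading,
# before the caller has any chance to inspect what came back.
restored = pickle.loads(blob)
print(type(restored).__name__, restored)
```

This is why a pickle file should only be loaded when its origin is trusted.
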
>
> Another option I seem to have is to access the private members of
> each estimator that I want to use and store them on my own, but this is
> inconvenient, because:
>
> - It forces me as a user to understand how the robust scaler works and how
> it stores its internal state, which is generally bad for usability
> - The internal implementation could change, leaving me to fix my
> serialisers (see #1)
> - I would need to do this for each new Estimator I decide to use
>
> Now, to me it seems the solution is quite obvious:
> Write a Mixin or update the BaseEstimator class to include two additional
> methods:
>
> to_dict() - returns a dictionary such that, when it is passed to
> from_dict(dictionary), the original object is reconstructed
>
> These dictionaries could be passed to the JSON module or the YAML module
> or stored elsewhere. We could provide more convenience methods to do this
> for the user.
>
> In case of the RobustScaler the dict would look something like:
> { "center": "0.0", "scale": "1.0"}
>
> Now the bulk of the work is writing these serialisers and deserialisers
> for all of the estimators, but that can be simplified by adding a method
> that could do that automatically via reflection and the estimator would
> only need to specify which fields to serialise.
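
For readers following the thread, the proposal above could be sketched roughly
as follows. This is an illustration under stated assumptions, not scikit-learn
API: DictSerialisableMixin, _serialised_fields and ToyRobustScaler are all
hypothetical names, and the toy scaler is a pure-Python stand-in rather than
the real RobustScaler.

```python
import json
import statistics


class DictSerialisableMixin:
    """Hypothetical mixin sketching the proposed to_dict()/from_dict() pair."""

    # Each estimator lists the attributes that make up its fitted state.
    _serialised_fields = ()

    def to_dict(self):
        # Reflection: read the listed attributes off the instance.
        return {name: getattr(self, name) for name in self._serialised_fields}

    @classmethod
    def from_dict(cls, state):
        unknown = set(state) - set(cls._serialised_fields)
        if unknown:
            raise ValueError(f"unexpected fields: {sorted(unknown)}")
        obj = cls()
        for name in cls._serialised_fields:
            setattr(obj, name, state[name])
        return obj


class ToyRobustScaler(DictSerialisableMixin):
    """Toy stand-in for illustration; not scikit-learn's RobustScaler."""

    _serialised_fields = ("center_", "scale_")

    def __init__(self):
        self.center_ = None
        self.scale_ = None

    def fit(self, values):
        # Centre on the median and scale by the interquartile range.
        quartiles = statistics.quantiles(values, n=4)
        self.center_ = statistics.median(values)
        self.scale_ = quartiles[2] - quartiles[0]
        return self

    def transform(self, values):
        return [(v - self.center_) / self.scale_ for v in values]


data = [1, 2, 3, 4, 100]
scaler = ToyRobustScaler().fit(data)

# Round-trip through JSON text, as the proposal suggests.
text = json.dumps(scaler.to_dict())
restored = ToyRobustScaler.from_dict(json.loads(text))
print(text)
```

The toy case is easy because its state is two plain numbers. Real estimators
hold NumPy arrays, which the json module does not serialise directly, and
pipelines nest other estimators, so each of those would need its own handling.
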
>
> I am happy to start working on this and create a pull request on GitHub,
> but before I do that I wanted to get some initial thoughts and reactions
> from the community, so please let me know what you think.
>
> Best Regards,
> Miroslav Zoricak